map reduce for Cassandra
Hi, I need to run a map/reduce job over identity data stored in Cassandra before indexing this data into Elasticsearch. I have already used ColumnFamilyInputFormat (before starting to use CQL) to write Hadoop jobs for that, but I used to have a lot of trouble with tuning, as Hadoop depends on how map tasks are split in order to successfully execute things in parallel for IO-bound processes.

First question: am I the only one having problems with that? Is anyone else running Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw that the new version of Spark will have Cassandra support, but through CqlPagingInputFormat, from the Hadoop integration. I tried to use Hive with Cassandra Community, but it seems it only works with Cassandra Enterprise and doesn't do more than FB Presto (http://prestodb.io/), which we have been using to read from Cassandra; so far it has been great for SQL-like queries. For custom map/reduce jobs, however, it is not enough. Does anyone know some other tool that performs M/R on Cassandra? My impression is that most tools were created to work on top of HDFS, and reading from a NoSQL DB is some kind of workaround.

Third question is about how these tools work. Most of them write mapped data to intermediate storage, then the data is shuffled and sorted, then it is reduced. Even when using CqlPagingInputFormat, if you are using Hadoop it will write files to HDFS after the mapping phase, shuffle and sort that data, and then reduce it. I wonder if a tool supporting Cassandra out of the box wouldn't be smarter. Is it faster to write all your data to a file and then sort it, or to batch-insert the data so it is already indexed, as happens when you store data in a Cassandra CF? I didn't do the math on the complexity of each approach (it would have to account for the fact that no single index in Cassandra can be really large, as the maximum index size will always depend on the capacity of a single host), but my guess is that a map/reduce tool written specifically for Cassandra, from the beginning, could perform much better than a tool written for HDFS and adapted. I hear people saying map/reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really make sense? Should we expect a result like this?

Final question: do you think writing a new M/R tool as described would be reinventing the wheel, or does it make sense?

Thanks in advance. Any opinions on this subject will be greatly appreciated.

Best regards,
Marcelo Valle.
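For illustration, a minimal sketch of the kind of Hadoop job configuration being discussed, assuming the Cassandra 1.2/2.0-era Hadoop classes and the Hadoop 2 Job API; the contact address, keyspace ("identity") and column family ("users") are placeholders, and the exact ConfigHelper/CqlConfigHelper method names may vary slightly between Cassandra versions. The split size (rows per split) is the main knob that decides how many map tasks are created, which is exactly the tuning pain mentioned above.

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraMrJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "identity-data-indexer"); // hypothetical job name
        job.setJarByClass(CassandraMrJob.class);

        // Any live node works as the initial contact point.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "10.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        // Keyspace / column family to read (placeholders).
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "identity", "users");

        // Tuning knobs: rows per input split (one split = one map task) and the
        // CQL page size per request. Too few splits starves parallelism on
        // IO-bound jobs; too many adds task-scheduling overhead.
        ConfigHelper.setInputSplitSize(job.getConfiguration(), 64 * 1024);
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

        job.setInputFormatClass(CqlPagingInputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); output config omitted.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}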
Re: map reduce for Cassandra
Hey Marcelo, You should check out Spark. It intelligently deals with a lot of the issues you're mentioning. Al Tobey did a walkthrough of how to set up the OSS side of things here: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html It'll be less work than writing an M/R framework from scratch :)

Jon

-- Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: map reduce for Cassandra
Hi Jonathan, Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra will only be supported in the next version, but I would like to be wrong... Best regards, Marcelo Valle.
Re: map reduce for Cassandra
I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend to get deep into it.
Re: map reduce for Cassandra
Jonathan, From what I have read in the docs, the Python API still has some limitations; it is not possible to use arbitrary Hadoop binary input formats yet. The Python example for Cassandra is only in the master branch: https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py

I may be lacking knowledge of Spark, but if I understood it correctly, access to Cassandra data is still made through CqlPagingInputFormat, from the Hadoop integration. Here is where I ask: even if Spark supports Cassandra, will it be fast enough? My understanding (please correct me if I am wrong) is that when you insert N items into a Cassandra CF, you are executing N binary searches to insert each item already indexed by its key. When you read the data back, it's already sorted. So you take O(N log N) (binary-search complexity) to insert all the data already sorted. However, using a fast sort algorithm you also take O(N log N) to sort the data after it has been written, but with more IO.

If I write a job in Spark / Java against Cassandra, how will the mapped data be stored and sorted? Will it be stored in Cassandra too? Will Spark run a sort after the mapping?

Best regards, Marcelo.
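To make the Spark question concrete: Spark can reuse the same Hadoop input format through newAPIHadoopRDD, so reads do go through CqlPagingInputFormat, but any shuffle after the map runs through Spark's own machinery (memory plus local shuffle files) rather than HDFS, and a sort only happens if the job asks for one. A rough Java sketch, assuming the same 1.2-era Cassandra classes as in the earlier sketch; the keyspace, table and address are placeholders, and records arrive as Map<String, ByteBuffer> pairs:

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCassandraRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cassandra-read-sketch"));

        // Same Hadoop-side configuration a plain MapReduce job would use.
        Configuration conf = new Configuration();
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "identity", "users"); // placeholders
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");

        // Each record is (partition key columns -> values, regular columns -> values).
        @SuppressWarnings("unchecked")
        Class<Map<String, ByteBuffer>> mapClass =
                (Class<Map<String, ByteBuffer>>) (Class<?>) Map.class;
        JavaPairRDD<Map<String, ByteBuffer>, Map<String, ByteBuffer>> rows =
                sc.newAPIHadoopRDD(conf, CqlPagingInputFormat.class, mapClass, mapClass);

        // Transformations such as reduceByKey shuffle through Spark's local
        // shuffle files rather than HDFS; no global sort is imposed on you.
        System.out.println("rows read: " + rows.count());
        sc.stop();
    }
}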
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 10:54 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: My understanding (please correct me if I am wrong) is that when you insert N items into a Cassandra CF, you are executing N binary searches to insert each item already indexed by its key. When you read the data back, it's already sorted. So you take O(N log N) (binary-search complexity) to insert all the data already sorted.

You're wrong, unless you're talking about insertion into a memtable, which you probably aren't, and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable data files, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make sense in light of this?

=Rob
Re: map reduce for Cassandra
Hi Robert, First of all, thanks for answering.

2014-07-21 20:18 GMT-03:00 Robert Coli rc...@eventbrite.com: You're wrong, unless you're talking about insertion into a memtable, which you probably aren't and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable datafiles, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make any sense in light of this?

Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert into Cassandra against the time to sort in Hadoop.

To make it clearer: Hadoop has its own partitioner, which is used after the map phase. The map output is written locally on each Hadoop node, then it's shuffled from one node to the other (see slide 17 in this presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In other words, you may read Cassandra data in Hadoop, but the intermediate results are still stored in HDFS. Instead of using the Hadoop partitioner, I would like to store the intermediate results in a Cassandra CF, so the map output would go directly to an intermediate column family via batch inserts, instead of being written to local disk first and then shuffled to the right node. Therefore, the mapper would write its output the same way all data enters Cassandra: first into a memtable, then flushed to an sstable, then read during the reduce phase. Shouldn't that be faster than storing intermediate results in HDFS?

Best regards, Marcelo.
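To illustrate that idea (this is not an existing Hadoop OutputFormat, only a sketch of the write path): a mapper could push its output straight into an intermediate table with the DataStax Java driver (2.x-era API), so the "shuffle" becomes ordinary Cassandra writes and the reduce side reads each key's values back already clustered and sorted. The keyspace, table, columns and batch size below are all hypothetical, and in practice batches should be kept small and ideally grouped by partition key to avoid coordinator overhead.

import java.nio.ByteBuffer;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class IntermediateCfWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("mr_scratch"); // hypothetical keyspace

        // Hypothetical intermediate table:
        //   CREATE TABLE map_output (map_key text, seq timeuuid, value blob,
        //                            PRIMARY KEY (map_key, seq));
        // Rows land in the memtable already ordered by (map_key, seq), so the
        // reduce phase can read each map_key back as one sorted slice.
        PreparedStatement insert = session.prepare(
                "INSERT INTO map_output (map_key, seq, value) VALUES (?, now(), ?)");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 100; i++) {                // stand-in for map() emits
            String mapKey = "key-" + (i % 10);
            ByteBuffer mappedValue = ByteBuffer.wrap(("value-" + i).getBytes());
            batch.add(insert.bind(mapKey, mappedValue));
            if (batch.size() >= 50) {                  // flush in small batches
                session.execute(batch);
                batch.clear();
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);
        }
        cluster.close();
    }
}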
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 5:45 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert into Cassandra against the time to sort in Hadoop.

I was confused, because unless you are using the new in-memory column families, which I believe are only available in DSE, there is no way to ensure that any given row stays in a memtable. Very rarely does anyone look at the function of a memtable caring only about its properties and not the closely related properties of SSTables. However, yours is one of those cases; I see now why your question makes sense: you only care about the memtable for how quickly it sorts.

But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra? I'm certainly not an expert in this area of Cassandra... but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets... are you planning to use DSE?

=Rob
Re: map reduce for Cassandra
Hi,

But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra?

Actually, it's not a reason to use Cassandra. I already use Cassandra and I need to map/reduce data from it. I am trying to decide between using the conventional M/R tools and building a tool specific to Cassandra.

but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets...

Indeed, but so far I am seeing it as the best option. If storing these intermediate files in HDFS is better, then I agree there is no reason to consider Cassandra for it.

are you planning to use DSE?

Our company will probably hire DSE support when it reaches some size, but DSE as a product doesn't seem interesting for our case so far. The only tool that would help me at this moment would be Hive, but honestly I didn't like the way DSE supports Hive, and I don't want to use a solution not available in DSC (see http://stackoverflow.com/questions/23959169/problems-using-hive-cassandra-community for details).

[]s
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig/hadoop etc work with Cassandra 1.1.8, or only with 1.2? We are using the astyanax library, which seems to fail horribly on 1.2, so we're still on 1.1.8. But we're just starting out with this, and I'm still debating between Cassandra and HBase. So I just want to know if there is a limitation here or not, as I have no idea when 1.2 support will exist in astyanax. That said, are there other Java (Scala) libraries that people use to connect to Cassandra that support 1.2? -James-
Re: Pig / Map Reduce on Cassandra
Ok, I understand that I need to manage both Cassandra and Hadoop components, and that Pig will use the Hadoop components to launch its tasks, which will use Cassandra as the storage engine. Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Ok, forget it. It was a mix of mistakes: environment variables not set, a package name not added in the script, and libraries not found. Regards -- Cyril SCETBON

On Mar 12, 2013, at 10:43 AM, cscetbon@orange.com wrote: I'm already using Cassandra 1.2.2 with only one line to test the Cassandra access: rows = LOAD 'cassandra://twissandra/users' USING org.apache.cassandra.hadoop.pig.CassandraStorage(); extracted from the sample script provided in the sources -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Jimmy, I understand that CFS can replace HDFS for those who use Hadoop. I just want to use Pig and Hive on Cassandra. I know that Pig samples are provided and now work with Cassandra natively (they are part of the core). However, does it mean that the processing will be spread over the nodes, with number_of_mappers = number_of_nodes or something like that? Can Hive connect to Cassandra 1.2 easily too? -- Cyril Scetbon
Re: Pig / Map Reduce on Cassandra
On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote: This really depends on how you design your Hadoop cluster. The testing I have done had Hadoop and Cassandra nodes collocated on the same hosts. Remember that Pig code runs inside your Hadoop cluster and connects to Cassandra as the database engine. I have not done any testing with Hive, so someone else will have to answer that question.
Re: Pig / Map Reduce on Cassandra
You said all versions. However, when I try to access cassandra://twissandra/users based on http://www.datastax.com/docs/1.0/dml/using_cql I get:

2013-03-11 17:35:48,444 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0 (r1446324) compiled Feb 14 2013, 16:40:57
2013-03-11 17:35:48,445 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/cyril/pig_1363019748442.log
2013-03-11 17:35:48.583 java[13809:1203] Unable to load realm info from SCDynamicStore
2013-03-11 17:35:48,750 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/cyril/.pigbootup not found
2013-03-11 17:35:48,831 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-03-11 17:35:49,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage

with Pig 0.11.0. Any idea why the loadFunc function does not work correctly? Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
any idea why the function loadFunc does not work correctly ?

No, sorry. Not sure why you are linking to the CQL info, or what Pig script / config you are running. Did you follow the example in examples/pig in the source distribution? Also, please use at least Cassandra 1.1.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig/hadoop etc work with cassandra 1.1.8? Or only with 1.2?

all versions.

We are using astyanax library, which seems to fail horribly on 1.2,

How does it fail? If you think you have a bug, post it at https://github.com/Netflix/astyanax

Cheers
- Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com
Re: Pig / Map Reduce on Cassandra
What do you mean? It's not needed by Pig or Hive to access Cassandra data. Regards
On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote: You won't get CFS, but it's not a hard requirement, either.
Re: Pig / Map Reduce on Cassandra
CFS is Cassandra File System: http://www.datastax.com/dev/blog/cassandra-file-system-design But you don't need CFS to connect from PIG to Cassandra. The latest versions of Cassandra Source ship with examples of connecting from pig to cassandra: apache-cassandra-1.2.0-src/examples/pig -- http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz --Jimmy
Re: Pig / Map Reduce on Cassandra
Jimmy, I understand that CFS can replace HDFS for those who use Hadoop. I just want to use Pig and Hive on Cassandra. I know that pig samples are provided and work now with Cassandra natively (they are part of the core). However, does it mean that the process will be spread over nodes with number_of_mapper=number_of_nodes or something like that? Can Hive connect to Cassandra 1.2 easily too? -- Cyril Scetbon
Re: Pig / Map Reduce on Cassandra
This really depends on how you design your Hadoop cluster. In the testing I have done, the Hadoop and Cassandra nodes were collocated on the same hosts. Remember that Pig code runs inside of your Hadoop cluster, and connects to Cassandra as the database engine. I have not done any testing with Hive, so someone else will have to answer that question.
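As a rough illustration of how the map tasks line up with the ring: ColumnFamilyInputFormat builds its input splits from the token ranges and records the replica endpoints as preferred locations, so when task trackers are collocated with Cassandra nodes the maps tend to run local to their data, and the number of mappers follows the data volume and split size rather than simply number_of_nodes. A minimal sketch of the relevant knobs follows; it uses the 1.0/1.1-era org.apache.cassandra.hadoop.ConfigHelper method names (they were renamed in later versions), and the keyspace, column family and class names are placeholders, not anything from the thread.

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitSizeSketch {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "split-size-sketch");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            Configuration conf = job.getConfiguration();
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");
            ConfigHelper.setInitialAddress(conf, "localhost");
            ConfigHelper.setRpcPort(conf, "9160");
            ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");

            // Roughly this many rows per input split; lowering it produces more
            // map tasks per token range, so parallelism scales with data volume
            // rather than with the node count. A SlicePredicate must also be set
            // before the job will run, as in Jimmy's examples elsewhere in this
            // thread.
            ConfigHelper.setInputSplitSize(conf, 65536);
        }
    }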
Re: Pig / Map Reduce on Cassandra
OK, I understand that I need to manage both Cassandra and Hadoop components, and that Pig will use the Hadoop components to launch its tasks, which will use Cassandra as the storage engine. Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig hadoop etc work with Cassandra 1.1.8? Or only with 1.2? We are using the astyanax library, which seems to fail horribly on 1.2, so we're still on 1.1.8. But we're just starting out with this and I'm still debating between Cassandra and HBase, so I just want to know if there is a limitation here or not, as I have no idea when 1.2 support will exist in astyanax. That said, are there other Java (Scala) libraries that people use to connect to Cassandra that support 1.2? -James-
Pig / Map Reduce on Cassandra
Hi, I know that the DataStax Enterprise package provides Brisk, but is there a community version? Is it easy to interface Hadoop with Cassandra as the storage, or do we absolutely have to use Brisk for that? I know CassandraFS is natively available in Cassandra 1.2, the version I use, so is there a way/procedure to interface Hadoop with Cassandra as the storage? Thanks
Re: Pig / Map Reduce on Cassandra
Here are a few examples I have worked on, reading from xml.gz files then writing to Cassandra: https://github.com/jschappet/medline You will also need: https://github.com/jschappet/medline-base These examples are Hadoop jobs using Cassandra as the data store. This one is a good place to start: https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/LoadMedline/StartJob.java

    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, outputPath);
    job.setMapperClass(MapperToCassandra.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    LOG.info("Writing output to Cassandra");
    //job.setReducerClass(ReducerToCassandra.class);
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
    //org.apache.cassandra.dht.LocalPartitioner
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

--Jimmy
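For readers skimming the thread, here is a minimal sketch of what a mapper feeding ColumnFamilyOutputFormat typically looks like. This is not Jimmy's actual MapperToCassandra (that lives in the linked repository); the class, column name and record handling here are made up for illustration, and the Thrift types are the 1.0/1.1-era ones. The contract is that the map output key is the row key as a ByteBuffer and the value is a list of Thrift Mutations.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Collections;
    import java.util.List;

    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical stand-in for the MapperToCassandra referenced above:
    // one input line becomes one column write, keyed by the line offset.
    public class MapperToCassandraSketch
            extends Mapper<LongWritable, Text, ByteBuffer, List<Mutation>> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Column col = new Column();
            col.setName(ByteBufferUtil.bytes("body"));            // placeholder column name
            col.setValue(ByteBufferUtil.bytes(line.toString()));  // the raw record
            col.setTimestamp(System.currentTimeMillis());

            ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
            cosc.setColumn(col);
            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(cosc);

            // ColumnFamilyOutputFormat expects (row key, list of mutations).
            context.write(ByteBufferUtil.bytes(offset.toString()),
                          Collections.singletonList(mutation));
        }
    }

With job.setNumReduceTasks(0) this writes straight from the map phase; if a reducer is used instead, the same (ByteBuffer, List of Mutation) contract applies to the reduce output.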
Re: Pig / Map Reduce on Cassandra
I don't want to write to Cassandra, as it replicates data from another datacenter; I just want to use Hadoop jobs (Pig and Hive) to read data from it. I would like to use the same configuration as http://www.datastax.com/dev/blog/hadoop-mapreduce-in-the-cassandra-cluster but I want to know if there are alternatives to the DataStax Enterprise package. Thanks
Re: Pig / Map Reduce on Cassandra
Try this one then; it reads from Cassandra, then writes back to Cassandra, but you could change the write to wherever you would like. https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/ProcessXml/StartJob.java

    getConf().set(IN_COLUMN_NAME, columnName);
    Job job = new Job(getConf(), "ProcessRawXml");
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    job.setNumReduceTasks(0);
    job.setJarByClass(StartJob.class);
    job.setMapperClass(ParseMapper.class);
    job.setOutputKeyClass(ByteBuffer.class);
    //job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
    //org.apache.cassandra.dht.LocalPartitioner
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(columnName)));
    // SliceRange slice_range = new SliceRange();
    // slice_range.setStart(ByteBufferUtil.bytes(startPoint));
    // slice_range.setFinish(ByteBufferUtil.bytes(endPoint));
    // predicate.setSlice_range(slice_range);
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    job.waitForCompletion(true);
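For context, a minimal sketch of the read side. The real ParseMapper is in the linked repository; this stand-in simply emits name=value pairs, and it uses the 1.0/1.1-era types (IColumn was replaced in later versions), so the exact signature may differ on other releases. With ColumnFamilyInputFormat, each map() call receives one row key plus the columns selected by the configured SlicePredicate.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical stand-in for the ParseMapper referenced above: dumps each
    // selected column of the row as "name=value", keyed by the row key.
    public class ParseMapperSketch
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {

        @Override
        protected void map(ByteBuffer rowKey, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException {
            String key = ByteBufferUtil.string(rowKey.duplicate());
            for (IColumn column : columns.values()) {
                String name = ByteBufferUtil.string(column.name());
                String value = ByteBufferUtil.string(column.value());
                context.write(new Text(key), new Text(name + "=" + value));
            }
        }
    }

A job wired to this mapper would typically use a plain file output format; writing back to Cassandra instead means emitting (ByteBuffer, List of Mutation) as in the earlier sketch.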
Re: Pig / Map Reduce on Cassandra
Brisk is pretty much stagnant. I think someone forked it to work with 1.0 but not sure how that is going. You'll need to pay for DSE to get CFS (which is essentially Brisk) if you want to use any modern version of C*. Best, Michael
On 1/16/13 11:17 AM, cscetbon@orange.com wrote: Thanks I understand that your code uses the hadoop interface of Cassandra to be able to read from it with a job. However I would like to know how to bring pieces (hive + pig + hadoop) together with cassandra as the storage layer, not to get code to test it. I have found repository https://github.com/riptano/brisk which might be a good start for it Regards
Re: Pig / Map Reduce on Cassandra
Here is the point: you're right, this GitHub repository has not been updated for a year and a half. I thought Brisk was just a bundle of some technologies and that it was possible to install the same components and make them work together without using this bundle :(
Re: Pig / Map Reduce on Cassandra
On Wed, Jan 16, 2013 at 2:37 PM, cscetbon@orange.com wrote:
> I thought brisk was just a bundle of some technologies and that it was possible to install the same components and make them work together without using this bundle :(
You can install hadoop manually alongside Cassandra, as well as pig. Pig support is in C*'s tree in o.a.c.hadoop.pig. You won't get CFS, but it's not a hard requirement, either. -Brandon
Map Reduce and Cassandra with Trigger patch
I'm having some problems running a Map Reduce program using Cassandra as input. I already wrote some MapRed programs using Cassandra 1.0.9, but now I'm trying with an old version carrying a patch that supports triggers (this one: https://issues.apache.org/jira/browse/CASSANDRA-1311). When I try to run, it throws the following error: 12/11/26 16:59:06 ERROR config.DatabaseDescriptor: Fatal error: Cannot locate cassandra.yaml on the classpath I had this problem before, and the solution was just to add the path of cassandra.yaml to a system property, but now it's not working. I also saw somewhere that one solution would be adding the line set CLASSPATH=%CASSANDRA_HOME%\conf into /bin/cassandra-cli.bat, but it also didn't work. If someone has some idea of what to do I will be really thankful. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) "Most people do not listen with the intent to understand; they listen with the intent to reply" - Stephen Covey
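A sketch of the usual workaround, in case it helps: in the versions I have seen, DatabaseDescriptor looks at the cassandra.config system property (expected to be a URL) before falling back to the classpath, so pointing that property at the yaml is the standard fix; verify the property name against the patched source being run. The paths below and the Hadoop child-JVM property are placeholders/assumptions, and whether the driver JVM, the task JVMs, or both need it depends on which side actually touches DatabaseDescriptor.

    import org.apache.hadoop.conf.Configuration;

    public class YamlLocationSketch {
        public static void main(String[] args) {
            // Driver-side JVM: equivalent to passing -Dcassandra.config=... on
            // the command line. The file:// URL form and path are assumptions.
            System.setProperty("cassandra.config",
                    "file:///etc/cassandra/conf/cassandra.yaml");

            // If the error comes from inside the map tasks, the property has to
            // reach the child JVMs too (Hadoop 1.x property name):
            Configuration conf = new Configuration();
            conf.set("mapred.child.java.opts",
                    "-Dcassandra.config=file:///etc/cassandra/conf/cassandra.yaml");
            // ... then build the Job from this conf as in the other examples in
            // this archive.
        }
    }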
Re: Map/Reduce over Cassandra
Hey Bill, A few months ago we did an experiment with 5 Hadoop nodes pulling from 4 Cassandra nodes. It was pulling down one column family with 8 small columns, just dumping the raw data to HDFS. It was cycling through around 17K map tasks per sec. The machines weren't being taxed too hard, so I'm sure there's some concurrency tuning we could have done to speed that up. Unfortunately we don't have that same data on HDFS yet, so I can't really give a direct comparison. Hope that helps. I'm curious what others have seen as well. On Tue, Aug 17, 2010 at 6:59 PM, Bill Hastings bllhasti...@gmail.com wrote: Hi All How performant is M/R on Cassandra when compared to running it on HDFS? Anyone have any numbers they can share? Specifically how much of data the M/R job was run against and what was the throughput etc. Any information would be very helpful. -- Cheers Bill
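For anyone repeating this kind of test, the knobs that usually decide how hard the map tasks drive Cassandra are the per-call row batch on the input side and the Hadoop-side concurrency. A minimal sketch against the old ConfigHelper follows; the method and property names are from the 1.0/1.1-era API and may differ in other versions, and the numbers are arbitrary starting points, not recommendations.

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class ThroughputKnobsSketch {
        public static void configure(Configuration jobConf) {
            // Rows fetched per get_range_slices call by each map task; larger
            // batches mean fewer round trips but more memory per call.
            ConfigHelper.setRangeBatchSize(jobConf, 4096);

            // How many of those map tasks run at once is a cluster-side setting
            // (mapred.tasktracker.map.tasks.maximum in mapred-site.xml on each
            // TaskTracker), not something the job configuration itself can raise.
        }
    }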
Map/Reduce over Cassandra
Hi All, How performant is M/R on Cassandra when compared to running it on HDFS? Anyone have any numbers they can share? Specifically, how much data the M/R job was run against and what the throughput was, etc. Any information would be very helpful. -- Cheers Bill