Hey Marcelo,

You should check out Spark.  It intelligently deals with a lot of the
issues you're mentioning.  Al Tobey did a walkthrough of how to set up
the OSS side of things here:
http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
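
As a rough sketch of what the job ends up looking like (this assumes the
DataStax spark-cassandra-connector; the keyspace, table, and column names
here are just placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._     // reduceByKey implicits (Spark 1.x)
    import com.datastax.spark.connector._      // adds cassandraTable() to SparkContext

    object IdentifyJob {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("cassandra-identify")                     // placeholder name
          .set("spark.cassandra.connection.host", "127.0.0.1")  // a node in your cluster
        val sc = new SparkContext(conf)

        // The connector splits the table by token range on its own, so you
        // don't hand-tune input splits the way you do with the Hadoop formats.
        val rows = sc.cassandraTable("my_keyspace", "users")    // placeholder table
        val counts = rows
          .map(row => (row.getString("email"), 1))              // placeholder column
          .reduceByKey(_ + _)
        counts.take(10).foreach(println)
      }
    }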

It'll be less work than writing an M/R framework from scratch :)
Jon


On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle
<marc...@s1mbi0se.com.br> wrote:
> Hi,
>
> I need to execute a map/reduce job to identify data stored in Cassandra
> before indexing this data into Elasticsearch.
>
> I have already used ColumnFamilyInputFormat (before I started using CQL) to
> write Hadoop jobs for that, but I used to have a lot of trouble with tuning,
> as Hadoop depends on how map tasks are split in order to successfully
> execute things in parallel, especially for IO-bound processes.
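>
> For reference, most of the tuning I mean is the input split size. A rough
> sketch of how the job setup looks (keyspace/CF names are placeholders):
>
>     import org.apache.hadoop.conf.Configuration
>     import org.apache.hadoop.mapreduce.Job
>     import org.apache.cassandra.hadoop.ConfigHelper
>
>     val job = Job.getInstance(new Configuration(), "cassandra-mr")  // placeholder name
>     val conf = job.getConfiguration
>     ConfigHelper.setInputInitialAddress(conf, "127.0.0.1")  // a node in the cluster
>     ConfigHelper.setInputRpcPort(conf, "9160")
>     ConfigHelper.setInputColumnFamily(conf, "my_keyspace", "my_cf")  // placeholders
>     ConfigHelper.setInputPartitioner(conf, "Murmur3Partitioner")
>     // The knob that drives parallelism: rows per input split. Too large and
>     // a few mappers do all the IO; too small and task overhead dominates.
>     ConfigHelper.setInputSplitSize(conf, 65536)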
>
> My first question is: am I the only one having problems with that? Is
> anyone else running Hadoop jobs that read from Cassandra in production?
>
> My second question is about the alternatives. I saw the new version of
> Spark will have Cassandra support, but via CqlPagingInputFormat, from
> Hadoop. I tried to use Hive with Cassandra Community, but it seems it only
> works with Cassandra Enterprise, and it doesn't do more than Facebook's
> Presto (http://prestodb.io/), which we have been using to read from
> Cassandra; so far it has been great for SQL-like queries. For custom
> map/reduce jobs, however, it is not enough.
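>
> In case it is useful, my understanding is that Spark can reuse that same
> Hadoop input format directly, something like the following (sc is a
> SparkContext, conf is the same Hadoop Configuration as in the setup above;
> the key/value types mirror CqlPagingInputFormat's signature):
>
>     import java.nio.ByteBuffer
>     import java.util.{Map => JMap}
>     import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat
>
>     val rdd = sc.newAPIHadoopRDD(
>       conf,
>       classOf[CqlPagingInputFormat],
>       classOf[JMap[String, ByteBuffer]],   // partition key columns
>       classOf[JMap[String, ByteBuffer]])   // remaining columns
>     println(rdd.count())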
>
> Does anyone know of another tool that performs M/R on Cassandra? My
> impression is that most tools were created to work on top of HDFS, and
> that reading from a NoSQL database is some kind of "workaround".
>
> My third question is about how these tools work. Most of them write mapped
> data to intermediate storage, then the data is shuffled and sorted, and
> then it is reduced. Even when using CqlPagingInputFormat, if you are using
> Hadoop it will write intermediate files (to local disk) after the mapping
> phase, shuffle and sort that data, and then reduce it.
>
> I wonder if a tool supporting Cassandra out of the box wouldn't be
> smarter. Is it faster to write all your data to a file and then sort it,
> or to batch-insert the data and index it as it arrives, as happens when
> you store data in a Cassandra CF? I didn't do the calculations to check
> the complexity of each approach, which should take into account that no
> single index in Cassandra would be really large, as the maximum index size
> will always depend on the maximum capacity of a single host. But my guess
> is that a map/reduce tool written specifically for Cassandra, from the
> beginning, could perform much better than a tool written for HDFS and then
> adapted. I hear people say map/reduce on Cassandra/HBase is usually 30%
> slower than M/R on HDFS. Does that really make sense? Should we expect a
> result like this?
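>
> (A rough back-of-the-envelope, which may be missing something: sorting n
> records costs O(n log n), and inserting n records into a structure that
> keeps them ordered, like a memtable/SSTable, is n inserts at O(log n)
> each, i.e. also O(n log n). So on paper the two look equivalent, and the
> difference would come down to constants, IO patterns, and compaction
> overhead.)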
>
> Final question: do you think writing a new M/R tool like the one described
> would be reinventing the wheel? Or does it make sense?
>
> Thanks in advance. Any opinions on this subject will be much appreciated.
>
> Best regards,
> Marcelo Valle.



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
