I haven't tried PySpark yet, but it's part of the distribution.  My
main language is Python too, so I intend to get deep into it.

On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle
<marc...@s1mbi0se.com.br> wrote:
> Hi Jonathan,
>
> Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra
> will only be supported in the next version, but I would like to be wrong...
>
> Best regards,
> Marcelo Valle.
>
>
>
> 2014-07-21 13:06 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>> Hey Marcelo,
>>
>> You should check out Spark.  It intelligently deals with a lot of the
>> issues you're mentioning.  Al Tobey did a walkthrough of how to set up
>> the OSS side of things here:
>>
>> http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
>>
>> It'll be less work than writing a M/R framework from scratch :)
>> Jon
>>
>>
>> On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle
>> <marc...@s1mbi0se.com.br> wrote:
>> > Hi,
>> >
>> > I need to execute a map/reduce job to identify data stored in
>> > Cassandra before indexing this data in Elasticsearch.
>> >
>> > I have already used ColumnFamilyInputFormat (before I started using CQL)
>> > to write Hadoop jobs for this, but I used to have a lot of trouble
>> > tuning them, because Hadoop depends on how map tasks are split in order
>> > to successfully execute things in parallel for IO-bound processes.
>> >
>> > The first question is: am I the only one having problems with that? Is
>> > anyone else using Hadoop jobs that read from Cassandra in production?
>> >
>> > The second question is about the alternatives. I saw that the new
>> > version of Spark will have Cassandra support, but using
>> > CqlPagingInputFormat, from Hadoop. I tried to use Hive with Cassandra
>> > community, but it seems it only works with Cassandra Enterprise and
>> > doesn't do more than Facebook's Presto (http://prestodb.io/), which we
>> > have been using to read from Cassandra, and so far it has been great
>> > for SQL-like queries. For custom map/reduce jobs, however, it is not
>> > enough.
>> >
>> > Does anyone know of another tool that performs M/R on Cassandra? My
>> > impression is that most tools were created to work on top of HDFS, and
>> > reading from a NoSQL DB is some kind of "workaround".
>> >
>> > The third question is about how these tools work. Most of them write
>> > mapped data to intermediate storage, then the data is shuffled and
>> > sorted, then it is reduced. Even when using CqlPagingInputFormat, if
>> > you are using Hadoop it will write files to HDFS after the mapping
>> > phase, shuffle and sort this data, and then reduce it.
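
[Editor's illustration] The map, shuffle/sort, reduce flow described in the third question can be sketched in plain Python. This is only a minimal in-memory sketch, not how Hadoop or Spark implement it: there is no intermediate storage and no distribution, and every name here (map_phase, shuffle_and_sort, reduce_phase, run_job, word_mapper) is hypothetical and chosen for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, mapper):
    """Apply the mapper to each record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle_and_sort(pairs):
    """Sort pairs by key so that equal keys become adjacent.

    A real framework would spill this data to intermediate storage
    and move it between nodes; here it is a single in-memory sort.
    """
    return sorted(pairs, key=itemgetter(0))


def reduce_phase(sorted_pairs, reducer):
    """Group adjacent pairs by key and reduce each group's values."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, reducer(value for _, value in group)


def run_job(records, mapper, reducer):
    """Run map, then shuffle/sort, then reduce, returning a dict."""
    mapped = map_phase(records, mapper)
    return dict(reduce_phase(shuffle_and_sort(mapped), reducer))


def word_mapper(line):
    """Hypothetical mapper: emit (word, 1) for each word in a line."""
    for word in line.split():
        yield word, 1


counts = run_job(["a b a", "b c"], word_mapper, sum)
print(counts)
```

The sort step is where the cost discussed in this thread lives: the reduce phase cannot start on a key until all values for that key have been brought together.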
>> >
>> > I wonder if a tool supporting Cassandra out of the box wouldn't be
>> > smarter. Is it faster to write all your data to a file and then sort
>> > it, or to batch insert the data and index it as you go, as happens when
>> > you store data in a Cassandra CF? I didn't do the calculations to check
>> > the complexity of each one; one should consider that no index in
>> > Cassandra would be really large, as the maximum index size will always
>> > depend on the maximum capacity of a single host. But my guess is that a
>> > map/reduce tool written specifically for Cassandra, from the beginning,
>> > could perform much better than a tool written for HDFS and adapted. I
>> > hear people saying Map/Reduce on Cassandra/HBase is usually 30% slower
>> > than M/R on HDFS. Does that really make sense? Should we expect a
>> > result like this?
>> >
>> > Final question: do you think writing a new M/R tool like the one
>> > described would be reinventing the wheel? Or does it make sense?
>> >
>> > Thanks in advance. Any opinions about this subject will be much
>> > appreciated.
>> >
>> > Best regards,
>> > Marcelo Valle.
>>
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>



-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
