I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend to get deep into it.
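Since the thread below discusses what a hand-rolled M/R framework would have to do, here is the map → shuffle/sort → reduce pipeline sketched as a toy in plain Python (a single-process word count, just an illustration of the phases, not any framework's actual API):

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy single-process map/reduce: map, shuffle by key, then reduce."""
    # Map phase: emit (key, value) pairs for every input record.
    mapped = [pair for record in records for pair in map_fn(record)]
    # Shuffle phase: group values by key. A real framework partitions,
    # spills to disk, and sorts here -- exactly the step discussed below.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce phase: fold each key's values into a single result.
    return {key: reduce_fn(key, values) for key, values in sorted(groups.items())}

# Word count, the canonical example.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

counts = run_mapreduce(["a b a", "b c"], mapper, reducer)
print(counts)  # {'a': 2, 'b': 2, 'c': 1}
```

A distributed engine replaces the in-memory `groups` dict with partitioned, sorted spill files shipped between nodes, which is where most of the tuning pain described below comes from.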
On Mon, Jul 21, 2014 at 9:38 AM, Marcelo Elias Del Valle
<marc...@s1mbi0se.com.br> wrote:
> Hi Jonathan,
>
> Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra
> will be supported only in the next version, but I would like to be wrong...
>
> Best regards,
> Marcelo Valle.
>
>
> 2014-07-21 13:06 GMT-03:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>> Hey Marcelo,
>>
>> You should check out Spark. It intelligently deals with a lot of the
>> issues you're mentioning. Al Tobey did a walkthrough of how to set up
>> the OSS side of things here:
>>
>> http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html
>>
>> It'll be less work than writing an M/R framework from scratch :)
>> Jon
>>
>>
>> On Mon, Jul 21, 2014 at 8:24 AM, Marcelo Elias Del Valle
>> <marc...@s1mbi0se.com.br> wrote:
>> > Hi,
>> >
>> > I need to run a map/reduce job to identify data stored in Cassandra
>> > before indexing that data into Elasticsearch.
>> >
>> > I have already used ColumnFamilyInputFormat (before I started using CQL)
>> > to write Hadoop jobs for this, but I used to have a lot of trouble with
>> > tuning, as Hadoop depends on how map tasks are split in order to
>> > successfully execute things in parallel for IO-bound processes.
>> >
>> > First question: am I the only one having problems with that? Is anyone
>> > else running Hadoop jobs that read from Cassandra in production?
>> >
>> > Second question is about the alternatives. I saw that the new version of
>> > Spark will have Cassandra support, but via CqlPagingInputFormat, from
>> > Hadoop. I tried to use Hive with Cassandra Community, but it seems it
>> > only works with Cassandra Enterprise and doesn't do more than Facebook's
>> > Presto (http://prestodb.io/), which we have been using to read from
>> > Cassandra; so far it has been great for SQL-like queries. For custom
>> > map/reduce jobs, however, it is not enough.
>> >
>> > Does anyone know some other tool that performs M/R on Cassandra? My
>> > impression is that most tools were created to work on top of HDFS, and
>> > reading from a NoSQL database is some kind of "workaround".
>> >
>> > Third question is about how these tools work. Most of them write mapped
>> > data to intermediate storage, then the data is shuffled and sorted, and
>> > then it is reduced. Even with CqlPagingInputFormat, if you are using
>> > Hadoop it will write files to HDFS after the map phase, shuffle and sort
>> > that data, and then reduce it.
>> >
>> > I wonder if a tool supporting Cassandra out of the box wouldn't be
>> > smarter. Is it faster to write all your data to a file and then sort it,
>> > or to batch-insert the data so it is already indexed on write, as happens
>> > when you store data in a Cassandra CF? I didn't do the calculations to
>> > compare the complexity of each approach; one should also consider that an
>> > index in Cassandra would be really large, as the maximum index size will
>> > always depend on the capacity of a single host. Still, my guess is that a
>> > map/reduce tool written for Cassandra from the beginning could perform
>> > much better than a tool written for HDFS and adapted. I hear people say
>> > map/reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS.
>> > Does that really make sense? Should we expect a result like this?
>> >
>> > Final question: do you think writing a new M/R tool as described would be
>> > reinventing the wheel? Or does it make sense?
>> >
>> > Thanks in advance. Any opinions on this subject will be much appreciated.
>> >
>> > Best regards,
>> > Marcelo Valle.
>>
>>
>> --
>> Jon Haddad
>> http://www.rustyrazorblade.com
>> skype: rustyrazorblade
>
>

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
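On Marcelo's third question (write everything then sort, versus keeping the structure sorted on every insert), the trade-off can be sketched with a toy Python comparison. This has nothing to do with either system's actual storage engine; it only shows that both strategies yield the same result with different cost profiles:

```python
import bisect
import random

def sort_at_end(items):
    # "Write a file, then sort it": one big O(n log n) sort after all writes,
    # roughly what Hadoop's shuffle does with spill files.
    out = list(items)
    out.sort()
    return out

def insert_sorted(items):
    # "Already indexed on insert": the structure stays ordered after every
    # write. With a flat Python list each insort shifts elements, so this is
    # O(n^2) overall; an LSM engine like Cassandra's amortizes that cost with
    # sorted memtables/SSTables and background merges instead.
    out = []
    for item in items:
        bisect.insort(out, item)
    return out

data = [random.randrange(1000) for _ in range(500)]
assert sort_at_end(data) == insert_sorted(data)  # same output, different cost
```

So the question is less "which is asymptotically faster" than how well each engine hides the cost: sort-at-end pays once per batch, sorted-on-insert pays per write but leaves the data immediately queryable.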