map reduce for Cassandra
Hi, I need to run a map/reduce job over identity data stored in Cassandra before indexing this data into Elasticsearch. I have already used ColumnFamilyInputFormat (before starting to use CQL) to write Hadoop jobs for that, but I used to have a lot of trouble with tuning, as Hadoop depends on how map tasks are split in order to successfully execute things in parallel for IO-bound processes.

First question: am I the only one having problems with that? Is anyone else running Hadoop jobs that read from Cassandra in production?

Second question is about the alternatives. I saw that the new version of Spark will have Cassandra support, but through CqlPagingInputFormat, from the Hadoop integration. I tried to use Hive with Cassandra Community, but it seems it only works with Cassandra Enterprise and doesn't do more than FB Presto (http://prestodb.io/), which we have been using to read from Cassandra; so far it has been great for SQL-like queries. For custom map/reduce jobs, however, it is not enough. Does anyone know some other tool that performs M/R on Cassandra? My impression is that most tools were created to work on top of HDFS, and reading from a NoSQL DB is some kind of workaround.

Third question is about how these tools work. Most of them write mapped data to intermediate storage, then the data is shuffled and sorted, then it is reduced. Even when using CqlPagingInputFormat, if you are using Hadoop it will write files to HDFS after the mapping phase, shuffle and sort that data, and then reduce it. I wonder if a tool supporting Cassandra out of the box wouldn't be smarter. Is it faster to write all your data to a file and then sort it, or to batch-insert the data so it is already indexed, as happens when you store data in a Cassandra CF? I didn't do the math on the complexity of each approach (it would have to account for the fact that no single index in Cassandra can be really large, as the maximum index size will always depend on the capacity of a single host), but my guess is that a map/reduce tool written specifically for Cassandra, from the beginning, could perform much better than a tool written for HDFS and adapted. I hear people saying map/reduce on Cassandra/HBase is usually 30% slower than M/R on HDFS. Does that really make sense? Should we expect a result like this?

Final question: do you think writing a new M/R tool as described would be reinventing the wheel, or does it make sense?

Thanks in advance. Any opinions on this subject will be greatly appreciated.

Best regards,
Marcelo Valle.
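For illustration, a minimal sketch of the kind of Hadoop job configuration being discussed, assuming the Cassandra 1.2/2.0-era Hadoop classes and the Hadoop 2 Job API; the contact address, keyspace ("identity") and column family ("users") are placeholders, and the exact ConfigHelper/CqlConfigHelper method names may vary slightly between Cassandra versions. The split size (rows per split) is the main knob that decides how many map tasks are created, which is exactly the tuning pain mentioned above.

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CassandraMrJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "identity-data-indexer"); // hypothetical job name
        job.setJarByClass(CassandraMrJob.class);

        // Any live node works as the initial contact point.
        ConfigHelper.setInputInitialAddress(job.getConfiguration(), "10.0.0.1");
        ConfigHelper.setInputRpcPort(job.getConfiguration(), "9160");
        ConfigHelper.setInputPartitioner(job.getConfiguration(),
                "org.apache.cassandra.dht.Murmur3Partitioner");

        // Keyspace / column family to read (placeholders).
        ConfigHelper.setInputColumnFamily(job.getConfiguration(), "identity", "users");

        // Tuning knobs: rows per input split (one split = one map task) and the
        // CQL page size per request. Too few splits starves parallelism on
        // IO-bound jobs; too many adds task-scheduling overhead.
        ConfigHelper.setInputSplitSize(job.getConfiguration(), 64 * 1024);
        CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

        job.setInputFormatClass(CqlPagingInputFormat.class);
        // job.setMapperClass(...); job.setReducerClass(...); output config omitted.

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}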
Re: map reduce for Cassandra
Hey Marcelo, You should check out Spark. It intelligently deals with a lot of the issues you're mentioning. Al Tobey did a walkthrough of how to set up the OSS side of things here: http://tobert.github.io/post/2014-07-15-installing-cassandra-spark-stack.html It'll be less work than writing an M/R framework from scratch :)

Jon

-- Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
Re: map reduce for Cassandra
Hi Jonathan, Do you know if this RDD can be used with Python? AFAIK, Python + Cassandra will only be supported in the next version, but I would like to be wrong... Best regards, Marcelo Valle.
Re: map reduce for Cassandra
I haven't tried pyspark yet, but it's part of the distribution. My main language is Python too, so I intend to get deep into it.
Re: map reduce for Cassandra
Jonathan, From what I have read in the docs, the Python API still has some limitations; it is not possible to use arbitrary Hadoop binary input formats yet. The Python example for Cassandra is only in the master branch: https://github.com/apache/spark/blob/master/examples/src/main/python/cassandra_inputformat.py

I may be lacking knowledge of Spark, but if I understood it correctly, access to Cassandra data is still made through CqlPagingInputFormat, from the Hadoop integration. Here is where I ask: even if Spark supports Cassandra, will it be fast enough? My understanding (please correct me if I am wrong) is that when you insert N items into a Cassandra CF, you are executing N binary searches to insert each item already indexed by its key. When you read the data back, it's already sorted. So you take O(N log N) (binary-search complexity) to insert all the data already sorted. However, using a fast sort algorithm you also take O(N log N) to sort the data after it has been written, but with more IO.

If I write a job in Spark / Java against Cassandra, how will the mapped data be stored and sorted? Will it be stored in Cassandra too? Will Spark run a sort after the mapping?

Best regards, Marcelo.
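To make the Spark question concrete: Spark can reuse the same Hadoop input format through newAPIHadoopRDD, so reads do go through CqlPagingInputFormat, but any shuffle after the map runs through Spark's own machinery (memory plus local shuffle files) rather than HDFS, and a sort only happens if the job asks for one. A rough Java sketch, assuming the same 1.2-era Cassandra classes as in the earlier sketch; the keyspace, table and address are placeholders, and records arrive as Map<String, ByteBuffer> pairs:

import java.nio.ByteBuffer;
import java.util.Map;

import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlConfigHelper;
import org.apache.cassandra.hadoop.cql3.CqlPagingInputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkCassandraRead {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("cassandra-read-sketch"));

        // Same Hadoop-side configuration a plain MapReduce job would use.
        Configuration conf = new Configuration();
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.Murmur3Partitioner");
        ConfigHelper.setInputColumnFamily(conf, "identity", "users"); // placeholders
        CqlConfigHelper.setInputCQLPageRowSize(conf, "1000");

        // Each record is (partition key columns -> values, regular columns -> values).
        @SuppressWarnings("unchecked")
        Class<Map<String, ByteBuffer>> mapClass =
                (Class<Map<String, ByteBuffer>>) (Class<?>) Map.class;
        JavaPairRDD<Map<String, ByteBuffer>, Map<String, ByteBuffer>> rows =
                sc.newAPIHadoopRDD(conf, CqlPagingInputFormat.class, mapClass, mapClass);

        // Transformations such as reduceByKey shuffle through Spark's local
        // shuffle files rather than HDFS; no global sort is imposed on you.
        System.out.println("rows read: " + rows.count());
        sc.stop();
    }
}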
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 10:54 AM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: My understanding (please correct me if I am wrong) is that when you insert N items into a Cassandra CF, you are executing N binary searches to insert each item already indexed by its key. When you read the data back, it's already sorted. So you take O(N log N) (binary-search complexity) to insert all the data already sorted.

You're wrong, unless you're talking about insertion into a memtable, which you probably aren't, and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable data files, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make sense in light of this?

=Rob
Re: map reduce for Cassandra
Hi Robert, First of all, thanks for answering.

2014-07-21 20:18 GMT-03:00 Robert Coli rc...@eventbrite.com: You're wrong, unless you're talking about insertion into a memtable, which you probably aren't and which probably doesn't actually work that way enough to be meaningful. On disk, Cassandra has immutable datafiles, from which row fragments are merged into a row at read time. I'm pretty sure the rest of the stuff you said doesn't make any sense in light of this?

Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert into Cassandra against the time to sort in Hadoop.

To make it clearer: Hadoop has its own partitioner, which is used after the map phase. The map output is written locally on each Hadoop node, then it's shuffled from one node to the other (see slide 17 in this presentation: http://pt.slideshare.net/j_singh/nosql-and-mapreduce). In other words, you may read Cassandra data in Hadoop, but the intermediate results are still stored in HDFS. Instead of using the Hadoop partitioner, I would like to store the intermediate results in a Cassandra CF, so the map output would go directly to an intermediate column family via batch inserts, instead of being written to local disk first and then shuffled to the right node. Therefore, the mapper would write its output the same way all data enters Cassandra: first into a memtable, then flushed to an sstable, then read during the reduce phase. Shouldn't that be faster than storing intermediate results in HDFS?

Best regards, Marcelo.
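To illustrate that idea (this is not an existing Hadoop OutputFormat, only a sketch of the write path): a mapper could push its output straight into an intermediate table with the DataStax Java driver (2.x-era API), so the "shuffle" becomes ordinary Cassandra writes and the reduce side reads each key's values back already clustered and sorted. The keyspace, table, columns and batch size below are all hypothetical, and in practice batches should be kept small and ideally grouped by partition key to avoid coordinator overhead.

import java.nio.ByteBuffer;

import com.datastax.driver.core.BatchStatement;
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.PreparedStatement;
import com.datastax.driver.core.Session;

public class IntermediateCfWriter {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
        Session session = cluster.connect("mr_scratch"); // hypothetical keyspace

        // Hypothetical intermediate table:
        //   CREATE TABLE map_output (map_key text, seq timeuuid, value blob,
        //                            PRIMARY KEY (map_key, seq));
        // Rows land in the memtable already ordered by (map_key, seq), so the
        // reduce phase can read each map_key back as one sorted slice.
        PreparedStatement insert = session.prepare(
                "INSERT INTO map_output (map_key, seq, value) VALUES (?, now(), ?)");

        BatchStatement batch = new BatchStatement(BatchStatement.Type.UNLOGGED);
        for (int i = 0; i < 100; i++) {                // stand-in for map() emits
            String mapKey = "key-" + (i % 10);
            ByteBuffer mappedValue = ByteBuffer.wrap(("value-" + i).getBytes());
            batch.add(insert.bind(mapKey, mappedValue));
            if (batch.size() >= 50) {                  // flush in small batches
                session.execute(batch);
                batch.clear();
            }
        }
        if (batch.size() > 0) {
            session.execute(batch);
        }
        cluster.close();
    }
}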
Re: map reduce for Cassandra
On Mon, Jul 21, 2014 at 5:45 PM, Marcelo Elias Del Valle marc...@s1mbi0se.com.br wrote: Although several sstables (disk fragments) may have the same row key, inside a single sstable row keys and column keys are indexed, right? Otherwise, doing a GET in Cassandra would take some time. From the M/R perspective, I was referring to the memtable, as I am trying to compare the time to insert into Cassandra against the time to sort in Hadoop.

I was confused, because unless you are using the new in-memory column families, which I believe are only available in DSE, there is no way to ensure that any given row stays in a memtable. Very rarely does anyone look at the function of a memtable caring only about its properties and not the closely related properties of SSTables. However, yours is one of those cases; I see now why your question makes sense: you only care about the memtable for how quickly it sorts.

But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra? I'm certainly not an expert in this area of Cassandra... but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets... are you planning to use DSE?

=Rob
Re: map reduce for Cassandra
Hi,

But if you are only relying on memtables to sort writes, that seems like a pretty heavyweight reason to use Cassandra?

Actually, it's not a reason to use Cassandra. I already use Cassandra and I need to map/reduce data from it. I am trying to decide between using the conventional M/R tools and building a tool specific to Cassandra.

but Cassandra, as a datastore with immutable data files, is not typically a good choice for short-lived intermediate result sets...

Indeed, but so far I am seeing it as the best option. If storing these intermediate files in HDFS is better, then I agree there is no reason to consider Cassandra for it.

are you planning to use DSE?

Our company will probably hire DSE support when it reaches some size, but DSE as a product doesn't seem interesting for our case so far. The only tool that would help me at this moment would be Hive, but honestly I didn't like the way DSE supports Hive, and I don't want to use a solution not available in DSC (see http://stackoverflow.com/questions/23959169/problems-using-hive-cassandra-community for details).

[]s
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig/hadoop etc work with Cassandra 1.1.8, or only with 1.2? We are using the astyanax library, which seems to fail horribly on 1.2, so we're still on 1.1.8. But we're just starting out with this, and I'm still debating between Cassandra and HBase. So I just want to know if there is a limitation here or not, as I have no idea when 1.2 support will exist in astyanax. That said, are there other Java (Scala) libraries that people use to connect to Cassandra that support 1.2? -James-
Re: Pig / Map Reduce on Cassandra
Ok, I understand that I need to manage both Cassandra and Hadoop components, and that Pig will use the Hadoop components to launch its tasks, which will use Cassandra as the storage engine. Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Ok, forget it. It was a mix of mistakes: environment variables not set, a package name not added in the script, and libraries not found. Regards -- Cyril SCETBON

On Mar 12, 2013, at 10:43 AM, cscetbon@orange.com wrote: I'm already using Cassandra 1.2.2 with only one line to test the Cassandra access: rows = LOAD 'cassandra://twissandra/users' USING org.apache.cassandra.hadoop.pig.CassandraStorage(); extracted from the sample script provided in the sources -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Jimmy, I understand that CFS can replace HDFS for those who use Hadoop. I just want to use Pig and Hive on Cassandra. I know that Pig samples are provided and now work with Cassandra natively (they are part of the core). However, does it mean that the processing will be spread over the nodes, with number_of_mappers = number_of_nodes or something like that? Can Hive connect to Cassandra 1.2 easily too? -- Cyril Scetbon
Re: Pig / Map Reduce on Cassandra
On Jan 17, 2013, at 4:03 PM, James Schappet jschap...@gmail.com wrote: This really depends on how you design your Hadoop cluster. The testing I have done had Hadoop and Cassandra nodes collocated on the same hosts. Remember that Pig code runs inside your Hadoop cluster and connects to Cassandra as the database engine. I have not done any testing with Hive, so someone else will have to answer that question.
Re: Pig / Map Reduce on Cassandra
You said all versions. However, when I try to access cassandra://twissandra/users based on http://www.datastax.com/docs/1.0/dml/using_cql I get:

2013-03-11 17:35:48,444 [main] INFO org.apache.pig.Main - Apache Pig version 0.11.0 (r1446324) compiled Feb 14 2013, 16:40:57
2013-03-11 17:35:48,445 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/cyril/pig_1363019748442.log
2013-03-11 17:35:48.583 java[13809:1203] Unable to load realm info from SCDynamicStore
2013-03-11 17:35:48,750 [main] INFO org.apache.pig.impl.util.Utils - Default bootup file /Users/cyril/.pigbootup not found
2013-03-11 17:35:48,831 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2013-03-11 17:35:49,235 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2245: Cannot get schema from loadFunc org.apache.cassandra.hadoop.pig.CassandraStorage

with Pig 0.11.0. Any idea why the loadFunc function does not work correctly? Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
any idea why the function loadFunc does not work correctly ?

No, sorry. Not sure why you are linking to the CQL info, or what Pig script / config you are running. Did you follow the example in examples/pig in the source distribution? Also, please use at least Cassandra 1.1.

Cheers
- Aaron Morton
Freelance Cassandra Consultant
New Zealand
@aaronmorton
http://www.thelastpickle.com
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig/hadoop etc work with cassandra 1.1.8? Or only with 1.2?

all versions.

We are using astyanax library, which seems to fail horribly on 1.2,

How does it fail? If you think you have a bug, post it at https://github.com/Netflix/astyanax

Cheers
- Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com
Re: Pig / Map Reduce on Cassandra
What do you mean? It's not needed by Pig or Hive to access Cassandra data. Regards
On Jan 16, 2013, at 11:14 PM, Brandon Williams dri...@gmail.com wrote: You won't get CFS, but it's not a hard requirement, either.
Re: Pig / Map Reduce on Cassandra
CFS is Cassandra File System: http://www.datastax.com/dev/blog/cassandra-file-system-design But you don't need CFS to connect from PIG to Cassandra. The latest versions of Cassandra Source ship with examples of connecting from pig to cassandra: apache-cassandra-1.2.0-src/examples/pig -- http://www.apache.org/dyn/closer.cgi?path=/cassandra/1.2.0/apache-cassandra-1.2.0-src.tar.gz --Jimmy
Re: Pig / Map Reduce on Cassandra
Jimmy, I understand that CFS can replace HDFS for those who use Hadoop. I just want to use Pig and Hive on Cassandra. I know that pig samples are provided and work now with Cassandra natively (they are part of the core). However, does it mean that the process will be spread over nodes with number_of_mapper=number_of_nodes or something like that? Can Hive connect to Cassandra 1.2 easily too? -- Cyril Scetbon
Re: Pig / Map Reduce on Cassandra
This really depends on how you design your Hadoop cluster. In the testing I have done, the Hadoop and Cassandra nodes were collocated on the same hosts. Remember that Pig code runs inside of your Hadoop cluster, and connects to Cassandra as the database engine. I have not done any testing with Hive, so someone else will have to answer that question.
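As a rough illustration of how the map tasks line up with the ring: ColumnFamilyInputFormat builds its input splits from the token ranges and records the replica endpoints as preferred locations, so when task trackers are collocated with Cassandra nodes the maps tend to run local to their data, and the number of mappers follows the data volume and split size rather than simply number_of_nodes. A minimal sketch of the relevant knobs follows; it uses the 1.0/1.1-era org.apache.cassandra.hadoop.ConfigHelper method names (they were renamed in later versions), and the keyspace, column family and class names are placeholders, not anything from the thread.

    import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SplitSizeSketch {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "split-size-sketch");
            job.setInputFormatClass(ColumnFamilyInputFormat.class);

            Configuration conf = job.getConfiguration();
            ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");
            ConfigHelper.setInitialAddress(conf, "localhost");
            ConfigHelper.setRpcPort(conf, "9160");
            ConfigHelper.setPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");

            // Roughly this many rows per input split; lowering it produces more
            // map tasks per token range, so parallelism scales with data volume
            // rather than with the node count. A SlicePredicate must also be set
            // before the job will run, as in Jimmy's examples elsewhere in this
            // thread.
            ConfigHelper.setInputSplitSize(conf, 65536);
        }
    }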
Re: Pig / Map Reduce on Cassandra
OK, I understand that I need to manage both Cassandra and Hadoop components, and that Pig will use the Hadoop components to launch its tasks, which will use Cassandra as the storage engine. Thanks -- Cyril SCETBON
Re: Pig / Map Reduce on Cassandra
Silly question -- but does hive/pig hadoop etc work with Cassandra 1.1.8? Or only with 1.2? We are using the astyanax library, which seems to fail horribly on 1.2, so we're still on 1.1.8. But we're just starting out with this and I'm still debating between Cassandra and HBase, so I just want to know if there is a limitation here or not, as I have no idea when 1.2 support will exist in astyanax. That said, are there other Java (Scala) libraries that people use to connect to Cassandra that support 1.2? -James-
Pig / Map Reduce on Cassandra
Hi, I know that the DataStax Enterprise package provides Brisk, but is there a community version? Is it easy to interface Hadoop with Cassandra as the storage, or do we absolutely have to use Brisk for that? I know CassandraFS is natively available in Cassandra 1.2, the version I use, so is there a way/procedure to interface Hadoop with Cassandra as the storage? Thanks
Re: Pig / Map Reduce on Cassandra
Here are a few examples I have worked on, reading from xml.gz files then writing to Cassandra: https://github.com/jschappet/medline You will also need: https://github.com/jschappet/medline-base These examples are Hadoop jobs using Cassandra as the data store. This one is a good place to start: https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/LoadMedline/StartJob.java

    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, outputPath);
    job.setMapperClass(MapperToCassandra.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    LOG.info("Writing output to Cassandra");
    //job.setReducerClass(ReducerToCassandra.class);
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
    //org.apache.cassandra.dht.LocalPartitioner
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");

--Jimmy
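For readers skimming the thread, here is a minimal sketch of what a mapper feeding ColumnFamilyOutputFormat typically looks like. This is not Jimmy's actual MapperToCassandra (that lives in the linked repository); the class, column name and record handling here are made up for illustration, and the Thrift types are the 1.0/1.1-era ones. The contract is that the map output key is the row key as a ByteBuffer and the value is a list of Thrift Mutations.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Collections;
    import java.util.List;

    import org.apache.cassandra.thrift.Column;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical stand-in for the MapperToCassandra referenced above:
    // one input line becomes one column write, keyed by the line offset.
    public class MapperToCassandraSketch
            extends Mapper<LongWritable, Text, ByteBuffer, List<Mutation>> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            Column col = new Column();
            col.setName(ByteBufferUtil.bytes("body"));            // placeholder column name
            col.setValue(ByteBufferUtil.bytes(line.toString()));  // the raw record
            col.setTimestamp(System.currentTimeMillis());

            ColumnOrSuperColumn cosc = new ColumnOrSuperColumn();
            cosc.setColumn(col);
            Mutation mutation = new Mutation();
            mutation.setColumn_or_supercolumn(cosc);

            // ColumnFamilyOutputFormat expects (row key, list of mutations).
            context.write(ByteBufferUtil.bytes(offset.toString()),
                          Collections.singletonList(mutation));
        }
    }

With job.setNumReduceTasks(0) this writes straight from the map phase; if a reducer is used instead, the same (ByteBuffer, List of Mutation) contract applies to the reduce output.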
Re: Pig / Map Reduce on Cassandra
I don't want to write to Cassandra, as it replicates data from another datacenter; I just want to use Hadoop jobs (Pig and Hive) to read data from it. I would like to use the same configuration as http://www.datastax.com/dev/blog/hadoop-mapreduce-in-the-cassandra-cluster but I want to know if there are alternatives to the DataStax Enterprise package. Thanks
Re: Pig / Map Reduce on Cassandra
Try this one then; it reads from Cassandra, then writes back to Cassandra, but you could change the write to wherever you would like. https://github.com/jschappet/medline/blob/master/src/main/java/edu/uiowa/icts/jobs/ProcessXml/StartJob.java

    getConf().set(IN_COLUMN_NAME, columnName);
    Job job = new Job(getConf(), "ProcessRawXml");
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    job.setNumReduceTasks(0);
    job.setJarByClass(StartJob.class);
    job.setMapperClass(ParseMapper.class);
    job.setOutputKeyClass(ByteBuffer.class);
    //job.setOutputValueClass(Text.class);
    job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
    ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    job.setInputFormatClass(ColumnFamilyInputFormat.class);
    ConfigHelper.setRpcPort(job.getConfiguration(), "9160");
    //org.apache.cassandra.dht.LocalPartitioner
    ConfigHelper.setInitialAddress(job.getConfiguration(), "localhost");
    ConfigHelper.setPartitioner(job.getConfiguration(), "org.apache.cassandra.dht.RandomPartitioner");
    ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);
    SlicePredicate predicate = new SlicePredicate()
            .setColumn_names(Arrays.asList(ByteBufferUtil.bytes(columnName)));
    // SliceRange slice_range = new SliceRange();
    // slice_range.setStart(ByteBufferUtil.bytes(startPoint));
    // slice_range.setFinish(ByteBufferUtil.bytes(endPoint));
    // predicate.setSlice_range(slice_range);
    ConfigHelper.setInputSlicePredicate(job.getConfiguration(), predicate);
    job.waitForCompletion(true);
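For context, a minimal sketch of the read side. The real ParseMapper is in the linked repository; this stand-in simply emits name=value pairs, and it uses the 1.0/1.1-era types (IColumn was replaced in later versions), so the exact signature may differ on other releases. With ColumnFamilyInputFormat, each map() call receives one row key plus the columns selected by the configured SlicePredicate.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.SortedMap;

    import org.apache.cassandra.db.IColumn;
    import org.apache.cassandra.utils.ByteBufferUtil;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical stand-in for the ParseMapper referenced above: dumps each
    // selected column of the row as "name=value", keyed by the row key.
    public class ParseMapperSketch
            extends Mapper<ByteBuffer, SortedMap<ByteBuffer, IColumn>, Text, Text> {

        @Override
        protected void map(ByteBuffer rowKey, SortedMap<ByteBuffer, IColumn> columns,
                           Context context) throws IOException, InterruptedException {
            String key = ByteBufferUtil.string(rowKey.duplicate());
            for (IColumn column : columns.values()) {
                String name = ByteBufferUtil.string(column.name());
                String value = ByteBufferUtil.string(column.value());
                context.write(new Text(key), new Text(name + "=" + value));
            }
        }
    }

A job wired to this mapper would typically use a plain file output format; writing back to Cassandra instead means emitting (ByteBuffer, List of Mutation) as in the earlier sketch.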
Re: Pig / Map Reduce on Cassandra
Brisk is pretty much stagnant. I think someone forked it to work with 1.0 but not sure how that is going. You'll need to pay for DSE to get CFS (which is essentially Brisk) if you want to use any modern version of C*. Best, Michael
On 1/16/13 11:17 AM, cscetbon@orange.com wrote: Thanks I understand that your code uses the hadoop interface of Cassandra to be able to read from it with a job. However I would like to know how to bring pieces (hive + pig + hadoop) together with cassandra as the storage layer, not to get code to test it. I have found repository https://github.com/riptano/brisk which might be a good start for it Regards
Re: Pig / Map Reduce on Cassandra
Here is the point: you're right, this GitHub repository has not been updated for a year and a half. I thought Brisk was just a bundle of some technologies and that it was possible to install the same components and make them work together without using this bundle :(
Re: Pig / Map Reduce on Cassandra
On Wed, Jan 16, 2013 at 2:37 PM, cscetbon@orange.com wrote:
> I thought brisk was just a bundle of some technologies and that it was possible to install the same components and make them work together without using this bundle :(
You can install hadoop manually alongside Cassandra, as well as pig. Pig support is in C*'s tree in o.a.c.hadoop.pig. You won't get CFS, but it's not a hard requirement, either. -Brandon
Map Reduce and Cassandra with Trigger patch
I'm having some problems running a Map Reduce program using Cassandra as input. I already wrote some MapRed programs using Cassandra 1.0.9, but now I'm trying with an old version carrying a patch that supports triggers (this one: https://issues.apache.org/jira/browse/CASSANDRA-1311). When I try to run, it throws the following error: 12/11/26 16:59:06 ERROR config.DatabaseDescriptor: Fatal error: Cannot locate cassandra.yaml on the classpath I had this problem before, and the solution was just to add the path of cassandra.yaml to a system property, but now it's not working. I also saw somewhere that one solution would be adding the line set CLASSPATH=%CASSANDRA_HOME%\conf into /bin/cassandra-cli.bat, but it also didn't work. If someone has some idea of what to do I will be really thankful. Regards, Felipe Mathias Schmidt (Computer Science UFRGS, RS, Brazil) "Most people do not listen with the intent to understand; they listen with the intent to reply" - Stephen Covey
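A sketch of the usual workaround, in case it helps: in the versions I have seen, DatabaseDescriptor looks at the cassandra.config system property (expected to be a URL) before falling back to the classpath, so pointing that property at the yaml is the standard fix; verify the property name against the patched source being run. The paths below and the Hadoop child-JVM property are placeholders/assumptions, and whether the driver JVM, the task JVMs, or both need it depends on which side actually touches DatabaseDescriptor.

    import org.apache.hadoop.conf.Configuration;

    public class YamlLocationSketch {
        public static void main(String[] args) {
            // Driver-side JVM: equivalent to passing -Dcassandra.config=... on
            // the command line. The file:// URL form and path are assumptions.
            System.setProperty("cassandra.config",
                    "file:///etc/cassandra/conf/cassandra.yaml");

            // If the error comes from inside the map tasks, the property has to
            // reach the child JVMs too (Hadoop 1.x property name):
            Configuration conf = new Configuration();
            conf.set("mapred.child.java.opts",
                    "-Dcassandra.config=file:///etc/cassandra/conf/cassandra.yaml");
            // ... then build the Job from this conf as in the other examples in
            // this archive.
        }
    }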
Re: Map/Reduce over Cassandra
Hey Bill, A few months ago we did an experiment with 5 Hadoop nodes pulling from 4 Cassandra nodes. It was pulling down one column family with 8 small columns, just dumping the raw data to HDFS. It was cycling through around 17K map tasks per sec. The machines weren't being taxed too hard, so I'm sure there's some concurrency tuning we could have done to speed that up. Unfortunately we don't have that same data on HDFS yet, so I can't really give a direct comparison. Hope that helps. I'm curious what others have seen as well. On Tue, Aug 17, 2010 at 6:59 PM, Bill Hastings bllhasti...@gmail.com wrote: Hi All How performant is M/R on Cassandra when compared to running it on HDFS? Anyone have any numbers they can share? Specifically how much of data the M/R job was run against and what was the throughput etc. Any information would be very helpful. -- Cheers Bill
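For anyone repeating this kind of test, the knobs that usually decide how hard the map tasks drive Cassandra are the per-call row batch on the input side and the Hadoop-side concurrency. A minimal sketch against the old ConfigHelper follows; the method and property names are from the 1.0/1.1-era API and may differ in other versions, and the numbers are arbitrary starting points, not recommendations.

    import org.apache.cassandra.hadoop.ConfigHelper;
    import org.apache.hadoop.conf.Configuration;

    public class ThroughputKnobsSketch {
        public static void configure(Configuration jobConf) {
            // Rows fetched per get_range_slices call by each map task; larger
            // batches mean fewer round trips but more memory per call.
            ConfigHelper.setRangeBatchSize(jobConf, 4096);

            // How many of those map tasks run at once is a cluster-side setting
            // (mapred.tasktracker.map.tasks.maximum in mapred-site.xml on each
            // TaskTracker), not something the job configuration itself can raise.
        }
    }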
Map/Reduce over Cassandra
Hi All, How performant is M/R on Cassandra when compared to running it on HDFS? Anyone have any numbers they can share? Specifically, how much data the M/R job was run against and what the throughput was, etc. Any information would be very helpful. -- Cheers Bill