Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=40&rev2=41 Comment: Added a section on Oozie and another cluster configuration note about speculative execution. * [[#MapReduce|MapReduce]] * [[#Pig|Pig]] * [[#Hive|Hive]] + * [[#Oozie|Oozie]] * [[#ClusterConfig|Cluster Configuration]] * [[#Troubleshooting|Troubleshooting]] * [[#Support|Support]] @@ -16, +17 @@ == Overview == Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]], and [[http://hive.apache.org/|Hive]]. - [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However this code is no longer going to be maintained by DataStax. Future development of Brisk is now part of a pay-for offering. + [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code will no longer be maintained by DataStax; future DataStax development of Brisk is part of a pay-for offering. [[#Top|Top]] @@ -34, +35 @@ ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate); }}} As of 0.7, configuration for Hadoop no longer resides in your job's storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details. + ==== Output To Cassandra ==== As of 0.7, Cassandra includes a basic mechanism for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one for outputting data to Cassandra (the default) using this new mechanism.
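As a rough sketch of what such a job setup can look like (this assumes the 0.7 contrib API and that the Cassandra jars are on the classpath; `KEYSPACE`, `COLUMN_FAMILY`, `OUTPUT_COLUMN_FAMILY`, and the column name `"text"` are placeholders, and exact method names may differ between releases - consult the `word_count` README for your release):
{{{
// Sketch only: assumes the Cassandra 0.7 hadoop/thrift contrib classes
// (org.apache.cassandra.hadoop.*, org.apache.cassandra.thrift.SlicePredicate,
// org.apache.cassandra.utils.ByteBufferUtil) plus java.util.Arrays and
// org.apache.hadoop.mapreduce.Job are available.
Job job = new Job(getConf(), "wordcount");

// Read input rows from Cassandra rather than HDFS.
job.setInputFormatClass(ColumnFamilyInputFormat.class);
ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);

// Restrict each row to the single placeholder column "text".
SlicePredicate predicate = new SlicePredicate()
    .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

// Write reducer output back to Cassandra instead of the filesystem.
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
}}}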
See that example in the latest release for details. @@ -65, +67 @@ [[#Top|Top]] + <<Anchor(Oozie)>> + + == Oozie == + [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the Oozie action configuration like so:
+ {{{
+ <property>
+ <name>cassandra.thrift.address</name>
+ <value>${cassandraHost}</value>
+ </property>
+ <property>
+ <name>cassandra.thrift.port</name>
+ <value>${cassandraPort}</value>
+ </property>
+ <property>
+ <name>cassandra.partitioner.class</name>
+ <value>org.apache.cassandra.dht.RandomPartitioner</value>
+ </property>
+ <property>
+ <name>cassandra.consistencylevel.read</name>
+ <value>${cassandraReadConsistencyLevel}</value>
+ </property>
+ <property>
+ <name>cassandra.consistencylevel.write</name>
+ <value>${cassandraWriteConsistencyLevel}</value>
+ </property>
+ <property>
+ <name>cassandra.range.batch.size</name>
+ <value>${cassandraRangeBatchSize}</value>
+ </property>
+ }}}
+ Note that with Oozie you can specify values outright (like the partitioner here) or via a variable (e.g. `${cassandraHost}`) that is typically defined in the job's properties file.
+ One other item of note: Oozie assumes that it can detect a file marker signaling successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed, but the Oozie job that called it will fail because file markers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify the following property to avoid that check. Oozie will still get completion updates via a callback from the job tracker; it just won't look for the file marker.
+ {{{
+ <property>
+ <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
+ <value>false</value>
+ </property>
+ }}}
+
+ [[#Top|Top]]
+ <<Anchor(ClusterConfig)>> == Cluster Configuration == @@ -74, +117 @@ Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node; Hadoop requires a distributed filesystem for storing dependency jars, static data, and intermediate results. The nice thing about having a `TaskTracker` on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle around your data once you've performed analytics on it - you simply output to Cassandra and can then access that data with high random-read performance. + + A note on speculative execution: you may want to disable speculative execution for your Hadoop jobs that either read from or write to Cassandra - for example, by setting `mapred.map.tasks.speculative.execution` and `mapred.reduce.tasks.speculative.execution` to `false`. This isn't required, but it may help reduce unnecessary load. One configuration note on getting the task trackers to be able to perform queries over Cassandra: you'll want to update the `HADOOP_CLASSPATH` in your `<hadoop>/conf/hadoop-env.sh` to include the libraries in Cassandra's `lib` directory. For example, you'll want to do something like this in the `hadoop-env.sh` on each of your task trackers: