Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna: http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=40&rev2=41 Comment: Added a section on Oozie and another cluster configuration note about speculative execution. * [[#MapReduce|MapReduce]] * [[#Pig|Pig]] * [[#Hive|Hive]] + * [[#Oozie|Oozie]] * [[#ClusterConfig|Cluster Configuration]] * [[#Troubleshooting|Troubleshooting]] * [[#Support|Support]] @@ -16, +17 @@ == Overview == Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] functionality against Cassandra's data store. Specifically, support has been added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], [[http://pig.apache.org|Pig]], and [[http://hive.apache.org/|Hive]]. - [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However this code is no longer going to be maintained by DataStax. Future development of Brisk is now part of a pay-for offering. + [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code will no longer be maintained by DataStax; future DataStax development of Brisk is part of a pay-for offering. [[#Top|Top]] @@ -34, +35 @@ ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate); }}} As of 0.7, configuration for Hadoop no longer resides in your job's storage-conf.xml. See the `README` in the `word_count` and `pig` contrib modules for more details. + ==== Output To Cassandra ==== As of 0.7, Cassandra includes a basic mechanism for outputting data to Cassandra. The `contrib/word_count` example in 0.7 contains two reducers - one for outputting data to the filesystem and one for outputting data to Cassandra (the default) using this new mechanism.
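As a rough sketch of what such a job setup can look like (this assumes the 0.7 contrib API and that the Cassandra jars are on the classpath; `KEYSPACE`, `COLUMN_FAMILY`, `OUTPUT_COLUMN_FAMILY`, and the column name `"text"` are placeholders, and exact method names may differ between releases - consult the `word_count` README for your release):
{{{
// Sketch only: assumes the Cassandra 0.7 hadoop/thrift contrib classes
// (org.apache.cassandra.hadoop.*, org.apache.cassandra.thrift.SlicePredicate,
// org.apache.cassandra.utils.ByteBufferUtil) plus java.util.Arrays and
// org.apache.hadoop.mapreduce.Job are available.
Job job = new Job(getConf(), "wordcount");

// Read input rows from Cassandra rather than HDFS.
job.setInputFormatClass(ColumnFamilyInputFormat.class);
ConfigHelper.setColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY);

// Restrict each row to the single placeholder column "text".
SlicePredicate predicate = new SlicePredicate()
    .setColumn_names(Arrays.asList(ByteBufferUtil.bytes("text")));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);

// Write reducer output back to Cassandra instead of the filesystem.
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), KEYSPACE, OUTPUT_COLUMN_FAMILY);
}}}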
See that example in the latest release for details. @@ -65, +67 @@ [[#Top|Top]] + <<Anchor(Oozie)>> + + == Oozie == + [[http://incubator.apache.org/oozie/|Oozie]], the open-source workflow engine originally from Yahoo!, can be used with Cassandra/Hadoop. Cassandra configuration information needs to go into the Oozie action configuration like so:
+ {{{
+ <property>
+ <name>cassandra.thrift.address</name>
+ <value>${cassandraHost}</value>
+ </property>
+ <property>
+ <name>cassandra.thrift.port</name>
+ <value>${cassandraPort}</value>
+ </property>
+ <property>
+ <name>cassandra.partitioner.class</name>
+ <value>org.apache.cassandra.dht.RandomPartitioner</value>
+ </property>
+ <property>
+ <name>cassandra.consistencylevel.read</name>
+ <value>${cassandraReadConsistencyLevel}</value>
+ </property>
+ <property>
+ <name>cassandra.consistencylevel.write</name>
+ <value>${cassandraWriteConsistencyLevel}</value>
+ </property>
+ <property>
+ <name>cassandra.range.batch.size</name>
+ <value>${cassandraRangeBatchSize}</value>
+ </property>
+ }}}
+ Note that with Oozie you can specify values outright (like the partitioner here) or via a variable (e.g. `${cassandraHost}`) that is typically defined in the job's properties file.
+ One other item of note: Oozie assumes that it can detect a file marker signaling successful completion of the job. This means that when writing to Cassandra with, for example, Pig, the Pig script will succeed, but the Oozie job that called it will fail because file markers aren't written to Cassandra. So when you write to Cassandra with Hadoop, specify the following property to avoid that check. Oozie will still get completion updates via a callback from the job tracker; it just won't look for the file marker.
+ {{{
+ <property>
+ <name>mapreduce.fileoutputcommitter.marksuccessfuljobs</name>
+ <value>false</value>
+ </property>
+ }}}
+
+ [[#Top|Top]]
+ <<Anchor(ClusterConfig)>> == Cluster Configuration == @@ -74, +117 @@ Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node; Hadoop requires a distributed filesystem for storing dependency jars, static data, and intermediate results. The nice thing about having a `TaskTracker` on every node is that you get data locality and your analytics engine scales with your data. You also never need to shuttle around your data once you've performed analytics on it - you simply output to Cassandra and can then access that data with high random-read performance. + + A note on speculative execution: you may want to disable speculative execution for your Hadoop jobs that either read from or write to Cassandra - for example, by setting `mapred.map.tasks.speculative.execution` and `mapred.reduce.tasks.speculative.execution` to `false`. This isn't required, but it may help reduce unnecessary load. One configuration note on getting the task trackers to be able to perform queries over Cassandra: you'll want to update the `HADOOP_CLASSPATH` in your `<hadoop>/conf/hadoop-env.sh` to include the libraries in Cassandra's `lib` directory. For example, you'll want to do something like this in the `hadoop-env.sh` on each of your task trackers: