Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: Trying to update the hadoop support page with 
more recent info + more structure for linking.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=14&rev2=15

--------------------------------------------------

+ <<Anchor(Top)>>
+ 
+ == Contents ==
+  * [[#Overview|Overview]]
+  * [[#MapReduce|MapReduce Support]]
+  * [[#Pig|Pig Support]]
+  * [[#Hive|Hive Support]]
+ 
+ <<Anchor(Overview)>>
+ 
  == Overview ==
- Cassandra version 0.6 and later enable certain Hadoop functionality against 
Cassandra's data store.  Specifically, support has been added for 
[[http://hadoop.apache.org/mapreduce/|MapReduce]] and 
[[http://hadoop.apache.org/pig/|Pig]].
+ Cassandra version 0.6 and later enable certain Hadoop functionality against 
Cassandra's data store.  Specifically, support has been added for 
[[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://hadoop.apache.org/pig/|Pig]] and [[http://hive.apache.org/|Hive]].
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(MapReduce)>>
  
  == MapReduce ==
+ 
+ ==== Input from Cassandra ====
- While writing output to Cassandra has always been possible by implementing 
certain interfaces from the Hadoop library, version 0.6 of Cassandra added 
support for retrieving data from Cassandra.  Cassandra 0.6 adds implementations 
of 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
 and 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
 so that Hadoop [[http://hadoop.apache.org/mapreduce/|MapReduce]] jobs can 
retrieve data from Cassandra.  For an example of how this works, see the 
contrib/word_count example in 0.6 or later.  Cassandra rows or row  fragments 
(that is, pairs of key + `SortedMap`  of columns) are input to Map tasks for  
processing by your job, as specified by a `SlicePredicate`  that describes 
which columns to fetch from each row.
+ Cassandra 0.6 and later add support for retrieving data from Cassandra.  
This is based on implementations of 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]],
 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputFormat.html|InputFormat]],
 and 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader]]
 so that Hadoop MapReduce jobs can read data directly from Cassandra.  For an 
example of how this works, see the contrib/word_count example in 0.6 or later.  
Cassandra rows or row fragments (that is, pairs of key + `SortedMap` of 
columns) are input to Map tasks for processing by your job, as specified by a 
`SlicePredicate` that describes which columns to fetch from each row.
  
  Here's how this looks in the word_count example, which selects just one 
configurable columnName from each row:
  
@@ -13, +29 @@

              SlicePredicate predicate = new 
SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
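+ As a rough illustration of how that predicate fits into a complete job 
setup, the following sketch (not taken from the wiki page) uses the 0.6-era 
`ConfigHelper` and `ColumnFamilyInputFormat` classes from the 
`org.apache.cassandra.hadoop` package; the keyspace and column family names 
are hypothetical, and the exact `ConfigHelper` method names changed between 
releases, so consult the word_count example shipped with your version:
+ {{{
import java.util.Arrays;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.hadoop.mapreduce.Job;

// Sketch: configure a MapReduce job to read one column from each row.
Job job = new Job(getConf(), "wordcount");
job.setInputFormatClass(ColumnFamilyInputFormat.class);

// Tell the input format which keyspace/column family to read
// (hypothetical names; method names follow the 0.6-era ConfigHelper).
ConfigHelper.setColumnFamily(job.getConfiguration(), "Keyspace1", "Standard1");

// Fetch just the one column we care about from every row.
SlicePredicate predicate = new SlicePredicate()
        .setColumn_names(Arrays.asList(columnName.getBytes()));
ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
}}}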
- Cassandra's splits are location-aware (this is the nature of the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
 design).  Cassandra  gives the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
 a list of locations with each split of data.  That way, the !JobTracker can 
try to preserve data locality when  assigning tasks to 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
  Therefore, when using Hadoop alongside  Cassandra, it is best to have a 
!TaskTracker running on the same node as  the Cassandra nodes, if data locality 
while processing is desired and to  minimize copying data between Cassandra and 
Hadoop nodes.
+ Cassandra's splits are location-aware (this is the nature of the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
 design).  Cassandra  gives the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
 a list of locations with each split of data.  That way, the !JobTracker can 
try to preserve data locality when assigning tasks to 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
  Therefore, when using Hadoop alongside Cassandra, it is best to have a 
!TaskTracker running on each Cassandra node.
  
- As of 0.7, there will be a basic mechanism included in Cassandra for  
outputting data to cassandra.  See 
[[https://issues.apache.org/jira/browse/CASSANDRA-1101|CASSANDRA-1101]]  for 
details.
+ As of 0.7, Hadoop job configuration no longer resides in your job's 
storage-conf.xml. See the READMEs in the word_count and pig contrib modules 
for more details.
+ 
+ ==== Output to Cassandra ====
+ 
+ As of 0.7, there is a basic mechanism included in Cassandra for outputting 
data to Cassandra.  The contrib/word_count example in 0.7 contains two reducers 
- one that writes output to the filesystem (the default) and one that writes 
to Cassandra using this new mechanism.  See that example in the latest release 
for details.
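+ As an illustrative sketch (not part of the contrib example itself), wiring 
a job's output side to Cassandra might look like the following, assuming the 
`ColumnFamilyOutputFormat` class added by CASSANDRA-1101; the keyspace and 
column family names are hypothetical, so check the 0.7 word_count reducer for 
the exact `ConfigHelper` method names:
+ {{{
import org.apache.cassandra.hadoop.ColumnFamilyOutputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.hadoop.mapreduce.Job;

// Sketch: direct the job's reduce output at a Cassandra column family
// instead of the filesystem.
Job job = new Job(getConf(), "wordcount");
job.setOutputFormatClass(ColumnFamilyOutputFormat.class);

// Hypothetical keyspace/column family names for illustration.
ConfigHelper.setOutputColumnFamily(job.getConfiguration(), "Keyspace1", "WordCounts");
}}}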
+ 
+ ==== Hadoop Streaming ====
+ 
+ As of 0.7, there is support for 
[[http://hadoop.apache.org/common/docs/r0.20.0/streaming.html|Hadoop 
Streaming]].  For examples of how to use Streaming with Cassandra, see the 
contrib section of the Cassandra source.  The relevant tickets are 
[[https://issues.apache.org/jira/browse/CASSANDRA-1368|CASSANDRA-1368]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-1497|CASSANDRA-1497]].
+ 
+ ==== Troubleshooting ====
  
  Releases before 0.6.2/0.7 are affected by a small resource leak: connections 
are not released properly, which may cause jobs to fail.  Depending on your 
local setup you may hit this issue; a workaround is to raise the limit of open 
file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`). 
The error will be reported on the Hadoop job side as a thrift 
!TimedOutException.
  
@@ -30, +56 @@

  {{{
               ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
  }}}
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(Pig)>>
+ 
  == Pig ==
- Cassandra 0.6 also adds support for [[http://hadoop.apache.org/pig/|Pig]] 
with its own implementation of 
[[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the contrib/pig example in 0.6 and later.
+ Cassandra 0.6+ also adds support for [[http://hadoop.apache.org/pig/|Pig]] 
with its own implementation of 
[[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the contrib/pig example in 0.6 and later.
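+ To give a flavor of what this enables, here is an illustrative Pig Latin 
sketch (not taken from the contrib/pig example); the `cassandra://` URL scheme 
and the `org.apache.cassandra.hadoop.pig.CassandraStorage` class path vary by 
release, so check the contrib/pig README for your version:
+ {{{
-- Illustrative only: load rows from a column family and count them.
rows = LOAD 'cassandra://Keyspace1/Standard1'
       USING org.apache.cassandra.hadoop.pig.CassandraStorage();
grouped = GROUP rows ALL;
total = FOREACH grouped GENERATE COUNT(rows);
DUMP total;
}}}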
+ 
+ [[#Top|Top]]
+ 
+ <<Anchor(Hive)>>
  
  == Hive ==
- Work is being done to add Hive support - see 
[[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-913|CASSANDRA-913]]
+ Work is being finalized to add support for Hive - see 
[[https://issues.apache.org/jira/browse/HIVE-1434|HIVE-1434]].
  
+ [[#Top|Top]]
+ 
