Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: Consolidating the cluster configuration stuff.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=16&rev2=17

--------------------------------------------------

              SlicePredicate predicate = new 
SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
- Cassandra's splits are location-aware (this is the nature of the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
 design).  Cassandra  gives the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
 a list of locations with each split of data.  That way, the !JobTracker can 
try to preserve data locality when assigning tasks to 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
  Therefore, when using Hadoop alongside Cassandra, it is best to have a 
!TaskTracker running on each Cassandra node.
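The locality handoff described above can be modeled in plain Java. This is a hedged sketch, not the Hadoop API: `Split`, `pickTracker`, and the host names are illustrative stand-ins for the real `InputSplit.getLocations()` / scheduler machinery.

```java
import java.util.*;

// Illustrative model of locality-aware task assignment: each split carries
// the hosts holding its data, and the scheduler prefers a tracker running
// on one of those hosts, falling back to any tracker otherwise.
public class LocalityDemo {
    static class Split {
        final List<String> locations; // hosts holding this split's replicas
        Split(String... hosts) { this.locations = Arrays.asList(hosts); }
    }

    // Pick a tracker for the split, preferring data-local hosts.
    static String pickTracker(Split split, Set<String> trackers) {
        for (String host : split.locations) {
            if (trackers.contains(host)) return host; // data-local assignment
        }
        return trackers.iterator().next(); // no local tracker: any will do
    }

    public static void main(String[] args) {
        Set<String> trackers =
            new LinkedHashSet<>(Arrays.asList("cass1", "cass2", "cass3"));
        Split s = new Split("cass2", "cass3"); // replicas live on cass2, cass3
        System.out.println(pickTracker(s, trackers)); // prints "cass2"
    }
}
```

With a `TaskTracker` on every Cassandra node, the local branch is almost always taken, which is the point of co-locating the daemons.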
  
- As of 0.7, configuration for Hadoop no longer resides in your job's specific 
storage-conf.xml. See the READMEs in the word_count and pig contrib modules for 
more details.
+ As of 0.7, configuration for Hadoop no longer resides in your job's specific 
storage-conf.xml. See the `README` files in the word_count and pig contrib 
modules for more details.
  
  ==== Output To Cassandra ====
  
@@ -78, +77 @@

  
  == Cluster Configuration ==
  
- If you would like to configure a Cassandra cluster so that Hadoop may operate 
over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. 
 You'll want to have a separate server for your Hadoop `namenode`/`jobtracker`. 
 Then install Hadoop `tasktracker`s on each of your Cassandra nodes.  That will 
allow the `jobtracker` to assign tasks to the Cassandra nodes that contain data 
for those tasks.  At least one node in your cluster will also need to be a 
`datanode`.  That's because Hadoop uses HDFS to store information like jar 
dependencies for your job, static data (like stop words for a word count), and 
things like that - it's the distributed cache.  It's a very small amount of 
data but the Hadoop cluster needs it to run properly.
+ If you would like to configure a Cassandra cluster so that Hadoop may operate 
over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. 
 You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. 
 Then install a Hadoop `TaskTracker` on each of your Cassandra nodes.  That 
will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain 
data for those tasks.  At least one node in your cluster will also need to be a 
`DataNode`, because Hadoop uses HDFS to store information your jobs need to 
run: jar dependencies, static data (such as stop words for a word count), and 
so on.  This is the distributed cache.  It's a very small amount of data, but 
the Hadoop cluster needs it to run properly.
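For example, each Cassandra node's `mapred-site.xml` might point its `TaskTracker` at the dedicated `JobTracker` server. This is a sketch assuming the 0.20-era `mapred.job.tracker` key; the host name `hadoop-master` and port are illustrative.

```xml
<!-- mapred-site.xml on each Cassandra node (host name is illustrative) -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:8021</value>
  </property>
</configuration>
```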
  
- The nice thing about having `tasktracker`s on every node is that 1, you get 
data locality and 2, your analytics engine scales with your data.
+ The nice thing about having a `TaskTracker` on every node is twofold: you 
get data locality, and your analytics engine scales with your data.
  
  [[#Top|Top]]
  
