Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=47&rev2=48

Comment:
Removing brisk from cluster config as it will now only confuse people.

  <<Anchor(ClusterConfig)>>
  
  == Cluster Configuration ==
- The simplest way to configure your cluster to run Cassandra with Hadoop is to 
use Brisk, the open-source packaging of Cassandra with Hadoop.  That will start 
the `JobTracker` and `TaskTracker` processes for you.  It also uses CFS, an 
HDFS compatible distributed filesystem built on Cassandra that removes the need 
for a Hadoop `NameNode` and `DataNode` processes.  For details, see the Brisk 
[[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and 
[[http://github.com/riptano/brisk|code]]
+ If you would like to configure a Cassandra cluster yourself so that Hadoop 
may operate over its data, it's best to overlay a Hadoop cluster on your 
Cassandra nodes.  You'll want a separate server for your Hadoop 
`NameNode`/`JobTracker`.  Then install a Hadoop `TaskTracker` on each of your 
Cassandra nodes, so the `JobTracker` can assign tasks to the Cassandra nodes 
that hold the data those tasks need.  Also install a Hadoop `DataNode` on each 
Cassandra node, since Hadoop needs a distributed filesystem in which to store 
dependency jars, static data, and intermediate results.
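+ As a rough sketch of what a job looks like once the cluster is laid out this 
way, the Java snippet below configures a Hadoop job to read from Cassandra via 
`ColumnFamilyInputFormat`.  The keyspace, column family, and host names are 
placeholders, and the exact `ConfigHelper` methods vary between Cassandra 
versions; this follows the 0.8-era API used by the `word_count` example that 
ships with Cassandra.
+ {{{
+ import java.nio.ByteBuffer;
+ 
+ import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
+ import org.apache.cassandra.hadoop.ConfigHelper;
+ import org.apache.cassandra.thrift.SlicePredicate;
+ import org.apache.cassandra.thrift.SliceRange;
+ import org.apache.cassandra.utils.ByteBufferUtil;
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.mapreduce.Job;
+ 
+ public class CassandraJobSetup {
+     public static void main(String[] args) throws Exception {
+         Job job = new Job(new Configuration(), "cassandra-analytics");
+         job.setInputFormatClass(ColumnFamilyInputFormat.class);
+ 
+         Configuration conf = job.getConfiguration();
+         ConfigHelper.setRpcPort(conf, "9160");
+         // Any live Cassandra node; used to discover the ring.
+         ConfigHelper.setInitialAddress(conf, "cassandra-node-1");
+         ConfigHelper.setPartitioner(conf,
+             "org.apache.cassandra.dht.RandomPartitioner");
+         ConfigHelper.setInputColumnFamily(conf, "MyKeyspace", "MyColumnFamily");
+ 
+         // Read up to the first 100 columns of each row.
+         SlicePredicate predicate = new SlicePredicate().setSlice_range(
+             new SliceRange(ByteBufferUtil.EMPTY_BYTE_BUFFER,
+                            ByteBufferUtil.EMPTY_BYTE_BUFFER, false, 100));
+         ConfigHelper.setInputSlicePredicate(conf, predicate);
+     }
+ }
+ }}}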
  
- Otherwise, if you would like to configure a Cassandra cluster yourself so 
that Hadoop may operate over its data, it's best to overlay a Hadoop cluster 
over your Cassandra nodes.  You'll want to have a separate server for your 
Hadoop `NameNode`/`JobTracker`.  Then install a Hadoop `TaskTracker` on each of 
your Cassandra nodes.  That will allow the `JobTracker` to assign tasks to the 
Cassandra nodes that contain data for those tasks.  Also install a Hadoop 
`DataNode` on each Cassandra node.  Hadoop requires a distributed filesystem 
for copying dependency jars, static data, and intermediate results to be stored.
- 
- The nice thing about having a `TaskTracker` on every node is that you get 
data locality and your analytics engine scales with your data. You also never 
need to shuttle around your data once you've performed analytics on it - you 
simply output to Cassandra and you are able to access that data with high 
random-read performance.
+ The nice thing about having a `TaskTracker` on every node is that you get 
data locality and your analytics engine scales with your data. You also never 
need to shuttle your data around once you've performed analytics on it: you 
simply write the output to Cassandra and can read it back with high 
random-read performance. Note that Cassandra achieves data locality the same 
way HDFS does: its Hadoop input splits report which nodes hold the data, so 
the `JobTracker` can schedule tasks on those nodes.
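+ Continuing the driver sketch above, writing results back to Cassandra takes 
only a few more lines; the reducer then emits a row key plus a list of 
mutations per row, as the bundled `word_count` example does.  `MyResults` is a 
placeholder column family.
+ {{{
+ // Write reducer output straight back to Cassandra instead of HDFS.
+ job.setOutputFormatClass(ColumnFamilyOutputFormat.class);
+ job.setOutputKeyClass(ByteBuffer.class);  // row key
+ job.setOutputValueClass(List.class);      // List<Mutation> for that row
+ ConfigHelper.setOutputColumnFamily(job.getConfiguration(),
+                                    "MyKeyspace", "MyResults");
+ }}}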
  
  A note on speculative execution: you may want to disable speculative 
execution for Hadoop jobs that either read from or write to Cassandra.  This 
isn't required, but it may help reduce unnecessary load.
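+ With the 0.20-era Hadoop API this is a per-job setting; a minimal sketch:
+ {{{
+ // Stop Hadoop from launching duplicate (speculative) attempts that would
+ // read from or write to Cassandra a second time.
+ Configuration conf = job.getConfiguration();
+ conf.setBoolean("mapred.map.tasks.speculative.execution", false);
+ conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
+ }}}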
  
