Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=37&rev2=38

Comment:
Updated cluster config section. Took out the single-datanode suggestion, as intermediate results realistically need more than one datanode. Added a bit about Brisk.

<<Anchor(ClusterConfig)>>
== Cluster Configuration ==
- If you would like to configure a Cassandra cluster so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop namenode/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes. That will allow the `Jobtracker` to assign tasks to the Cassandra nodes that contain data for those tasks. At least one node in your cluster will also need to be a datanode. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly.
+ The simplest way to configure your cluster to run Cassandra with Hadoop is to use Brisk, the open-source packaging of Cassandra with Hadoop. Brisk starts the `JobTracker` and `TaskTracker` processes for you. It also uses CFS, an HDFS-compatible distributed filesystem built on Cassandra that removes the need for the Hadoop `NameNode` and `DataNode` processes. For details, see the Brisk [[http://www.datastax.com/docs/0.8/brisk/index|documentation]] and [[http://github.com/riptano/brisk|code]].
+ 
+ Otherwise, if you would like to configure a Cassandra cluster yourself so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop `NameNode`/`JobTracker`. Then install a Hadoop `TaskTracker` on each of your Cassandra nodes.
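As a minimal sketch of the overlay described above (assuming a Hadoop 0.20-era configuration; `hadoop-master` is a hypothetical hostname for the dedicated `NameNode`/`JobTracker` server), each Cassandra node's `TaskTracker` would be pointed at the shared `JobTracker` via `conf/mapred-site.xml`:

```xml
<!-- conf/mapred-site.xml on every Cassandra node.
     "hadoop-master" is a hypothetical hostname for the separate
     server running the NameNode and JobTracker. -->
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>hadoop-master:8021</value>
  </property>
</configuration>
```

With that in place, starting a `TaskTracker` on each Cassandra node (`bin/hadoop-daemon.sh start tasktracker`) registers it with the `JobTracker`, which can then schedule tasks on the nodes holding the relevant Cassandra data.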
That will allow the `JobTracker` to assign tasks to the Cassandra nodes that contain data for those tasks. Also install a Hadoop `DataNode` on each Cassandra node. Hadoop requires a distributed filesystem for storing dependency jars, static data, and intermediate results. The nice thing about having a `TaskTracker` on every node is that you get data locality, and your analytics engine scales with your data. You also never need to shuttle your data around once you've performed analytics on it - you simply output to Cassandra and can access that data with high random-read performance.

@@ -79, +82 @@

  }}}
  
  ==== Virtual Datacenter ====
  One thing that many have asked about is whether Cassandra with Hadoop will be usable from a random-access perspective. For example, you may need to use Cassandra for serving web-latency requests, but you may also need to run analytics over your data. In Cassandra 0.7+ there is the !NetworkTopologyStrategy, which allows you to customize your cluster's replication strategy by datacenter. With this you can create a 'virtual datacenter' that separates nodes serving data with high random-read performance from nodes meant for analytics. You need to have a snitch configured with your topology; then, according to the datacenters defined there (either explicitly or implicitly), you can indicate how many replicas you would like in each datacenter. You would install task trackers on the nodes in your analytics section and make sure that a replica is written to that 'datacenter' in your !NetworkTopologyStrategy configuration. The practical upshot of this is that your analytics nodes always have current data, and your high random-read performance nodes always serve data with predictable performance.
- 
- For an example of configuring Cassandra with Hadoop in the cloud, see the [[http://github.com/digitalreasoning/PyStratus|PyStratus]] project on Github.
  
  [[#Top|Top]]
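As a hedged sketch of the virtual-datacenter idea (the IP addresses and the datacenter names `LIVE` and `ANALYTICS` below are hypothetical), a `PropertyFileSnitch` topology file might split the cluster like this:

```
# conf/cassandra-topology.properties (read by PropertyFileSnitch)
# Nodes serving low-latency random reads:
10.0.0.1=LIVE:RAC1
10.0.0.2=LIVE:RAC1
# Nodes running Hadoop TaskTrackers for analytics:
10.0.1.1=ANALYTICS:RAC1
10.0.1.2=ANALYTICS:RAC1
# Fallback for nodes not listed above:
default=LIVE:RAC1
```

A keyspace would then use !NetworkTopologyStrategy with per-datacenter replica counts along the lines of `strategy_options = {LIVE:2, ANALYTICS:1}` (the exact syntax depends on your client or CLI version), so every write lands on at least one replica in the analytics 'datacenter'.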
