Dear Wiki user, You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for change notification.
The "HadoopSupport" page has been changed by jeremyhanna.
The comment on this change is: adding an initial cluster configuration section..
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=15&rev2=16

--------------------------------------------------

   * [[#MapReduce|MapReduce Support]]
   * [[#Pig|Pig Support]]
   * [[#Hive|Hive Support]]
+  * [[#ClusterConfig|Cluster Configuration]]

  <<Anchor(Overview)>>

@@ -73, +74 @@

  [[#Top|Top]]

+ <<Anchor(ClusterConfig)>>
+
+ == Cluster Configuration ==
+
+ If you would like to configure a Cassandra cluster so that Hadoop may operate over its data, it's best to overlay a Hadoop cluster over your Cassandra nodes. You'll want to have a separate server for your Hadoop `namenode`/`jobtracker`. Then install Hadoop `tasktracker`s on each of your Cassandra nodes. That will allow the `jobtracker` to assign tasks to the Cassandra nodes that contain data for those tasks. At least one node in your cluster will also need to be a `datanode`. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly.
+
+ The nice thing about having `tasktracker`s on every node is that 1, you get data locality and 2, your analytics engine scales with your data.
+
+ [[#Top|Top]]
+
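
The layout described in the new Cluster Configuration section can be sketched as a small Python model. This is only an illustration under stated assumptions, not Hadoop's actual scheduler or any Cassandra API: the host names, the `build_overlay` and `assign_task` helpers, and the "least loaded local node" rule are all hypothetical. It shows the role assignment (a separate `namenode`/`jobtracker` server, a `tasktracker` on every Cassandra node, at least one `datanode`) and why co-located `tasktracker`s give data locality.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    roles: set = field(default_factory=set)


def build_overlay(cassandra_hosts, master_host):
    """Assign Hadoop roles on top of an existing Cassandra cluster (toy model)."""
    if not cassandra_hosts:
        raise ValueError("need at least one Cassandra node")
    if master_host in cassandra_hosts:
        raise ValueError("namenode/jobtracker should be a separate server")
    # The master runs the namenode and jobtracker, not Cassandra.
    nodes = {master_host: Node(master_host, {"namenode", "jobtracker"})}
    for i, host in enumerate(cassandra_hosts):
        roles = {"cassandra", "tasktracker"}  # tasktracker on every Cassandra node
        if i == 0:
            roles.add("datanode")  # at least one datanode for the distributed cache
        nodes[host] = Node(host, roles)
    return nodes


def assign_task(split_replicas, nodes, load):
    """Pick a tasktracker for one input split, preferring nodes that hold its data.

    split_replicas: hosts that store the split's rows (hypothetical input).
    load: dict of host -> number of tasks already assigned; updated in place.
    Returns (chosen_host, was_data_local).
    """
    trackers = sorted(n for n, node in nodes.items() if "tasktracker" in node.roles)
    local = [h for h in split_replicas if h in trackers]
    candidates = local or trackers  # fall back to any tasktracker if none is local
    chosen = min(candidates, key=lambda h: (load.get(h, 0), h))
    load[chosen] = load.get(chosen, 0) + 1
    return chosen, bool(local)


if __name__ == "__main__":
    cluster = build_overlay(["cass1", "cass2", "cass3"], "hadoop-master")
    load = {}
    for replicas in (["cass2"], ["cass2", "cass3"], ["cass1", "cass2"]):
        print(replicas, "->", assign_task(replicas, cluster, load))
```

Because every Cassandra node carries a `tasktracker` in this model, adding a Cassandra node also adds a place to run tasks, which is the "analytics engine scales with your data" point made above.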
