[Cassandra Wiki] Trivial Update of "HadoopSupport" by jeremyhanna

Apache Wiki Tue, 01 Mar 2011 16:09:20 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "HadoopSupport" page has been changed by jeremyhanna.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=21&rev2=22

--------------------------------------------------

  }}}
  
  ==== Virtual Datacenter ====
- One thing that many have asked about is whether Cassandra with Hadoop will be 
usable from a random access perspective. For example, you may need to use 
Cassandra for serving web latency requests. You may also need to run analytics 
over your data. In Cassandra 0.7+ there is the NetworkTopologyStrategy which 
allows you to customize your cluster's replication strategy by datacenter. What 
you can do with this is create a 'virtual datacenter' to separate nodes that 
serve data with high random-read performance from nodes that are meant to be 
used for analytics. You need to have a snitch configured with your topology and 
then according to the datacenters defined there (either explicitly or 
implicitly), you can indicate how many replicas you would like in each 
datacenter. You would install task trackers on nodes in your analytics section 
and make sure that a replica is written to that 'datacenter' in your 
NetworkTopologyStrategy configuration. The practical upshot of this is your 
analytics nodes always have current data and your high random-read performance 
nodes always serve data with predictable performance.
+ One thing that many have asked about is whether Cassandra with Hadoop will be 
usable from a random access perspective. For example, you may need to use 
Cassandra for serving web latency requests. You may also need to run analytics 
over your data. In Cassandra 0.7+ there is the !NetworkTopologyStrategy which 
allows you to customize your cluster's replication strategy by datacenter. What 
you can do with this is create a 'virtual datacenter' to separate nodes that 
serve data with high random-read performance from nodes that are meant to be 
used for analytics. You need to have a snitch configured with your topology and 
then according to the datacenters defined there (either explicitly or 
implicitly), you can indicate how many replicas you would like in each 
datacenter. You would install task trackers on nodes in your analytics section 
and make sure that a replica is written to that 'datacenter' in your 
!NetworkTopologyStrategy configuration. The practical upshot of this is your 
analytics nodes always have current data and your high random-read performance 
nodes always serve data with predictable performance.
  
  For an example of configuring Cassandra with Hadoop in the cloud, see the 
[[http://github.com/digitalreasoning/PyStratus|PyStratus]] project on Github.
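
The 'virtual datacenter' idea in the paragraph above can be illustrated with a small toy model. This is only a sketch of the general principle (walk the ring and give each datacenter its requested number of replicas); it is not Cassandra's actual placement code. The node names, the datacenter labels "Serving" and "Analytics", the replica counts, and the function `place_replicas` are all hypothetical and chosen for illustration.

```python
from collections import defaultdict

# Hypothetical topology: node name -> datacenter label, roughly what a
# snitch would report. "Analytics" nodes are the ones running task trackers.
TOPOLOGY = {
    "node1": "Serving",
    "node2": "Serving",
    "node3": "Serving",
    "node4": "Analytics",
    "node5": "Analytics",
}

# Hypothetical replication settings: how many replicas each datacenter gets.
REPLICAS_PER_DC = {"Serving": 2, "Analytics": 1}

# Hypothetical ring order of the nodes.
RING = ["node1", "node4", "node2", "node5", "node3"]


def place_replicas(ring, topology, replicas_per_dc, start):
    """Walk the ring once from index `start`, picking nodes until every
    datacenter has as many replicas as requested (simplified model)."""
    chosen = defaultdict(list)
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)]
        dc = topology[node]
        # Take this node only if its datacenter still needs replicas.
        if len(chosen[dc]) < replicas_per_dc.get(dc, 0):
            chosen[dc].append(node)
    return dict(chosen)


if __name__ == "__main__":
    print(place_replicas(RING, TOPOLOGY, REPLICAS_PER_DC, start=0))
```

In this model every row ends up with one copy on an analytics node and two copies on serving nodes, which is the separation the paragraph describes: Hadoop jobs read from the analytics replicas while the serving nodes handle random reads.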
