Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=39&rev2=40

Comment:
Adding more troubleshooting information and a caveat to OSS Brisk in the main 
description

  == Overview ==
  Cassandra 0.6+ enables certain [[http://hadoop.apache.org/|Hadoop]] 
functionality against Cassandra's data store.  Specifically, support has been 
added for [[http://hadoop.apache.org/mapreduce/|MapReduce]], 
[[http://pig.apache.org|Pig]] and [[http://hive.apache.org/|Hive]].
  
- [[http://datastax.com|DataStax]] has open-sourced a Cassandra based Hadoop 
distribution called Brisk. 
([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) 
([[http://github.com/riptano/brisk|Code]]) 
+ [[http://datastax.com|DataStax]] has open-sourced a Cassandra-based Hadoop distribution called Brisk. ([[http://www.datastax.com/docs/0.8/brisk/index|Documentation]]) ([[http://github.com/riptano/brisk|Code]]) However, this code will no longer be maintained by DataStax; future development of Brisk is now part of a paid offering.
  
  [[#Top|Top]]
  
@@ -92, +92 @@

   * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower it depending on your data (for example, if rows are large).  Set it either in your Hadoop configuration or programmatically via `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize` (see the sketch after this list).
   * '''rpc_timeout_in_ms''' - set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`).  This timeout governs communication between nodes, not between the client and a node; increasing it reduces the chance of timing out.
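  A minimal sketch of lowering the batch size programmatically in a job driver (the class name and the value 1024 are only illustrative; tune the value for your data):
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.cassandra.hadoop.ConfigHelper;
  
  public class BatchSizeExample {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Fetch fewer rows per range request than the 4096 default;
          // helpful when rows are large enough to trigger timeouts.
          ConfigHelper.setRangeBatchSize(conf, 1024);
          // ... pass conf to the job as usual ...
      }
  }
  }}}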
  
+ If you still see timeout exceptions, with failed jobs and/or blacklisted tasktrackers as a result, there are settings that give Cassandra more latitude before a job fails.  For example, in either the job configuration or the tasktracker's mapred-site.xml (a per-job sketch follows the block):
+ {{{
+ <property>
+   <name>mapred.max.tracker.failures</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.map.max.attempts</name>
+   <value>20</value>
+ </property>
+ <property>
+   <name>mapred.reduce.max.attempts</name>
+   <value>20</value>
+ </property>
+ }}}
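+ The same values can also be set per job rather than cluster-wide.  A minimal sketch in a job driver, assuming the illustrative value of 20 from above:
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ 
+ public class RetrySettingsExample {
+     public static void main(String[] args) {
+         Configuration conf = new Configuration();
+         // Allow more task failures per tasktracker before blacklisting it.
+         conf.setInt("mapred.max.tracker.failures", 20);
+         // Allow more attempts per map/reduce task before failing the job.
+         conf.setInt("mapred.map.max.attempts", 20);
+         conf.setInt("mapred.reduce.max.attempts", 20);
+         // ... pass conf to the job as usual ...
+     }
+ }
+ }}}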
+ These settings default to 4 each, which some find too conservative.  If you set them too low, occasional timeout exceptions can blacklist tasktrackers and fail jobs.  If you set them too high, jobs that would otherwise fail quickly take a long time to fail, sacrificing efficiency.  Keep in mind that raising these values can simply mask a problem: you may always want them higher when operating against Cassandra, but if you hit these exceptions frequently, there may be a problem with your Cassandra or Hadoop configuration.
+ 
  If you are seeing inconsistent data coming back, consider the consistency level at which you read and write.  The two relevant properties (see the sketch after this list) are:
   * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
   * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
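  With ONE for both, a read can hit a replica that has not yet received an earlier write, so results may appear inconsistent.  A minimal sketch of raising both levels in the job configuration, assuming QUORUM is appropriate for your cluster (property names as above; the class name is only illustrative):
  {{{
  import org.apache.hadoop.conf.Configuration;
  
  public class ConsistencyExample {
      public static void main(String[] args) {
          Configuration conf = new Configuration();
          // Quorum reads and writes overlap on at least one replica,
          // so reads observe earlier writes.
          conf.set("cassandra.consistencylevel.read", "QUORUM");
          conf.set("cassandra.consistencylevel.write", "QUORUM");
      }
  }
  }}}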
