Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.

The "HadoopSupport" page has been changed by jeremyhanna:
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=35&rev2=36

Comment:
Removed old troubleshooting tip about pre-0.6.2 connection leak and added remarks about range scans and CL.

   * '''cassandra.range.batch.size''' - the default is 4096, but you may need to lower this depending on your data.  It can be set either in your hadoop configuration or via `org.apache.cassandra.hadoop.ConfigHelper.setRangeBatchSize` (see the sketch after this list).
   * '''rpc_timeout_in_ms''' - this is set in your `cassandra.yaml` (in 0.6 it's `RpcTimeoutInMillis` in `storage-conf.xml`).  This timeout applies to requests between nodes, not to requests from the client.  It can be increased to reduce the chance of timing out.
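  
  For example, a minimal sketch of lowering the batch size while setting up a job (the class name, job name, and the value 1000 are illustrative only; `ConfigHelper.setRangeBatchSize` is the call named above):
  
  {{{
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.cassandra.hadoop.ConfigHelper;
  
  public class JobSetup {
      public static void main(String[] args) throws Exception {
          Job job = new Job(new Configuration(), "cassandra-range-scan-job");
          // Fewer rows per underlying range request makes each request more
          // likely to complete within rpc_timeout_in_ms.
          ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
      }
  }
  }}}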
  
+ If you are seeing inconsistent data coming back, consider the consistency 
level that you are reading and writing at.  The two relevant properties are:
+  * '''cassandra.consistencylevel.read''' - defaults to !ConsistencyLevel.ONE.
+  * '''cassandra.consistencylevel.write''' - defaults to !ConsistencyLevel.ONE.
+ Note also that the hadoop integration uses range scans underneath, and range scans do not perform read repair.  However, reading at !ConsistencyLevel.QUORUM will reconcile differences among the nodes read; a sketch of setting both levels follows below.  See the ReadRepair page as well as the !ConsistencyLevel section of the [[http://wiki.apache.org/cassandra/API|API]] page for more details.
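+ 
+ A minimal sketch of setting both levels while setting up a job (the class name, job name, and the choice of QUORUM are illustrative only; the property names are the ones listed above):
+ 
+ {{{
+ import org.apache.hadoop.conf.Configuration;
+ import org.apache.hadoop.mapreduce.Job;
+ 
+ public class JobSetup {
+     public static void main(String[] args) throws Exception {
+         Job job = new Job(new Configuration(), "cassandra-cl-job");
+         // Reading at QUORUM reconciles differences among the replicas read,
+         // compensating for range scans skipping read repair.
+         job.getConfiguration().set("cassandra.consistencylevel.read", "QUORUM");
+         job.getConfiguration().set("cassandra.consistencylevel.write", "QUORUM");
+     }
+ }
+ }}}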
- Releases before 0.6.2/0.7 are affected by a small resource leak that may cause jobs to fail: connections are not released properly, which can exhaust open file descriptors.  Depending on your local setup you may hit this issue; you can work around it by raising the limit of open file descriptors for the process (e.g. in linux/bash using `ulimit -n 32000`).  The error will be reported on the hadoop job side as a thrift !TimedOutException.
- 
- If you are testing the integration against a single node and you see some failures, this may be normal: you are probably overloading the single machine, which can again result in timeout errors.  You can work around this by reducing the number of concurrent tasks:
- 
- {{{
- Configuration conf = job.getConfiguration();
- conf.setInt("mapred.tasktracker.map.tasks.maximum", 1);
- }}}
- Also, you may reduce the size in rows of the batch you are reading from Cassandra:
- 
- {{{
-              ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
- }}}
  
  [[#Top|Top]]
  
