[Cassandra Wiki] Trivial Update of "HadoopSupport" by jeremyhanna

Apache Wiki Wed, 16 Jun 2010 16:02:54 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "HadoopSupport" page has been changed by jeremyhanna.
http://wiki.apache.org/cassandra/HadoopSupport?action=diff&rev1=10&rev2=11

--------------------------------------------------

              SlicePredicate predicate = new SlicePredicate().setColumn_names(Arrays.asList(columnName.getBytes()));
              ConfigHelper.setSlicePredicate(job.getConfiguration(), predicate);
  }}}
- Cassandra's splits are location-aware (this is the nature of the Hadoop 
InputSplit design).  Cassandra  gives the Hadoop !JobTracker a list of 
locations with each split of data.  That way, the !JobTracker can try to 
preserve data locality when  assigning tasks to !TaskTrackers.  Therefore, when 
using Hadoop alongside  Cassandra, it is best to have a !TaskTracker running on 
the same node as  the Cassandra nodes, if data locality while processing is 
desired and to  minimize copying data between Cassandra and Hadoop nodes.
+ Cassandra's splits are location-aware (this is the nature of the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/InputSplit.html|InputSplit]]
 design).  Cassandra  gives the Hadoop 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/JobTracker.html|JobTracker]]
 a list of locations with each split of data.  That way, the !JobTracker can 
try to preserve data locality when  assigning tasks to 
[[http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/TaskTracker.html|TaskTracker]]s.
  Therefore, when using Hadoop alongside  Cassandra, it is best to have a 
!TaskTracker running on the same node as  the Cassandra nodes, if data locality 
while processing is desired and to  minimize copying data between Cassandra and 
Hadoop nodes.
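
The locality preference described above can be illustrated with a small, self-contained sketch. It does not use Hadoop's or Cassandra's API and is not how the !JobTracker actually schedules; the class `LocalitySketch`, the `Split` type, and `pickTracker` are hypothetical names invented for this illustration. It only shows the idea: each split carries a list of hosts, and a task goes to a tracker on one of those hosts when one is available.

```java
import java.util.Arrays;
import java.util.List;

public class LocalitySketch {
    /** Hypothetical stand-in for an input split: a range label plus the hosts holding that data. */
    public static final class Split {
        public final String range;
        public final List<String> locations;

        public Split(String range, List<String> locations) {
            this.range = range;
            this.locations = locations;
        }
    }

    /**
     * Returns a tracker host that is also one of the split's locations, if any.
     * Otherwise falls back to the first available tracker (data must then be
     * copied over the network), or null when no tracker is available.
     */
    public static String pickTracker(Split split, List<String> trackers) {
        for (String host : split.locations) {
            if (trackers.contains(host)) {
                return host;
            }
        }
        return trackers.isEmpty() ? null : trackers.get(0);
    }

    public static void main(String[] args) {
        Split split = new Split("(0, 100]", Arrays.asList("node2", "node3"));

        // A tracker runs on node2, which also holds the data: the task stays local.
        System.out.println(pickTracker(split, Arrays.asList("node1", "node2")));

        // No tracker runs on a node holding the data: the task falls back to node1.
        System.out.println(pickTracker(split, Arrays.asList("node1", "node4")));
    }
}
```

In this toy model the fallback case is the one that co-locating a !TaskTracker with every Cassandra node avoids.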
  
  As of 0.7, Cassandra will include a basic mechanism for writing job output 
back to Cassandra.  See 
[[https://issues.apache.org/jira/browse/CASSANDRA-1101|CASSANDRA-1101]] for 
details.
  
- Releases before  0.6.2/0.7 are affected by a small  resource leak that may 
cause jobs to fail (connections are not released  properly, causing a resource 
leak). Depending on your local setup you  may hit this issue, and workaround it 
by raising the limit of open file  descriptors for the process (e.g. in 
linux/bash using `ulimit -n 32000`).  The error will be reported on  the hadoop 
job side as a thrift TimedOutException.
+ Releases before  0.6.2/0.7 are affected by a small  resource leak that may 
cause jobs to fail (connections are not released  properly, causing a resource 
leak). Depending on your local setup you  may hit this issue, and workaround it 
by raising the limit of open file  descriptors for the process (e.g. in 
linux/bash using `ulimit -n 32000`).  The error will be reported on  the hadoop 
job side as a thrift !TimedOutException.
  
  If you are testing the integration against a single node and see some 
failures, this may be normal: you are probably overloading the single machine, 
which may again result in timeout errors. You can work around this by reducing 
the number of concurrent tasks.
  
@@ -31, +31 @@

               ConfigHelper.setRangeBatchSize(job.getConfiguration(), 1000);
  }}}
  == Pig ==
- Cassandra 0.6 also adds support for [[http://hadoop.apache.org/pig/|Pig]] 
with its own implementation of LoadFunc.  This allows Pig queries to be run 
against data stored in Cassandra.  For an example of this, see the contrib/pig 
example in 0.6 and later.
+ Cassandra 0.6 also adds support for [[http://hadoop.apache.org/pig/|Pig]] 
with its own implementation of 
[[http://hadoop.apache.org/pig/docs/r0.7.0/api/org/apache/pig/LoadFunc.html|LoadFunc]].
  This allows Pig queries to be run against data stored in Cassandra.  For an 
example of this, see the contrib/pig example in 0.6 and later.
  
  == Hive ==
  Hive is not currently supported in Cassandra, but supporting it in the 
future has been considered; see 
[[https://issues.apache.org/jira/browse/CASSANDRA-913|CASSANDRA-913]].

