Re: hadoop tasks reading from cassandra

Jonathan Ellis Fri, 24 Jul 2009 10:00:52 -0700

On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<[email protected]> wrote:
> 1. In addition to OrderPreservingPartitioner, it would be useful to support
> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
> that sort-of works at this moment. The difficulty with random partitioner
> is that it's a bit hard to generate the splits. In our prototype, we simply
> map each row to a split. This is ok for fat rows (e.g., a row includes all
> info for a user), but may be too fine-grained for other cases. Another
> possibility is to generate a split that corresponds to a set of rows in a
> hash-range (instead of key range). This requires some new apis in
> cassandra.
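
For concreteness, the hash-range split idea quoted above amounts to roughly the following. This is only a sketch under assumptions, not an existing Cassandra API: it assumes tokens are 128-bit MD5 digests of the row key, and the names `token`, `hash_range_splits`, and `split_index` are made up for illustration.

```python
import hashlib

# Assumption: the token space is the set of 128-bit MD5 digests.
RING_SIZE = 2 ** 128


def token(key: str) -> int:
    """Hypothetical token function: the MD5 digest of the key as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def hash_range_splits(num_splits: int):
    """Cut the token space into num_splits contiguous half-open ranges [lo, hi)."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    bounds = [i * RING_SIZE // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_index(key: str, splits) -> int:
    """Return the index of the split whose hash range contains the key's token."""
    t = token(key)
    for i, (lo, hi) in enumerate(splits):
        if lo <= t < hi:
            return i
    raise ValueError("token fell outside the ring")


if __name__ == "__main__":
    splits = hash_range_splits(4)
    for key in ["alice", "bob", "carol"]:
        print(key, "-> split", split_index(key, splits))
```

Each split covers a set of rows by hash rather than by key, so a map task would need a way to scan "all rows whose token falls in [lo, hi)", which is the new API being asked for.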


-1 on adding new APIs to pound a square peg into a round hole.

Like range queries, Hadoop splits only really make sense on OPP.

> 2. For better performance, in the future, it would be useful to expose and
> exploit data locality in cassandra so that a map task is executed on a
> cassandra node that owns the data locally. A related issue is
> https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
> encapsulation, but it's worth thinking about. Google's DFS and Bigtable
> both expose certain locality info for better performance.

That's why I'd like to ship Hadoop integration out of the box, instead
of exposing APIs that should really be internal-use only just so an
external Hadoop layer can call them.

-Jonathan
