On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao <[email protected]> wrote:
> 1. In addition to OrderPreservingPartitioner, it would be useful to support
> MapReduce on RandomPartitioned Cassandra as well. We had a rough prototype
> that sort-of works at this moment. The difficulty with random partitioner
> is that it's a bit hard to generate the splits. In our prototype, we simply
> map each row to a split. This is ok for fat rows (e.g., a row includes all
> info for a user), but may be too fine-grained for other cases. Another
> possibility is to generate a split that corresponds to a set of rows in a
> hash-range (instead of key range). This requires some new apis in
> cassandra.
-1 on adding new apis to pound a square peg into a round hole. Like range
queries, hadoop splits only really make sense on OPP.

> 2. For better performance, in the future, it would be useful to expose and
> exploit data locality in cassandra so that a map task is executed on a
> cassandra node that owns the data locally. A related issue is
> https://issues.apache.org/jira/browse/CASSANDRA-197. It breaks
> encapsulation, but it's worth thinking about. Google's DFS and Bigtable
> both expose certain locality info for better performance.

That's why I'd like to ship hadoop integration out of the box, instead of
adding apis that should really be internal-use only for an external hadoop
layer.

-Jonathan
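For reference, the hash-range split idea from point 1 could be sketched roughly as below. This is only an illustration, not Cassandra or Hadoop code: it assumes row keys are hashed with MD5 into a token ring of size 2^127, and the names `TOKEN_SPACE`, `token_for`, `hash_range_splits`, and `split_index` are all hypothetical.

```python
import hashlib

# Assumption: tokens live on a ring of size 2**127 (not taken from the thread).
TOKEN_SPACE = 2 ** 127


def token_for(key: bytes) -> int:
    """Hash a row key with MD5 and fold the digest into the assumed token ring."""
    return int.from_bytes(hashlib.md5(key).digest(), "big") % TOKEN_SPACE


def hash_range_splits(num_splits: int) -> list[tuple[int, int]]:
    """Cut the token ring into contiguous half-open ranges [start, end)."""
    if num_splits < 1:
        raise ValueError("num_splits must be at least 1")
    bounds = [i * TOKEN_SPACE // num_splits for i in range(num_splits + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def split_index(token: int, splits: list[tuple[int, int]]) -> int:
    """Return the index of the split whose hash range contains the token."""
    for i, (start, end) in enumerate(splits):
        if start <= token < end:
            return i
    raise ValueError("token is outside the token ring")


if __name__ == "__main__":
    splits = hash_range_splits(4)
    for key in (b"alice", b"bob", b"carol"):
        print(key, "-> split", split_index(token_for(key), splits))
```

Each split then covers a set of rows instead of a single row, which addresses the "too fine-grained" concern. Enumerating the rows that fall in a given hash range is the part that would need the new apis objected to above.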
