On Wed, Jul 29, 2009 at 1:37 AM, Jeff Hodges<[email protected]> wrote:
> Comments inline.
>
> On Fri, Jul 24, 2009 at 10:00 AM, Jonathan Ellis<[email protected]> wrote:
>> On Fri, Jul 24, 2009 at 11:08 AM, Jun Rao<[email protected]> wrote:
>>> 1. In addition to OrderPreservingPartitioner, it would be useful to
>>> support MapReduce on RandomPartitioned Cassandra as well. We have a
>>> rough prototype that sort of works at the moment. The difficulty with
>>> the random partitioner is that it is hard to generate the splits. In
>>> our prototype, we simply map each row to a split. This is fine for
>>> fat rows (e.g., a row that includes all the info for a user), but may
>>> be too fine-grained for other cases. Another possibility is to
>>> generate a split that corresponds to a set of rows in a hash range
>>> (instead of a key range). This requires some new APIs in Cassandra.
>>
>> -1 on adding new APIs to pound a square peg into a round hole.
>>
>> Like range queries, Hadoop splits only really make sense on OPP.
>
> Why would it only make sense on OPP? If it weren't an externally
> exposed part of the API, what other concerns do you have about a
> hash-range query? I can't think of any beyond the usual increased
> code complexity argument (i.e., development, testing, and maintenance
> costs).
Because you have to violate encapsulation pretty badly and provide ops
acting on a hash instead of a key, so you'd be providing a parallel,
public API that only applies to the hash partitioner. It's a bad enough
hack that I'd say "feel free to maintain that in your own tree, but not
in the public repo." :)

> There is something in Hadoop that attempts to solve some of the data
> locality problem called NetworkTopology. It's used to provide data
> locality for CombineFileInputFormat (among, I'm sure, other things).
>
> Combining this with the knowledge we would have of which node each key
> range would be on, there is a chance Hadoop could do some of the
> locality work for us. Looking at the code for CombineFileInputFormat,
> it doesn't seem to be a particularly straightforward bit of work to
> translate to Cassandra, but I'm sure with a little time and maybe a
> little guidance from some Hadoop folks, we could make it happen.
>
> In any case, this seems to be evidence that locality can be added on
> later. It will not be a simple drop-in deal, but it wouldn't seem to
> require us to completely overhaul how we think about the input
> splitting.

Jun mentioned #197 -- I'm still -1 on adding such a beast to the Thrift
API, but I think it would be OK to expose it in get_string_property,
suitably (JSON?) encoded.

> (Oh, and has anyone got a mnemonic or anything to remember which of
> org.apache.hadoop.mapred and org.apache.hadoop.mapreduce is the new
> one? I'll be jiggered if I can keep it straight.)

mapreduce is the new one. They got lucky and left the full name open
for their second try. :)

-Jonathan
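For reference, the hash-range split idea Jun describes above can be sketched without any Cassandra internals: carve the partitioner's token space into contiguous sub-ranges and treat each one as a Hadoop split. The class below is a rough illustration under the assumption that RandomPartitioner tokens are MD5 hashes in [0, 2^127); the class and method names are hypothetical, not existing Cassandra or Hadoop API.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: divide the RandomPartitioner token space
// (assumed here to be MD5 hashes in [0, 2^127)) into numSplits
// contiguous hash ranges, one per Hadoop input split.
public class TokenRangeSplitter {
    static final BigInteger MIN = BigInteger.ZERO;
    static final BigInteger MAX = BigInteger.valueOf(2).pow(127);

    // Returns a list of [start, end) token pairs covering the whole space.
    public static List<BigInteger[]> split(int numSplits) {
        List<BigInteger[]> splits = new ArrayList<BigInteger[]>();
        BigInteger width = MAX.subtract(MIN)
                              .divide(BigInteger.valueOf(numSplits));
        BigInteger start = MIN;
        for (int i = 0; i < numSplits; i++) {
            // the last split absorbs the integer-division remainder
            BigInteger end = (i == numSplits - 1) ? MAX : start.add(width);
            splits.add(new BigInteger[] { start, end });
            start = end;
        }
        return splits;
    }
}
```

Each range would then be handed to a map task that asks the cluster for "all rows whose token falls in [start, end)" -- which is exactly the hash-based operation the thread is debating whether to expose.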
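The locality discussion above boils down to one client-side lookup: given a ring description (say, decoded from whatever encoded form get_string_property might return -- the property name and encoding are unspecified in this thread), find which endpoint owns a given hash so a split can report it as its preferred host. A minimal consistent-hashing sketch, with illustrative names that are not actual Cassandra API:

```java
import java.math.BigInteger;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical sketch: the node owning a key's hash is the one with
// the smallest token >= that hash, wrapping around to the first node
// on the ring. Hadoop's scheduler only needs the resulting hostname
// (via InputSplit locations) to try to run the map task near its data.
public class RingLookup {
    private final TreeMap<BigInteger, String> ring =
        new TreeMap<BigInteger, String>();

    public void addNode(BigInteger token, String endpoint) {
        ring.put(token, endpoint);
    }

    public String endpointFor(BigInteger hash) {
        // tailMap gives all tokens >= hash; empty means wrap around
        SortedMap<BigInteger, String> tail = ring.tailMap(hash);
        BigInteger token = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
        return ring.get(token);
    }
}
```

A hash-range split's preferred hosts would then be the endpoints owning its start token (plus replicas), which is the piece of information #197 would make available to clients.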
