On Tue, Jan 5, 2010 at 10:02 AM, Drew Farris <drew.far...@gmail.com> wrote:
> On Tue, Jan 5, 2010 at 11:46 AM, Drew Farris <drew.far...@gmail.com>
> wrote:
>
>> Have you seen any cases where a class hierarchy of Writables is
>> established to do something like that? E.g. the mapreduce jobs are
>> written to use VectorWritable, but subclasses (e.g.
>> SparseVectorWritable) are available for specific needs?
>
> Bah, nevermind -- this is precisely what Mahout does today without
> separating the Vector and Writable portions into two separate classes.
> Serious brain lapse that one.

Yeah, that's what isn't working well - Hadoop likes to check for an exact
match on classes, which kills proper OOD. There may be a reason for it, but
I can't see it.

> Of course this would probably be a very straightforward approach to
> implement: simply separate out the Writable portions of each Vector
> implementation into its own class. The Writable implementation to use
> would be specified at runtime, and this would also determine which
> underlying Vector implementation is used. The price we pay for separating
> the Writable stuff from the Vectors is an extra class that implements
> Writable for each implementation. Since the Writable (and thus the
> implementation) to use is specified at runtime via options, there's no
> need for an ugly switch statement anywhere.

How would you specify which Writable implementation to use at runtime? You
have Mappers and Reducers which are keyed on Writable types... you need to
pick which one to use.

> Theoretically one could even decouple the Writable (serialization style)
> from the (in-memory) implementation, but I don't know if there is any
> need for that whatsoever.

Yeah, I'd like this, because the two different SparseVector impls have
different in-memory structure but basically the same serialization
(key-value pairs of int and double). I think I can work out a way to get
this to work. Just not sure how ugly it would get.

  -jake
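
P.S. Here's roughly what I have in mind for decoupling the serialization
from the in-memory impl: a single Writable wrapper that writes (dimension,
nonzero count, index/value pairs) and, on read, rebuilds whichever Vector
impl the job conf names. The config key and the reflection plumbing are
made up purely for illustration, and I'm writing against the Vector
interface from memory, so don't take the exact names or the default impl
too literally:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.io.Writable;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class VectorWritable extends Configured implements Writable {

  // Illustrative config key naming the in-memory Vector impl to build on read.
  public static final String VECTOR_IMPL_KEY = "mahout.vector.impl";

  private Vector vector;

  public VectorWritable() { }

  public VectorWritable(Vector vector) { this.vector = vector; }

  public Vector get() { return vector; }

  public void set(Vector vector) { this.vector = vector; }

  @Override
  public void write(DataOutput out) throws IOException {
    // Same on-disk format regardless of which Vector impl backs this instance:
    // dimension, nonzero count, then (index, value) pairs.
    out.writeInt(vector.size());
    out.writeInt(vector.getNumNondefaultElements());
    Iterator<Vector.Element> it = vector.iterateNonZero();
    while (it.hasNext()) {
      Vector.Element e = it.next();
      out.writeInt(e.index());
      out.writeDouble(e.get());
    }
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    int size = in.readInt();
    int nonZeros = in.readInt();
    // Hadoop's ReflectionUtils.newInstance() passes the job conf to
    // Configurable Writables in the usual read paths, so getConf() should
    // have the job configuration here.
    Class<? extends Vector> impl = getConf().getClass(
        VECTOR_IMPL_KEY, RandomAccessSparseVector.class, Vector.class);
    try {
      vector = impl.getConstructor(int.class).newInstance(size);
    } catch (Exception e) {
      throw new IOException("Could not instantiate " + impl, e);
    }
    for (int i = 0; i < nonZeros; i++) {
      vector.setQuick(in.readInt(), in.readDouble());
    }
  }
}

The nice part is that the Mapper/Reducer signatures only ever mention
VectorWritable, and the actual in-memory impl becomes a per-job option
instead of another Writable class per Vector class.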