On Tue, Jan 5, 2010 at 8:51 AM, Sean Owen <sro...@gmail.com> wrote:

> > Similarly, while write out can be the same for different Sparse impls, one
> > will be writing the index/value pairs in order, the other will not, and this
> > will affect what needs to be done on reading in...
>
> You can always serialize the type of vector being written first or
> something. This is what Java serialization does too.

Yeah, there's nothing wrong with that, that's what I was thinking too.

> The fact that we're led to java.io.Serializable reinforces the
> question I've always had about this aspect of Hadoop -- what was so
> unusable about the existing serialization mechanism? seems like a
> needless reinvention, that foregoes some of the nice aspects of the
> serialization mechanism.
>
> ... which further leads me to comment that the *best* way of
> approaching all this would be to implement Serializable correctly.
> Then generically create one Writable wrapper that leverages
> Serializable to do its work. Then we have everything implemented
> nicely. You can use Vectors in any context that leverages standard
> serialization -- which is a big deal, for example, in J2EE.
>
> I stand by that until someone points out why this won't work or is
> slow or something at runtime; don't see it yet.

I can certainly try and see how doing it that way helps.

> I think the world of vectors probably does break down into, at most,
> sparse and dense representations. So maybe there are at most two
> serialization routines to write. Not bad. I don't really see what's so
> wrong-ish about needing a serialization mechanism for every distinct
> representation -- that would make logical sense at least.
>
> What other representations are we anticipating anyhow?

In MAHOUT-206 there are two SparseVectors: one map-based, one
array-based. They are efficient in different ways.

Other than that, yes, there is one other case I can think of off the
top of my head: RandomVector. You don't need to keep more than a seed
and a couple of parameters, and it can reconstruct itself on the fly.

There may be others which use the same data structure as the ones we
have, but have methods overridden in funky ways, maybe.

  -jake
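
For reference, here is a minimal sketch of the "serialize the type of vector first" idea from the thread. It is an illustration only: VectorCodec, TAG_DENSE and TAG_SPARSE are hypothetical names, not Mahout or Hadoop APIs, and the sketch uses plain java.io.DataOutput/DataInput rather than Writable or Serializable. The reader looks at the tag byte and then knows which layout follows, so the sparse writer is free to emit its index/value pairs in any order.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.Map;

/** Hypothetical codec for illustration; not part of Mahout or Hadoop. */
final class VectorCodec {
    // Assumed tag values; any distinct bytes would do.
    static final byte TAG_DENSE = 0;
    static final byte TAG_SPARSE = 1;

    private VectorCodec() {}

    /** Dense layout: tag, length, then every value in index order. */
    static void writeDense(double[] values, DataOutput out) throws IOException {
        out.writeByte(TAG_DENSE);
        out.writeInt(values.length);
        for (double v : values) {
            out.writeDouble(v);
        }
    }

    /** Sparse layout: tag, cardinality, entry count, then index/value pairs in map order. */
    static void writeSparse(int cardinality, Map<Integer, Double> entries, DataOutput out)
            throws IOException {
        out.writeByte(TAG_SPARSE);
        out.writeInt(cardinality);
        out.writeInt(entries.size());
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            out.writeInt(e.getKey());
            out.writeDouble(e.getValue());
        }
    }

    /** Reads the tag first, then dispatches on it; returns a dense array in both cases. */
    static double[] read(DataInput in) throws IOException {
        byte tag = in.readByte();
        switch (tag) {
            case TAG_DENSE: {
                double[] values = new double[in.readInt()];
                for (int i = 0; i < values.length; i++) {
                    values[i] = in.readDouble();
                }
                return values;
            }
            case TAG_SPARSE: {
                double[] values = new double[in.readInt()];
                int count = in.readInt();
                for (int i = 0; i < count; i++) {
                    int index = in.readInt();
                    values[index] = in.readDouble();
                }
                return values;
            }
            default:
                throw new IOException("Unknown vector type tag: " + tag);
        }
    }
}
```

A seed-only representation such as the RandomVector case would, under the same assumptions, just be a third tag followed by the seed and its parameters.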