The 20 newsgroups dataset has 20,000 documents totalling 40MB, so that means an average of 2KB per document, i.e. a 300-word document (6 chars per word + a space).
When we convert to SparseVector, we have 20K vectors with 300 dimensions on average, each entry weighing 12 bytes, which should come to about 68MB (20,000 x 300 x 12 bytes) instead of the 2GB which I am getting here.

For the 2GB problem, I still have no clue what's getting written. Even the SparseVector writing module seems to work fine.

For the problem of the SparseVectors becoming larger than the actual dataset, I have the following thoughts. I used VIntWritable and VLongWritable in IntTupleWritable to compress the space needed to represent smaller integers (a variable 1 to 5 bytes to store an int). That gave me a lot of savings in the PFPGrowth algorithm. Does someone have a similar representation for double values? I mean, 8 bytes is too large to represent small values (do they need that level of precision at the scale Mahout is working in?). Is there a dual approach (float + double)?

Robin

On Sun, Jan 10, 2010 at 9:27 PM, Drew Farris <[email protected]> wrote:

> I've noticed the same thing when looking at SparseVectors contained within
> the results of ClusterDumper -- I didn't explore very far into why, but it
> seems that the json representation of the SparseVector doesn't use a map
> but instead uses parallel arrays of certain sizes. I'm not certain how the
> sizes are determined, but I assumed that this had something to do with how
> SparseVector is implemented.
>
> Perhaps this is/will be remedied in some of Jake's recent work?
>
> On Sun, Jan 10, 2010 at 9:43 AM, Robin Anil <[email protected]> wrote:
>
> > Lot of zeros being printed in the Json string. Is that normal for an
> > infinite cardinality vector?
> > http://pastebin.com/m6ff5f0ef
> > Same is true if I type cast to a Vector
> >
> > On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > > Have you dumped out the file? What's in it?
> > >
> > > Also, if you can use Vector instead of SparseVector in the API (it's
> > > fine to bind to SparseVector in the implementation) I think that would
> > > be better.
> > >
> > > On Jan 10, 2010, at 7:00 AM, Robin Anil wrote:
> > >
> > > > https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch
> > > >
> > > > Reduce => PartialVectorGenerator Class
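
To make the size estimate and the float/varint question at the top of this message concrete, here is a rough sketch in plain Java. It is not Mahout or Hadoop code: the class SparseEntrySizeSketch and its methods are hypothetical names made up for illustration, and writeVarInt is a generic 7-bits-per-byte encoding, not the exact byte layout that VIntWritable uses. It compares a fixed-width entry (4-byte int index + 8-byte double value) against a compact entry (variable-length index + 4-byte float value).

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical illustration only; none of these names exist in Mahout or Hadoop.
final class SparseEntrySizeSketch {

    /** Back-of-envelope size: documents x non-zero entries per document x bytes per entry. */
    static long estimateBytes(long docs, long entriesPerDoc, long bytesPerEntry) {
        return docs * entriesPerDoc * bytesPerEntry;
    }

    /**
     * Generic varint: 7 payload bits per byte, high bit set on every byte except the last.
     * Small non-negative ints take 1 byte, the largest take 5.
     */
    static void writeVarInt(DataOutputStream out, int v) throws IOException {
        while ((v & ~0x7F) != 0) {
            out.writeByte((v & 0x7F) | 0x80);
            v >>>= 7;
        }
        out.writeByte(v);
    }

    /** Serializes (index, value) pairs in memory and returns the number of bytes written. */
    static int serializedSize(int[] indices, double[] values, boolean compact) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            for (int i = 0; i < indices.length; i++) {
                if (compact) {
                    // Assumed compact layout: varint index + float value (loses precision).
                    writeVarInt(out, indices[i]);
                    out.writeFloat((float) values[i]);
                } else {
                    // Fixed layout: 4-byte int index + 8-byte double value = 12 bytes.
                    out.writeInt(indices[i]);
                    out.writeDouble(values[i]);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        long estimate = estimateBytes(20_000, 300, 12);
        System.out.printf("fixed-width estimate: %d bytes (%.1f MiB)%n",
                estimate, estimate / (1024.0 * 1024.0));

        int[] indices = {5, 300, 20_000};
        double[] values = {1.0, 0.5, 2.25};
        System.out.println("fixed:   " + serializedSize(indices, values, false) + " bytes");
        System.out.println("compact: " + serializedSize(indices, values, true) + " bytes");
    }
}
```

The trade-off in the compact layout is precision: a float keeps roughly 7 significant decimal digits, so whether it is good enough depends on the weights being stored, which this sketch does not try to answer.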
