Re: SparseVectors writing out a lot of data

Robin Anil Sun, 10 Jan 2010 13:04:09 -0800

The 20 newsgroups dataset has 20,000 documents totalling 40MB, which means
an average of 2K per document, i.e. roughly a 300-word document (6 chars per
word + space).

When we convert to SparseVector, we have 20K vectors with 300 dimensions on
average, at 12 bytes per entry, which should come to about 68MB instead of
the 2GB I am getting here.
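
To make that estimate easy to re-check, here is a small sketch of the
arithmetic. It assumes the 12 bytes per entry are a 4-byte int index plus an
8-byte double value; the class and method names are made up for illustration
and are not part of Mahout.

```java
// Back-of-envelope size check for the figures quoted above.
// Hypothetical helper, not a Mahout class.
class SparseVectorSizeEstimate {

    // Expected payload in bytes: one fixed-size entry per non-zero dimension.
    static long expectedBytes(long vectors, long avgNonZeros, int bytesPerEntry) {
        return vectors * avgNonZeros * bytesPerEntry;
    }

    // Convert bytes to mebibytes (1 MiB = 1024 * 1024 bytes).
    static double toMiB(long bytes) {
        return bytes / (1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        // Assumption: each entry is an int index plus a double value.
        int bytesPerEntry = Integer.BYTES + Double.BYTES; // 4 + 8 = 12
        long bytes = expectedBytes(20_000, 300, bytesPerEntry);
        System.out.println(bytes + " bytes, about " + (long) toMiB(bytes) + " MiB");
    }
}
```

Against that estimate, 2GB is roughly 30 times more than expected, so
per-entry overhead alone does not look like the explanation.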

For the 2GB problem I still have no clue what's getting written. Even the
SparseVector writing module seems to work fine.

For the problem of SparseVectors becoming larger than the actual dataset, I
have the following thoughts:

I used VIntWritable and VLongWritable in IntTupleWritable to reduce the
space needed to represent small integers (variable-length, 1-5 bytes per
int). That gave me a lot of savings in the PFPgrowth algorithm. Does someone
have a similar representation for double values? I mean, 8 bytes is too
large to represent small values (do they need that level of precision at
the scale Mahout is working in?). Is there a dual approach (float + double)?
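
To make the dual idea concrete, here is a minimal sketch using only
java.io. It writes a one-byte tag followed by a 4-byte float when the value
survives the round trip through float unchanged, and an 8-byte double
otherwise. CompactDoubleCodec and its tag values are hypothetical; nothing
like this is claimed to exist in Mahout or Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical helper illustrating a lossless float-or-double encoding.
class CompactDoubleCodec {
    private static final byte FLOAT_TAG = 0;
    private static final byte DOUBLE_TAG = 1;

    // Writes 5 bytes if the value is exactly representable as a float, else 9.
    static void write(DataOutputStream out, double value) throws IOException {
        float narrowed = (float) value;
        // Compare bit patterns so -0.0 and NaN round-trip correctly.
        if (Double.doubleToLongBits((double) narrowed) == Double.doubleToLongBits(value)) {
            out.writeByte(FLOAT_TAG);
            out.writeFloat(narrowed);
        } else {
            out.writeByte(DOUBLE_TAG);
            out.writeDouble(value);
        }
    }

    // Reads the tag, then the matching 4-byte or 8-byte payload.
    static double read(DataInputStream in) throws IOException {
        byte tag = in.readByte();
        if (tag == FLOAT_TAG) {
            return in.readFloat();
        }
        return in.readDouble();
    }

    static byte[] encode(double... values) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (double v : values) {
                write(out, v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static double[] decode(byte[] bytes, int count) {
        double[] values = new double[count];
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            for (int i = 0; i < count; i++) {
                values[i] = read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] weights = {1.0, 0.5, 3.0, 0.1};
        byte[] packed = encode(weights);
        System.out.println(packed.length + " bytes packed, "
                + weights.length * Double.BYTES + " bytes as plain doubles");
    }
}
```

One caveat with this lossless variant: it only saves space for values that
are exactly representable as floats (term counts, 0.5, and so on). A
fractional weight such as 0.1 still takes the double path and costs 9 bytes
with the tag, so for tf-idf style weights a lossy "always store float"
option may be the more realistic trade-off.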

Robin

On Sun, Jan 10, 2010 at 9:27 PM, Drew Farris <[email protected]> wrote:

> I've noticed the same thing when looking at SparseVectors contained
> within the results of ClusterDumper -- I didn't explore very far into why,
> but it seems that the JSON representation of the SparseVector doesn't use a
> map but instead uses parallel arrays of certain sizes. I'm not certain how
> the sizes are determined, but I assumed that this had something to do with
> how SparseVector is implemented.
>
> Perhaps this is/will be remedied in some of Jake's recent work?
>
> On Sun, Jan 10, 2010 at 9:43 AM, Robin Anil <[email protected]> wrote:
>
> > A lot of zeros are being printed in the JSON string. Is that normal for
> > an infinite cardinality vector?
> > http://pastebin.com/m6ff5f0ef
> > The same is true if I cast to a Vector.
> >
> >
> > On Sun, Jan 10, 2010 at 8:08 PM, Grant Ingersoll <[email protected]> wrote:
> >
> > > Have you dumped out the file?  What's in it?
> > >
> > > Also, if you can use Vector instead of SparseVector in the API (it's
> > > fine to bind to SparseVector in the implementation) I think that would
> > > be better.
> > >
> > > On Jan 10, 2010, at 7:00 AM, Robin Anil wrote:
> > >
> > > > https://issues.apache.org/jira/secure/attachment/12429846/DictionaryVectorizer.patch
> > > >
> > > > Reduce => PartialVectorGenerator Class
