On Jul 21, 2009, at 2:14 PM, Florian Leibert wrote:
Hi Shashi,
Great - I'm trying the settings maxDFPercent 50 and minDF 4 - I have a lot of very short documents, some of which can be very descriptive. I'm thinking I should have used the StopWordAnalyzer in Lucene when creating the index - that way the creation of the vectors would be much faster.
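A minimal sketch of that indexing setup, assuming Lucene 2.4-era APIs and the stock StopAnalyzer (the path and field name here are placeholders); note that term vectors have to be stored on the field if you want to pull document vectors out of the index afterwards:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class StopWordIndexing {
  public static void main(String[] args) throws Exception {
    // StopAnalyzer drops English stop words at index time, so they
    // never become terms - and never become vector dimensions later.
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/path/to/index"),   // hypothetical path
        new StopAnalyzer(),
        true,                                         // create a fresh index
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    // Term vectors must be stored on the field to extract
    // document vectors from the index afterwards.
    doc.add(new Field("body", "some very short but descriptive document",
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}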
Yesterday it took about 8 hours to process these vectors on a quad-core machine with 4 GB of heap - using the sequence file writer - I assume the bottleneck might have been the constant transfer into HDFS
You might try writing out to a local path and then simply copying to HDFS once it is done.
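Something along these lines, assuming the Hadoop FileSystem/SequenceFile APIs and Mahout's VectorWritable as the value class (paths are placeholders; block compression is shown explicitly, since compression is what keeps these files small):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class WriteLocalThenCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Write the vectors to local disk first, so the writer isn't
    // streaming every record over the wire into HDFS.
    FileSystem local = FileSystem.getLocal(conf);
    Path localPath = new Path("/tmp/vectors.seq");    // hypothetical path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        local, conf, localPath, Text.class, VectorWritable.class,
        SequenceFile.CompressionType.BLOCK);          // compress blocks of records
    // ... one writer.append(key, value) call per document vector ...
    writer.close();

    // Then one bulk copy into HDFS once the file is complete.
    FileSystem hdfs = FileSystem.get(conf);           // assumes conf points at HDFS
    hdfs.copyFromLocalFile(localPath, new Path("/user/flo/vectors.seq"));
  }
}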
- that's why I'm using the file writer now. It's been running on my 6 GB index for about 90 minutes now, and while yesterday's vector sequence file was 3 GB (without filtering), the JSON file is already at 16 GB (with filtering) - which I attribute to the compression of the sequence file...
The JSON file isn't going to do much for you in terms of actually clustering, if that is what you are after. The clustering algorithms work on SequenceFiles only. The JSON stuff is only really useful for human consumption, I guess.
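If the JSON dump is only for eyeballing the output, a short SequenceFile.Reader loop does the same job without producing a 16 GB file - a sketch assuming the Hadoop and Mahout math APIs (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class PeekAtVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(
        fs, new Path("/user/flo/vectors.seq"), conf); // hypothetical path
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    int shown = 0;
    // Print the first few document ids and their non-zero term counts.
    while (reader.next(key, value) && shown++ < 10) {
      System.out.println(key + "\t" + value.get().getNumNondefaultElements());
    }
    reader.close();
  }
}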
I'm trying to allot some time to transform the vector creation process to M/R if nobody else is working on that at the moment...
That would be great, but likely somewhat tricky with the Lucene index. Note also that the current approach is just one approach for creating the matrix. You certainly don't have to go through Lucene. You could implement other types of VectorIterables, or just forgo that altogether and create the sequence file on your own.
-Grant
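For the "create the sequence file on your own" route Grant mentions, a toy sketch assuming the Mahout math API (RandomAccessSparseVector and VectorWritable are class names from later Mahout releases; the in-memory dictionary, documents, and path are made up for illustration):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DirectVectorWriter {
  public static void main(String[] args) throws Exception {
    String[][] docs = {
        {"doc1", "short descriptive document"},
        {"doc2", "another short document"},
    };

    // First pass: assign every distinct term a fixed dimension. A real
    // job would build this dictionary (or hash terms) over the corpus.
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String[] doc : docs) {
      for (String term : doc[1].split("\\s+")) {
        if (!dictionary.containsKey(term)) {
          dictionary.put(term, dictionary.size());
        }
      }
    }

    // Second pass: one sparse term-frequency vector per document,
    // written straight to a SequenceFile - no Lucene index involved.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/raw-vectors.seq"),   // hypothetical path
        Text.class, VectorWritable.class);
    for (String[] doc : docs) {
      Vector vector = new RandomAccessSparseVector(dictionary.size());
      for (String term : doc[1].split("\\s+")) {
        int dim = dictionary.get(term);
        vector.set(dim, vector.get(dim) + 1);         // raw term frequency
      }
      writer.append(new Text(doc[0]), new VectorWritable(vector));
    }
    writer.close();
  }
}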