On Jul 21, 2009, at 2:14 PM, Florian Leibert wrote:

Hi Shashi,
great - I'm trying the settings maxDFPercent 50 and minDF 4 - I have a lot
of very short documents, some of which can be very descriptive.
I'm thinking I should have used the StopWordAnalyzer in Lucene when creating
the index - that way the creation of the vectors would be much faster.

Yesterday it took about 8 hours to process these vectors on a quad-core machine with 4 GB of heap, using the sequence file writer - I assume that
the bottleneck might have been the constant transfer into HDFS

You might try writing out to a local path and then simply copying to HDFS once it is done.
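
Something along these lines should do the one-time bulk copy once the local write finishes (a rough sketch against the Hadoop FileSystem API; the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Copy a locally written vector file into HDFS in a single bulk transfer,
// instead of streaming every write over the network as it happens.
public class CopyLocalToHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem hdfs = FileSystem.get(conf);
    Path local = new Path("/tmp/vectors.seq");        // placeholder local path
    Path remote = new Path("/user/flo/vectors.seq");  // placeholder HDFS path
    // delSrc=false keeps the local copy around after the transfer
    hdfs.copyFromLocalFile(false, local, remote);
  }
}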

- that's why
I'm using the file writer now. It's been running on my 6 GB index for about 90 minutes now, and while the vector sequence file yesterday was 3 GB (without filtering), the JSON file is already at 16 GB (with filtering) -
which I attribute to the compression of the sequence file...

The JSON file isn't going to do much for you in terms of actually clustering, if that is what you are after. The clustering algorithms work on SequenceFiles only. The JSON stuff is only really useful for human consumption, I guess.
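
If you do end up needing a SequenceFile for clustering, the shape of it is roughly this (just a sketch - the key/value classes, the output path, and the assumption that the vector class implements Writable are mine; check what the particular clustering job actually expects):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.matrix.SparseVector;

// Write document vectors into a SequenceFile that the clustering jobs can
// read. Block compression is also what keeps these files small next to JSON.
public class WriteVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("vectors/part-00000"); // placeholder output path
    SequenceFile.Writer writer = SequenceFile.createWriter(fs, conf, out,
        Text.class, SparseVector.class, SequenceFile.CompressionType.BLOCK);
    try {
      SparseVector vector = new SparseVector(100); // cardinality is arbitrary here
      vector.set(3, 1.0);
      vector.set(42, 2.5);
      writer.append(new Text("doc-1"), vector);
    } finally {
      writer.close();
    }
  }
}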



I'm trying to allot some time to port the vector creation process to
M/R if nobody else is working on that at the moment...


That would be great, but likely somewhat tricky with the Lucene index. Note also that the current approach is just one approach for creating the matrix. You certainly don't have to go through Lucene. You could implement other types of VectorIterables, or just forgo that altogether and create the sequence file on your own.
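
For instance, a non-Lucene source could be as small as something like this (the class name and the one-document-per-line "index:value" format are entirely made up, just to show the shape - anything that hands back Vectors can feed the same sequence file writer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Iterator;
import java.util.NoSuchElementException;
import org.apache.mahout.matrix.SparseVector;
import org.apache.mahout.matrix.Vector;

// A vector source that never touches Lucene: each line of a text file is one
// document, given as whitespace-separated "index:value" pairs.
public class TextVectorIterable implements Iterable<Vector> {

  private final String fileName;
  private final int cardinality;

  public TextVectorIterable(String fileName, int cardinality) {
    this.fileName = fileName;
    this.cardinality = cardinality;
  }

  public Iterator<Vector> iterator() {
    final BufferedReader reader;
    try {
      reader = new BufferedReader(new FileReader(fileName));
    } catch (IOException e) {
      throw new IllegalStateException(e);
    }
    return new Iterator<Vector>() {
      private String line = readLine();

      private String readLine() {
        try {
          return reader.readLine();
        } catch (IOException e) {
          throw new IllegalStateException(e);
        }
      }

      public boolean hasNext() {
        return line != null;
      }

      public Vector next() {
        if (line == null) {
          throw new NoSuchElementException();
        }
        Vector vector = new SparseVector(cardinality);
        for (String pair : line.split("\\s+")) {
          if (pair.length() == 0) {
            continue; // skip blanks from leading/trailing whitespace
          }
          String[] parts = pair.split(":");
          vector.set(Integer.parseInt(parts[0]), Double.parseDouble(parts[1]));
        }
        line = readLine();
        return vector;
      }

      public void remove() {
        throw new UnsupportedOperationException();
      }
    };
  }
}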

-Grant

