On Jul 21, 2009, at 2:14 PM, Florian Leibert wrote:
Hi Shashi,
Great - I'm trying the settings maxDFPercent 50 and minDF 4 - I have a lot of very short documents, some of which can be very descriptive. I'm thinking I should have used the StopWordAnalyzer in Lucene when creating the index - that way the creation of the vectors would be much faster.
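A minimal sketch of that indexing setup, assuming Lucene 2.4-era APIs and the stock StopAnalyzer (the path and field name here are placeholders); note that term vectors have to be stored on the field if you want to pull document vectors out of the index afterwards:

import org.apache.lucene.analysis.StopAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class StopWordIndexing {
  public static void main(String[] args) throws Exception {
    // StopAnalyzer drops English stop words at index time, so they
    // never become terms - and never become vector dimensions later.
    IndexWriter writer = new IndexWriter(
        FSDirectory.getDirectory("/path/to/index"),   // hypothetical path
        new StopAnalyzer(),
        true,                                         // create a fresh index
        IndexWriter.MaxFieldLength.UNLIMITED);

    Document doc = new Document();
    // Term vectors must be stored on the field to extract
    // document vectors from the index afterwards.
    doc.add(new Field("body", "some very short but descriptive document",
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();
  }
}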
Yesterday it took about 8 hours to process these vectors on a quad-core machine with 4 GB of heap - using the sequence file writer - I assume the bottleneck might have been the constant transfer into HDFS
You might try writing out to a local path and then simply copying to HDFS once it is done.
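Something along these lines, assuming the Hadoop FileSystem/SequenceFile APIs and Mahout's VectorWritable as the value class (paths are placeholders; block compression is shown explicitly, since compression is what keeps these files small):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class WriteLocalThenCopy {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Write the vectors to local disk first, so the writer isn't
    // streaming every record over the wire into HDFS.
    FileSystem local = FileSystem.getLocal(conf);
    Path localPath = new Path("/tmp/vectors.seq");    // hypothetical path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        local, conf, localPath, Text.class, VectorWritable.class,
        SequenceFile.CompressionType.BLOCK);          // compress blocks of records
    // ... one writer.append(key, value) call per document vector ...
    writer.close();

    // Then one bulk copy into HDFS once the file is complete.
    FileSystem hdfs = FileSystem.get(conf);           // assumes conf points at HDFS
    hdfs.copyFromLocalFile(localPath, new Path("/user/flo/vectors.seq"));
  }
}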
- that's why I'm using the file writer now. It's been running on my 6 GB index for about 90 minutes now, and while yesterday's vector sequence file was 3 GB (without filtering), the JSON file is already at 16 GB (with filtering) - which I attribute to the compression of the sequence file...
The JSON file isn't going to do much for you in terms of actually clustering, if that is what you are after. The clustering algorithms work on SequenceFiles only. The JSON stuff is only really useful for human consumption, I guess.
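If the JSON dump is only for eyeballing the output, a short SequenceFile.Reader loop does the same job without producing a 16 GB file - a sketch assuming the Hadoop and Mahout math APIs (the path is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.VectorWritable;

public class PeekAtVectors {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(
        fs, new Path("/user/flo/vectors.seq"), conf); // hypothetical path
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    int shown = 0;
    // Print the first few document ids and their non-zero term counts.
    while (reader.next(key, value) && shown++ < 10) {
      System.out.println(key + "\t" + value.get().getNumNondefaultElements());
    }
    reader.close();
  }
}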
I'm trying to allot some time to transform the vector creation process to M/R if nobody else is working on that at the moment...
That would be great, but likely somewhat tricky with the Lucene index. Note also that the current approach is just one approach for creating the matrix. You certainly don't have to go through Lucene. You could implement other types of VectorIterables, or just forgo that altogether and create the sequence file on your own.
-Grant
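For the "create the sequence file on your own" route Grant mentions, a toy sketch assuming the Mahout math API (RandomAccessSparseVector and VectorWritable are class names from later Mahout releases; the in-memory dictionary, documents, and path are made up for illustration):

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

public class DirectVectorWriter {
  public static void main(String[] args) throws Exception {
    String[][] docs = {
        {"doc1", "short descriptive document"},
        {"doc2", "another short document"},
    };

    // First pass: assign every distinct term a fixed dimension. A real
    // job would build this dictionary (or hash terms) over the corpus.
    Map<String, Integer> dictionary = new HashMap<String, Integer>();
    for (String[] doc : docs) {
      for (String term : doc[1].split("\\s+")) {
        if (!dictionary.containsKey(term)) {
          dictionary.put(term, dictionary.size());
        }
      }
    }

    // Second pass: one sparse term-frequency vector per document,
    // written straight to a SequenceFile - no Lucene index involved.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/tmp/raw-vectors.seq"),   // hypothetical path
        Text.class, VectorWritable.class);
    for (String[] doc : docs) {
      Vector vector = new RandomAccessSparseVector(dictionary.size());
      for (String term : doc[1].split("\\s+")) {
        int dim = dictionary.get(term);
        vector.set(dim, vector.get(dim) + 1);         // raw term frequency
      }
      writer.append(new Text(doc[0]), new VectorWritable(vector));
    }
    writer.close();
  }
}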