On May 5, 2009, at 7:11 AM, Shashikant Kore wrote:
Here is a quick update. I wrote a simple program to create a Lucene index from the text files and then generate document vectors for the indexed documents. I ran K-means after creating canopies on 100 documents, and it returned fine. Here are some of the problems.

1. As pointed out by Jeff, I need to maintain an external mapping from document IDs to vectors, which requires glue code outside the clustering. The MAHOUT-65 issue meant to handle that looks complex. Instead, can I just add a label to a vector and then change the decodeVector() and asFormatString() methods to handle the label? (A rough sketch of what I have in mind is below.)

2. Creating canopies for 1,000 documents took almost 75 minutes. Although the index contains 50,000 unique terms in total, each vector has fewer than 100 unique terms (i.e., each document vector is a sparse vector of cardinality 50,000 with about 100 non-zero elements; see the distance sketch below for why this matters). The hardware is admittedly low-end: 1 GB of RAM and a 1.6 GHz dual-core processor, with Hadoop running on a single node. The values of T1 and T2 were 80 and 55 respectively, as given in the sample program.
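Something like this is what I mean by adding a label. It is only a sketch: LabeledVector and the bracketed prefix format are my own invention, not existing Mahout code, while SparseVector, decodeVector(), and asFormatString() are the pieces already mentioned above.

    // Illustrative only: LabeledVector and the "[label] ..." prefix are
    // assumptions; decodeVector() would need a matching change to strip
    // and restore the prefix when parsing the string form.
    public class LabeledVector extends SparseVector {
      private String label; // e.g. the Lucene document ID

      public LabeledVector(String label, int cardinality) {
        super(cardinality);
        this.label = label;
      }

      public String getLabel() { return label; }

      @Override
      public String asFormatString() {
        // Prepend the label so it round-trips through the string form.
        return "[" + label + "] " + super.asFormatString();
      }
    }

That would keep the document ID attached to the vector through the whole clustering run instead of carrying the mapping in separate glue code.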
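On the 75 minutes: the canopy pass is likely dominated by distance computations, so the cost of a single comparison matters a great deal. If the distance measure walks the full 50,000-slot cardinality rather than only the non-zero entries, every comparison does roughly 500 times more work than the ~100 stored terms require. Here is a minimal sketch of a sparse Euclidean distance that touches only non-zero entries (plain HashMap-based for illustration, not Mahout's API):

    import java.util.Map;

    // Sketch: vectors stored as termIndex -> weight maps. The cost is
    // O(nnz(a) + nnz(b)), independent of the 50,000-term cardinality.
    static double sparseDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : a.entrySet()) {
        Double bv = b.get(e.getKey());
        double diff = e.getValue() - (bv == null ? 0.0 : bv);
        sum += diff * diff; // dimensions present in a (and possibly b)
      }
      for (Map.Entry<Integer, Double> e : b.entrySet()) {
        if (!a.containsKey(e.getKey())) {
          sum += e.getValue() * e.getValue(); // dimensions present only in b
        }
      }
      return Math.sqrt(sum);
    }

The T1/T2 thresholds (80 and 55 here) only control how many of these comparisons happen; the per-comparison cost comes entirely from the vector representation.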
Have you profiled it? Would be good to see where the issue is coming from.
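For a single-node run, the JVM's built-in hprof agent is a low-effort first look (the driver class below is a placeholder for whatever entry point is being run):

    java -agentlib:hprof=cpu=samples,interval=10,depth=8 <canopy-driver-class> <args>

CPU samples land in java.hprof.txt in the working directory. If the time turns out to be inside Hadoop child tasks rather than the driver, Hadoop's mapred.task.profile property can attach a profiling agent to the task JVMs instead.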
