On May 5, 2009, at 7:11 AM, Shashikant Kore wrote:
Here is a quick update. I wrote a simple program to create a Lucene index from the text files and then generate document vectors for the indexed documents. I ran K-means after creating canopies on 100 documents, and it returned fine. Here are some of the problems.

1. As pointed out by Jeff, I need to maintain an external mapping from document IDs to vectors, which requires glue code outside the clustering. The MAHOUT-65 issue meant to handle that looks complex. Instead, can I just add a label to a vector and then change the decodeVector() and asFormatString() methods to handle the label? (A rough sketch of what I have in mind is below.)

2. Creating canopies for 1,000 documents took almost 75 minutes. Although the index contains 50,000 unique terms in total, each vector has fewer than 100 unique terms (i.e., each document vector is a sparse vector of cardinality 50,000 with about 100 non-zero elements; see the distance sketch below for why this matters). The hardware is admittedly low-end: 1 GB of RAM and a 1.6 GHz dual-core processor, with Hadoop running on a single node. The values of T1 and T2 were 80 and 55 respectively, as given in the sample program.
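Something like this is what I mean by adding a label. It is only a sketch: LabeledVector and the bracketed prefix format are my own invention, not existing Mahout code, while SparseVector, decodeVector(), and asFormatString() are the pieces already mentioned above.

    // Illustrative only: LabeledVector and the "[label] ..." prefix are
    // assumptions; decodeVector() would need a matching change to strip
    // and restore the prefix when parsing the string form.
    public class LabeledVector extends SparseVector {
      private String label; // e.g. the Lucene document ID

      public LabeledVector(String label, int cardinality) {
        super(cardinality);
        this.label = label;
      }

      public String getLabel() { return label; }

      @Override
      public String asFormatString() {
        // Prepend the label so it round-trips through the string form.
        return "[" + label + "] " + super.asFormatString();
      }
    }

That would keep the document ID attached to the vector through the whole clustering run instead of carrying the mapping in separate glue code.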
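On the 75 minutes: the canopy pass is likely dominated by distance computations, so the cost of a single comparison matters a great deal. If the distance measure walks the full 50,000-slot cardinality rather than only the non-zero entries, every comparison does roughly 500 times more work than the ~100 stored terms require. Here is a minimal sketch of a sparse Euclidean distance that touches only non-zero entries (plain HashMap-based for illustration, not Mahout's API):

    import java.util.Map;

    // Sketch: vectors stored as termIndex -> weight maps. The cost is
    // O(nnz(a) + nnz(b)), independent of the 50,000-term cardinality.
    static double sparseDistance(Map<Integer, Double> a, Map<Integer, Double> b) {
      double sum = 0.0;
      for (Map.Entry<Integer, Double> e : a.entrySet()) {
        Double bv = b.get(e.getKey());
        double diff = e.getValue() - (bv == null ? 0.0 : bv);
        sum += diff * diff; // dimensions present in a (and possibly b)
      }
      for (Map.Entry<Integer, Double> e : b.entrySet()) {
        if (!a.containsKey(e.getKey())) {
          sum += e.getValue() * e.getValue(); // dimensions present only in b
        }
      }
      return Math.sqrt(sum);
    }

The T1/T2 thresholds (80 and 55 here) only control how many of these comparisons happen; the per-comparison cost comes entirely from the vector representation.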
Have you profiled it? Would be good to see where the issue is coming from.
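For a single-node run, the JVM's built-in hprof agent is a low-effort first look (the driver class below is a placeholder for whatever entry point is being run):

    java -agentlib:hprof=cpu=samples,interval=10,depth=8 <canopy-driver-class> <args>

CPU samples land in java.hprof.txt in the working directory. If the time turns out to be inside Hadoop child tasks rather than the driver, Hadoop's mapred.task.profile property can attach a profiling agent to the task JVMs instead.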
