On May 22, 2009, at 6:52 AM, Shashikant Kore wrote:

Hi,

I am working on clustering a dataset which has thousands of sparse
vectors. The complete dataset has a few tens of thousands of feature
items, but each vector has only a couple of hundred feature items.
There is an optimization of the distance calculation for this case; I
found a link to it in the archives of the Mahout mailing list.

http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
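
For readers following along, the idea as I understand it rests on expanding the squared Euclidean distance: |x - c|^2 = |x|^2 - 2(x . c) + |c|^2. If |c|^2 is computed once per centroid, each distance only needs a pass over the non-zero entries of the sparse vector x. Below is a minimal Java sketch of that idea, assuming the sparse vector is held as parallel index/value arrays and the centroid is dense. The class and method names are hypothetical and are not Mahout's API.

```java
/** Hypothetical helper illustrating the expanded-distance idea; not Mahout code. */
final class SparseDistance {

    /** Squared Euclidean norm of a dense vector; compute once per centroid and reuse. */
    static double squaredNorm(double[] dense) {
        double sum = 0.0;
        for (double v : dense) {
            sum += v * v;
        }
        return sum;
    }

    /**
     * Squared distance between a sparse vector (idx/val hold its non-zero entries)
     * and a dense centroid, using |x|^2 - 2(x . c) + |c|^2.
     * Only the non-zero entries of the sparse vector are visited.
     */
    static double squaredDistance(int[] idx, double[] val,
                                  double[] centroid, double centroidSqNorm) {
        double xSqNorm = 0.0;
        double dot = 0.0;
        for (int k = 0; k < idx.length; k++) {
            xSqNorm += val[k] * val[k];
            dot += val[k] * centroid[idx[k]];
        }
        // Guard against a tiny negative result caused by floating-point rounding.
        return Math.max(0.0, xSqNorm - 2.0 * dot + centroidSqNorm);
    }

    /** Euclidean distance built on the squared form above. */
    static double distance(int[] idx, double[] val,
                           double[] centroid, double centroidSqNorm) {
        return Math.sqrt(squaredDistance(idx, val, centroid, centroidSqNorm));
    }
}
```

With a few hundred non-zero entries per vector and tens of thousands of features, the loop touches only the few hundred entries instead of every dimension, which is where the saving would come from.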

I tried out this optimization.  The test setup had 2000 document
vectors with a few hundred items each.  I ran canopy generation with
Euclidean distance and t1, t2 values of 250 and 200.

Current Canopy Generation: 28 min 15 sec.
Canopy Generation with distance optimization: 1 min 38 sec.


Very cool.

I know from experience that using Integer and Double objects instead of
primitives is computationally expensive. I changed the sparse vector
implementation to use the primitive collections from Trove [
http://trove4j.sourceforge.net/ ].
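
To illustrate the boxing point without depending on any particular library, here is a rough sketch of a sparse vector backed by primitive arrays, so that no Integer or Double objects are allocated during lookups or dot products. It assumes indices are kept sorted in ascending order; the class name and layout are hypothetical and are neither Trove's API nor Mahout's SparseVector.

```java
/** Hypothetical primitive-backed sparse vector; a sketch, not Trove or Mahout code. */
final class PrimitiveSparseVector {
    private final int[] indices;   // non-zero positions, sorted ascending (assumed)
    private final double[] values; // values[k] belongs to indices[k]

    PrimitiveSparseVector(int[] indices, double[] values) {
        if (indices.length != values.length) {
            throw new IllegalArgumentException("indices and values must have equal length");
        }
        this.indices = indices.clone();
        this.values = values.clone();
    }

    /** Value at the given index, or 0.0 if the entry is not stored. */
    double get(int index) {
        int pos = java.util.Arrays.binarySearch(indices, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    /** Dot product by merging the two sorted index arrays; no boxing involved. */
    double dot(PrimitiveSparseVector other) {
        int i = 0;
        int j = 0;
        double sum = 0.0;
        while (i < indices.length && j < other.indices.length) {
            if (indices[i] == other.indices[j]) {
                sum += values[i] * other.values[j];
                i++;
                j++;
            } else if (indices[i] < other.indices[j]) {
                i++;
            } else {
                j++;
            }
        }
        return sum;
    }
}
```

A hash-based primitive map trades the sorted-array requirement for constant-time updates; the sorted-array layout is used here only because it needs nothing beyond the JDK.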

Distance optimization with Trove: 59 sec
Current canopy generation with Trove: 21 min 55 sec

To sum up, these two optimizations together reduced canopy generation
time by about 97% (from 28 min 15 sec to 59 sec).

Currently, I have made the changes for Euclidean Distance, Canopy and
KMeans.  How do we go about pushing these changes to Mahout?


http://cwiki.apache.org/MAHOUT/howtocontribute.html

It's a bit complicated by Trove, b/c that is LGPL. What that means, unfortunately, is that we can't check it into our code or distribute it. However, if it is in a Maven repo somewhere (I see an old version), then it is easier to include. I haven't looked at the code, but is it possible that http://commons.apache.org/primitives/ fills the same role, or that some other library out there has a more friendly license?

Regardless, feel free to submit a patch, so we can at least look at it and have something concrete to discuss in JIRA.

Thanks,
Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
