[
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720469#action_12720469
]
Grant Ingersoll commented on MAHOUT-121:
----------------------------------------
I've created some vectors from Wikipedia and can upload them. I just started
profiling using Canopy clustering 1000 Wikipedia docs and so far 87% of the
time (after 5% map phase) is spent in getQuick() and 80% of which is
Integer.valueOf(). I can run it longer, but I think it's safe to say it's the
bottleneck.
I'll try the patch and I'll upload the file and the params for running.
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Attachments: mahout-121.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors.
> The complete dataset has few tens of thousands of feature items but each
> vector has only couple of hundred feature items. For this, there is an
> optimization in distance calculation, a link to which I found the archives of
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors
> with few hundred items. I ran canopy generation with Euclidean distance and
> t1, t2 values as 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives
> is computationally expensive. I changed the sparse vector implementation to
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
>
> Licensing of Trove seems to be an issue which needs to be addressed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.