[
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720798#action_12720798
]
Sean Owen commented on MAHOUT-121:
----------------------------------
I see, i did not even know this utils/ directory was added. I'll get it into my
client and update it accordingly.
I didn't see a failure in that test, but, do see the problem -- it uses a
random number generator without a seed. I will add a seed, and see if I can
find one that induces a failure.
To avoid confusion... I suppose let me keep the 'official' version of the patch
for now since I need to make some fixes. It is still substantially identical to
the last couple that have been posted, so, reviewing those is valid for anyone
who cares.
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Attachments: MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch,
> mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors.
> The complete dataset has few tens of thousands of feature items but each
> vector has only couple of hundred feature items. For this, there is an
> optimization in distance calculation, a link to which I found the archives of
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors
> with few hundred items. I ran canopy generation with Euclidean distance and
> t1, t2 values as 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives
> is computationally expensive. I changed the sparse vector implementation to
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
>
> Licensing of Trove seems to be an issue which needs to be addressed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.