[
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744980#action_12744980
]
Sean Owen commented on MAHOUT-121:
----------------------------------
These two implementations use a different algorithm. This implemented is the
analog of TreeSet in Java Collections, whereas you are talking about an analog
of HashMap.
A hash is likely going to be faster for gets. I imagine it's slower when
required to produce the keys/values in order. Which did you benchmark? let's
make sure we are asking the right question first.
Of course, if we find that in fact we far more often need fast gets than fast
iteration, we do want a hash-based implementation instead. Is that the
conclusion? (And could we make a new issue to continue that!)
We can't use Trove of course. The parts of colt we need appear to be
Apache-compatible. That seems like a fine thing to try next.
If it's not suitable, we can easily roll our own primitive-based hash. We
already have a long->Object map implementation.
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Assignee: Grant Ingersoll
> Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k,
> MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch,
> MAHOUT-121-new-distance-optimization.patch, mahout-121.patch,
> MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch,
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors.
> The complete dataset has few tens of thousands of feature items but each
> vector has only couple of hundred feature items. For this, there is an
> optimization in distance calculation, a link to which I found the archives of
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors
> with few hundred items. I ran canopy generation with Euclidean distance and
> t1, t2 values as 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives
> is computationally expensive. I changed the sparse vector implementation to
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
>
> Licensing of Trove seems to be an issue which needs to be addressed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.