[
https://issues.apache.org/jira/browse/MAHOUT-121?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745082#action_12745082
]
Rob Eden commented on MAHOUT-121:
---------------------------------
Hi guys. I'm the lead developer for the Trove project. Shashi mentioned the
problem you're having with Trove's license. I appreciate the interest and would
like to accommodate your usage. I'm going to speak to the original developer
about dual-licensing.
My specific question is, would the MPL be an acceptable license for usage or
does it have to be APL? (Not implying that I necessarily have a problem with
APL... just checking options.)
> Speed up distance calculations for sparse vectors
> -------------------------------------------------
>
> Key: MAHOUT-121
> URL: https://issues.apache.org/jira/browse/MAHOUT-121
> Project: Mahout
> Issue Type: Improvement
> Components: Matrix
> Reporter: Shashikant Kore
> Assignee: Grant Ingersoll
> Attachments: Canopy_Wiki_1000-2009-06-24.snapshot, doc-vector-4k,
> MAHOUT-121-cluster-distance.patch, MAHOUT-121-distance-optimization.patch,
> MAHOUT-121-new-distance-optimization.patch, mahout-121.patch,
> MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch, MAHOUT-121.patch,
> MAHOUT-121.patch, mahout-121.patch, MAHOUT-121jfe.patch, Mahout1211.patch
>
>
> From my mail to the Mahout mailing list.
> I am working on clustering a dataset which has thousands of sparse vectors.
> The complete dataset has few tens of thousands of feature items but each
> vector has only couple of hundred feature items. For this, there is an
> optimization in distance calculation, a link to which I found the archives of
> Mahout mailing list.
> http://lingpipe-blog.com/2009/03/12/speeding-up-k-means-clustering-algebra-sparse-vectors/
> I tried out this optimization. The test setup had 2000 document vectors
> with few hundred items. I ran canopy generation with Euclidean distance and
> t1, t2 values as 250 and 200.
>
> Current Canopy Generation: 28 min 15 sec.
> Canopy Generation with distance optimization: 1 min 38 sec.
> I know by experience that using Integer, Double objects instead of primitives
> is computationally expensive. I changed the sparse vector implementation to
> used primitive collections by Trove [
> http://trove4j.sourceforge.net/ ].
> Distance optimization with Trove: 59 sec
> Current canopy generation with Trove: 21 min 55 sec
> To sum, these two optimizations reduced cluster generation time by a 97%.
> Currently, I have made the changes for Euclidean Distance, Canopy and KMeans.
>
> Licensing of Trove seems to be an issue which needs to be addressed.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.