[ https://issues.apache.org/jira/browse/MAHOUT-165?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776748#action_12776748 ]
Jake Mannix commented on MAHOUT-165:
------------------------------------

*bump*

So what is the collective vision on this now? Shashi says the current impl (from this patch) is pretty slow, but I suspect that is really just some bugs in the IntDoubleMap-based impl (mentioned above). Has anyone looked at that other vector implementation? Once we have this patch working, we can compare it to Colt and see how they stand up.

The question remains whether we should consider Commons-Math as our underlying linear package. We can't really use cmath 2.0, because its API is missing some key features we need: iterators, for one thing, and it has only one sparse vector impl, based on the map, with no IntDoubleMapping-based one, which is pretty key for the performance of some sparse algs. It does have lots of great solvers, but I've been seeing a disturbing number of bugs fixed in their SVD implementation lately (fixing bugs is great, but finding them means we don't know how many more are in there), the impls are translated from Fortran, and the code is dense and hard for me to debug when a bug does pop up.

The other option is to pull in something like MTJ, but it depends on some f2j stuff, which is what is keeping Commons-Math from taking it, even though MTJ itself can be appropriately licensed.

Personally, while I hate reinventing wheels, I hate even more depending on libraries that don't quite have the API or the functionality we need, when the primitives involved are so core to much of what we want to do. So I'm more in favor of doing one of two things:

* Use our own primitives as is, and improve them, remembering that our core competency is *scaling*: optimizing for 1000-element dense double[] vectors and 10k double[][] matrices isn't where we care as much, and most linear libraries aren't in the same mode of thinking.
* Rip the unacceptably licensed parts out of Colt and use the rest for our linear routines. The hep.aida.* packages aren't needed and can be removed without losing the key functionality, and those packages are the only ones that aren't Apache-compatible.

Thoughts? If we're going to allow delegation to other implementations, we need to settle on at least a wrapper linear API, but even that involves a bunch of decisions about what belongs in that API - not all implementations can do everything, so this form of flexibility comes at a price.
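To make the tradeoff concrete, here is a minimal sketch of the sorted parallel-array idea behind an IntDoubleMapping-style structure (the class and its details are illustrative, not the actual Mahout code): random set pays an O(n) shift, but iteration over non-zeros is a tight, cache-friendly loop.

{code:java}
import java.util.Arrays;

// Hypothetical sketch of a sorted parallel-array int->double mapping,
// in the spirit of OrderedIntDoubleMapping. Not the real Mahout class.
public class ParallelArrayMapping {
  private int[] indices = new int[8];
  private double[] values = new double[8];
  private int size = 0;

  public double get(int index) {
    // O(log n) lookup via binary search over the sorted index array
    int pos = Arrays.binarySearch(indices, 0, size, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  public void set(int index, double value) {
    int pos = Arrays.binarySearch(indices, 0, size, index);
    if (pos >= 0) {
      values[pos] = value;
      return;
    }
    int insert = -(pos + 1);
    if (size == indices.length) {  // grow both arrays in lockstep
      indices = Arrays.copyOf(indices, size * 2);
      values = Arrays.copyOf(values, size * 2);
    }
    // O(n) shift to keep the arrays sorted: this is what makes
    // random set expensive in this layout
    System.arraycopy(indices, insert, indices, insert + 1, size - insert);
    System.arraycopy(values, insert, values, insert + 1, size - insert);
    indices[insert] = index;
    values[insert] = value;
    size++;
  }

  // Iteration over non-zeros is a sequential scan of packed arrays,
  // which is why this layout wins for sparse dot products and similar algs
  public double dot(double[] dense) {
    double sum = 0.0;
    for (int i = 0; i < size; i++) {
      sum += values[i] * dense[indices[i]];
    }
    return sum;
  }
}
{code}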
> Using better primitives hash for sparse vector for performance gains
> --------------------------------------------------------------------
>
>                 Key: MAHOUT-165
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-165
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Matrix
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.3
>
>         Attachments: colt.jar, mahout-165-trove.patch, MAHOUT-165-updated.patch, mahout-165.patch, MAHOUT-165.patch, mahout-165.patch
>
> In SparseVector, we need a primitives hash map for indices and values. The present implementation of this hash map is not as efficient as some of the other implementations in non-Apache projects.
> In an experiment, I found that, for get/set operations, the primitive hash of Colt performs an order of magnitude better than OrderedIntDoubleMapping. For iteration it is 2x slower, though.
> Using Colt in SparseVector improved performance of canopy generation. For an experimental dataset, the current implementation takes 50 minutes. Using Colt reduces this duration to 19-20 minutes. That's a 60% reduction in the delay.
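For reference, the kind of structure behind those Colt get/set numbers is an open-addressed primitive hash map, roughly like the sketch below (illustrative only; the real cern.colt.map.OpenIntDoubleHashMap also handles deletion, rehashing, and capacity management, all omitted here). Expected O(1) get/put with no boxing, but iteration has to scan every slot, occupied or not, which matches the 2x-slower iteration observed.

{code:java}
// Hypothetical sketch of a Colt/Trove-style open-addressed int->double map.
// Assumes keys (vector indices) are non-negative, so -1 can mark free slots.
public class OpenIntDoubleMap {
  private static final int FREE = -1;
  private final int[] keys;
  private final double[] vals;
  private final int mask;

  // capacity must be a power of two (not validated in this sketch)
  public OpenIntDoubleMap(int capacityPow2) {
    keys = new int[capacityPow2];
    vals = new double[capacityPow2];
    java.util.Arrays.fill(keys, FREE);
    mask = capacityPow2 - 1;
  }

  private int slot(int key) {
    int h = key * 0x9E3779B9;            // cheap integer mixing
    int i = (h ^ (h >>> 16)) & mask;
    while (keys[i] != FREE && keys[i] != key) {
      i = (i + 1) & mask;                // linear probing
    }
    return i;
  }

  public double get(int key) {
    int i = slot(key);
    return keys[i] == key ? vals[i] : 0.0;  // expected O(1), no boxing
  }

  public void put(int key, double value) {
    int i = slot(key);
    keys[i] = key;                       // (resize on high load omitted)
    vals[i] = value;
  }

  // Iteration must visit every slot, including empty ones, which is
  // why iteration over non-zeros trails the packed parallel-array layout
  public double sum() {
    double s = 0.0;
    for (int i = 0; i <= mask; i++) {
      if (keys[i] != FREE) {
        s += vals[i];
      }
    }
    return s;
  }
}
{code}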