[ https://issues.apache.org/jira/browse/MAHOUT-6?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12577181#action_12577181 ]
Jason Rennie commented on MAHOUT-6:
-----------------------------------

David's right, I'm wrong: a primitive implementation didn't help (with respect to speed). I tested the pcj IntKeyDoubleOpenHashMap vs. java.util.HashMap<Integer,Double>. I tried two load factors (I set initial capacity to 1/loadFactor): 25% and 50%. The two maps were roughly comparable at 25%; java.util.HashMap was faster at 50%. The CRS implementation was 2.3-3.6x faster than java.util.HashMap (after subtracting the StopWatch "base" times). The details:

/**
 * <ul>
 * <li> Finished HashMap dot product in 6.501 seconds
 * <li> Finished PrimitiveMap dot product in 6.589 seconds
 * <li> Finished CRS dot product in 2.632 seconds
 * <li> Finished SW base time in 1.162 seconds
 * <li> numTrials=1000000 vectorSize=1000 nnz1=100 nnz2=100 loadFactor=25%
 * </ul>
 * <ul>
 * <li> Finished HashMap dot product in 4.573 seconds
 * <li> Finished PrimitiveMap dot product in 4.033 seconds
 * <li> Finished CRS dot product in 2.290 seconds
 * <li> Finished SW base time in 1.103 seconds
 * <li> numTrials=1000000 vectorSize=1000 nnz1=200 nnz2=50 loadFactor=25%
 * </ul>
 * <ul>
 * <li> Finished HashMap dot product in 6.244 seconds
 * <li> Finished PrimitiveMap dot product in 7.417 seconds
 * <li> Finished CRS dot product in 2.575 seconds
 * <li> Finished SW base time in 1.054 seconds
 * <li> numTrials=1000000 vectorSize=1000 nnz1=100 nnz2=100 loadFactor=50%
 * </ul>
 * <ul>
 * <li> Finished HashMap dot product in 3.745 seconds
 * <li> Finished PrimitiveMap dot product in 4.249 seconds
 * <li> Finished CRS dot product in 2.270 seconds
 * <li> Finished SW base time in 1.164 seconds
 * <li> numTrials=1000000 vectorSize=1000 nnz1=200 nnz2=50 loadFactor=50%
 * </ul>
 */
public void testSparseVectorPerformance() throws Exception {
  StopWatch hmvSW = new StopWatch("HashMap dot product", log, false);
  StopWatch pmvSW = new StopWatch("PrimitiveMap dot product", log, false);
  StopWatch crsSW = new StopWatch("CRS dot product", log, false);
  StopWatch baseSW = new StopWatch("SW base time", log, false);
  final int numTrials = 1000000;
  final int vectorSize = 1000;
  final int nnz1 = 100;
  final int nnz2 = 100;
  final int sizeMultiple = 2;
  for (int i = 0; i < numTrials; ++i) {
    // ignore first 10% of iterations (warm-up)
    if (i == numTrials / 10) {
      hmvSW.reset();
      pmvSW.reset();
      crsSW.reset();
      baseSW.reset();
    }
    SparseVectorPrimitiveMap pmv1 = new SparseVectorPrimitiveMap(nnz1 * sizeMultiple);
    SparseVectorPrimitiveMap pmv2 = new SparseVectorPrimitiveMap(nnz2 * sizeMultiple);
    SparseVectorHashMap hmv1 = new SparseVectorHashMap(nnz1 * sizeMultiple);
    SparseVectorHashMap hmv2 = new SparseVectorHashMap(nnz2 * sizeMultiple);
    for (int j = 0; j < nnz1; ++j) {
      int index = this.rand.nextInt(vectorSize) + 1;
      double value = this.rand.nextDouble();
      pmv1.set(index, value);
      hmv1.set(index, value);
    }
    for (int j = 0; j < nnz2; ++j) {
      int index = this.rand.nextInt(vectorSize) + 1;
      double value = this.rand.nextDouble();
      pmv2.set(index, value);
      hmv2.set(index, value);
    }
    SparseVector crsv1 = pmv1.buildSparseVector();
    SparseVector crsv2 = pmv2.buildSparseVector();
    hmvSW.start();
    hmv1.dot(hmv2);
    hmvSW.stop();
    pmvSW.start();
    pmv1.dot(pmv2);
    pmvSW.stop();
    crsSW.start();
    crsv1.dot(crsv2);
    crsSW.stop();
    baseSW.start();
    baseSW.stop();
  }
  hmvSW.logEndMessage();
  pmvSW.logEndMessage();
  crsSW.logEndMessage();
  baseSW.logEndMessage();
  log.debug("numTrials=" + numTrials + " vectorSize=" + vectorSize
      + " nnz1=" + nnz1 + " nnz2=" + nnz2
      + " loadFactor=" + (100 / sizeMultiple) + "%");
}

> Need a matrix implementation
> ----------------------------
>
>          Key: MAHOUT-6
>          URL: https://issues.apache.org/jira/browse/MAHOUT-6
>      Project: Mahout
>   Issue Type: New Feature
>     Reporter: Ted Dunning
>     Assignee: Grant Ingersoll
>  Attachments: MAHOUT-6a.diff, MAHOUT-6b.diff, MAHOUT-6c.diff,
>               MAHOUT-6d.diff, MAHOUT-6e.diff, MAHOUT-6f.diff, MAHOUT-6g.diff,
>               MAHOUT-6h.patch, MAHOUT-6i.diff, MAHOUT-6j.diff, MAHOUT-6k.diff,
>               MAHOUT-6l.patch
>
>
> We need matrices for Mahout.
> An initial set of basic requirements includes:
> a) sparse and dense support are required
> b) row and column labels are important
> c) serialization for hadoop use is required
> d) reasonable floating point performance is required, but awesome FP is not
> e) the API should be simple enough to understand
> f) it should be easy to carve out sub-matrices for sending to different
>    reducers
> g) a reasonable set of matrix operations should be supported; these should
>    eventually include:
>      simple matrix-matrix, matrix-vector, and matrix-scalar linear algebra
>      operations: A B, A + B, A v, A + x, v + x, u + v, dot(u, v)
>      row and column sums
>      generalized level 2 and 3 BLAS primitives: alpha A B + beta C and
>      A u + beta v
> h) easy and efficient iteration constructs, especially for sparse matrices
> i) easy to extend with new implementations

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
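For context on why the CRS representation wins the benchmark above: a CRS-style sparse vector keeps its nonzero indices in a sorted int[] with values in a parallel double[], so a dot product is a single linear merge of the two index arrays with no hashing and no Integer/Double boxing. A minimal sketch of that idea follows; the class and method names are illustrative, not the actual API from the attached patches.

```java
// Sketch of a CRS-style sparse dot product: each vector is a sorted int[]
// of nonzero indices plus a parallel double[] of values. The dot product
// walks both index arrays once, O(nnz1 + nnz2), with no hashing or
// autoboxing. Names here are illustrative, not the MAHOUT-6 patch API.
public class CrsDotSketch {

  static double dot(int[] idx1, double[] val1, int[] idx2, double[] val2) {
    double sum = 0.0;
    int i = 0;
    int j = 0;
    while (i < idx1.length && j < idx2.length) {
      if (idx1[i] == idx2[j]) {
        sum += val1[i++] * val2[j++]; // shared index: multiply, advance both
      } else if (idx1[i] < idx2[j]) {
        i++;                          // index present only in vector 1: contributes 0
      } else {
        j++;                          // index present only in vector 2: contributes 0
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    // u = {1: 2.0, 4: 3.0, 7: 1.0}, v = {4: 5.0, 7: 2.0, 9: 4.0}
    double d = dot(new int[] {1, 4, 7}, new double[] {2.0, 3.0, 1.0},
                   new int[] {4, 7, 9}, new double[] {5.0, 2.0, 4.0});
    System.out.println(d); // 3.0*5.0 + 1.0*2.0 = 17.0
  }
}
```

This also matches the benchmark's shape: the hash-map variants pay a lookup per nonzero regardless of load factor, while the merge touches each stored entry exactly once in array order, which is friendlier to the cache.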
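Requirement (g) asks for generalized level 2 BLAS primitives such as A u + beta v. The sketch below shows the closely related standard gemv form, y = alpha A x + beta y, as a dense example of what that primitive computes; the alpha scaling on the first term and all names here are my additions for illustration, not part of the proposed Mahout API.

```java
// Dense sketch of a generalized level-2 BLAS primitive (gemv):
// result <- alpha * A * x + beta * y. Illustrative only; not the
// MAHOUT-6 patch API.
public class GemvSketch {

  static double[] gemv(double alpha, double[][] a, double[] x,
                       double beta, double[] y) {
    double[] out = new double[a.length];
    for (int i = 0; i < a.length; ++i) {
      double rowDot = 0.0;
      for (int j = 0; j < x.length; ++j) {
        rowDot += a[i][j] * x[j]; // plain matrix-vector row dot product
      }
      out[i] = alpha * rowDot + beta * y[i];
    }
    return out;
  }

  public static void main(String[] args) {
    double[][] a = {{1, 2}, {3, 4}};
    double[] x = {1, 1};
    double[] y = {10, 20};
    // 2*(A x) + 1*y = 2*{3, 7} + {10, 20} = {16, 34}
    System.out.println(java.util.Arrays.toString(gemv(2.0, a, x, 1.0, y)));
  }
}
```

The appeal of exposing this one primitive, rather than separate multiply/scale/add methods, is that plain A u (alpha=1, beta=0) and accumulation into an existing vector both fall out as special cases without temporary allocations.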