[ https://issues.apache.org/jira/browse/MAHOUT-1640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14379885#comment-14379885 ]
ASF GitHub Bot commented on MAHOUT-1640:
----------------------------------------
GitHub user vigna opened a pull request:
https://github.com/apache/mahout/pull/81
Sparse-vector implementation based on fastutil (MAHOUT-1640)
The collections currently used by Mahout to implement sparse vectors are
extremely slow. The proposed patch (localized to RandomAccessSparseVector) uses
fastutil's maps and the speed improvements in vector benchmarks are very
significant. It would be interesting to see whether these improvements
percolate to high-level classes using sparse vectors.
I had to patch two unit tests (an off-by-one bug and an overfitting bug;
both were exposed by the different order in which key/value pairs are
returned by iterators).
Some more speed might be gained by using the standard
java.util.Map.Entry interface everywhere instead of Element.
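To illustrate why iteration order matters for such tests, here is a minimal sketch of a hash-backed sparse vector. It is not Mahout's RandomAccessSparseVector: the class SparseVectorSketch and its methods are hypothetical, and a boxed java.util.HashMap stands in for fastutil's primitive maps so that the example has no dependencies. The point is only that a hash map makes no promise about the order of its entries, so a test should compare an order-independent view.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for a hash-backed sparse vector (not Mahout code). */
final class SparseVectorSketch {
    // Assumption: a boxed HashMap replaces fastutil's primitive map
    // purely to keep the sketch self-contained.
    private final Map<Integer, Double> values = new HashMap<>();

    /** Stores a value; zeros are deleted rather than stored. */
    void set(int index, double value) {
        if (value == 0.0) {
            values.remove(index);
        } else {
            values.put(index, value);
        }
    }

    /** Returns the stored value, or 0.0 for an absent index. */
    double get(int index) {
        return values.getOrDefault(index, 0.0);
    }

    /** Number of explicitly stored (nonzero) entries. */
    int nonZeroCount() {
        return values.size();
    }

    /** Nonzero entries sorted by index, independent of hash iteration order. */
    TreeMap<Integer, Double> sortedNonZeroes() {
        return new TreeMap<>(values);
    }
}
```

A test that asserts on sortedNonZeroes() keeps passing when the backing map is swapped, whereas one that asserts on raw iteration order may not.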
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/vigna/mahout fastutil
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/mahout/pull/81.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #81
----
commit 632a01765a402678ece486ca836e8a215f5bea68
Author: Sebastiano Vigna <[email protected]>
Date: 2015-02-05T22:38:18Z
Sparse-vector implementation based on fastutil.
commit 047c87f20b0d79243eabdfae66d2ca29a5a1895f
Author: Sebastiano Vigna <[email protected]>
Date: 2015-03-25T13:45:44Z
Updated fastutil to 7.0.2.
----
> Better collections would significantly improve vector-operation speed
> ---------------------------------------------------------------------
>
> Key: MAHOUT-1640
> URL: https://issues.apache.org/jira/browse/MAHOUT-1640
> Project: Mahout
> Issue Type: Improvement
> Components: collections
> Environment: Darwin lithium.local 14.1.0 Darwin Kernel Version
> 14.1.0: Mon Dec 22 23:10:38 PST 2014; root:xnu-2782.10.72~2/RELEASE_X86_64
> x86_64 i386 MacBookPro10,1 Darwin
> java version "1.8.0_31"
> Java(TM) SE Runtime Environment (build 1.8.0_31-b13)
> Java HotSpot(TM) 64-Bit Server VM (build 25.31-b07, mixed mode)
> Reporter: Sebastiano Vigna
> Labels: legacy, math, scala
> Attachments: fastutil.patch, speed-fastutil, speed-std
>
>
> The collections currently used by Mahout to implement sparse vectors are
> extremely slow. The proposed patch (localized to RandomAccessSparseVector)
> uses fastutil's maps and the speed improvements in vector benchmarks are very
> significant. It would be interesting to see whether these improvements
> percolate to high-level classes using sparse vectors.
> I had to patch two unit tests (an off-by-one bug and an overfitting bug; both
> were exposed by the different order in which key/value pairs are returned by
> iterators).
> The included files speed-std and speed-fastutil show the speed improvement.
> Some more speed might be gained by using the standard
> java.util.Map.Entry interface everywhere instead of Element.
> DISCLAIMER: The "Times" set of tests was run by multiplying two identical
> vectors. The standard tests multiply two random vectors, so in practice they
> only test the speed of the underlying map's remove() method, because almost
> all products are zero. This is not very realistic and heavily penalized
> fastutil's "true deletions". Better tests, with a typical overlap of nonzero
> entries, would be more realistic still.
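The effect described in the disclaimer can be sketched as follows, assuming the element-wise product is computed in place and entries whose product is zero are deleted from the map. The class TimesOverlapSketch and the method timesInPlace are hypothetical names, and java.util.HashMap again stands in for the real collections; this is not the code of Mahout's benchmark.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Hypothetical sketch of an in-place element-wise product (not Mahout code). */
final class TimesOverlapSketch {
    /**
     * Multiplies a by b element-wise, in place.
     * Returns how many entries had to be removed from a because
     * the product was zero.
     */
    static int timesInPlace(Map<Integer, Double> a, Map<Integer, Double> b) {
        int removed = 0;
        Iterator<Map.Entry<Integer, Double>> it = a.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<Integer, Double> e = it.next();
            // An index absent from b counts as zero.
            double product = e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            if (product == 0.0) {
                it.remove();          // deletion path: dominant for random vectors
                removed++;
            } else {
                e.setValue(product);  // update path: dominant for identical vectors
            }
        }
        return removed;
    }
}
```

With two identical vectors every index overlaps, so nothing is removed; with two vectors that share no indices every entry is removed, and the run measures little besides deletion.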
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)