[
https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13630519#comment-13630519
]
Robin Anil commented on MAHOUT-1190:
------------------------------------
Sure, I agree with that, but from a practical POV, when we do ML, are we really
looking at 50 million non-zeros? With the hashing trick, we have seen that the
same quality can be achieved faster at lower dimensions, even with collisions.
As you can see for datasets that mimic real-world data, like the ones in
VectorBenchmarks, the "cache-friendliness" is not visible, at least on Java 6.
I still remember 3 years ago when Jake and I spent almost 2 nights straight
tuning Vectors and bringing K-Means down from 30 min to 1.5 min per iteration.
There is a lot to be done here, and it has the potential to impact a lot of
algorithms in Mahout.
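For readers unfamiliar with the hashing trick mentioned above, here is a
minimal sketch of the idea in Java. It is not Mahout's encoder code; the class
name HashingTrickSketch, the method hashFeatures, and the use of
String.hashCode() as the hash function are all assumptions made for
illustration. Each token is mapped to one of a fixed number of buckets, so the
vector dimension stays small and colliding tokens simply share a slot.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the hashing trick; not Mahout's implementation.
public class HashingTrickSketch {

    // Maps every token to one of numBuckets slots and accumulates counts.
    // Tokens that hash to the same bucket collide and share a count.
    static double[] hashFeatures(List<String> tokens, int numBuckets) {
        double[] vector = new double[numBuckets];
        for (String token : tokens) {
            // floorMod keeps the bucket non-negative even for negative hash codes.
            int bucket = Math.floorMod(token.hashCode(), numBuckets);
            vector[bucket] += 1.0;
        }
        return vector;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("mahout", "vector", "mahout");
        // 16 buckets is an arbitrary choice for the example.
        System.out.println(Arrays.toString(hashFeatures(tokens, 16)));
    }
}
```

The number of buckets trades memory and speed against collision rate; the
point made above is that a fairly small dimension is often good enough.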
On that note, I have seen a lot of attention recently on
JBLAS (http://mikiobraun.github.io/jblas/). For next-gen performance
improvements in Mahout, it may be worthwhile for us to invest in that, as
opposed to tweaking bytecode and expecting Java to do the right thing. I have
recently prototyped JavaScript vector code that computes cosine distance faster
than Java using asm.js (http://kripken.github.io/mloc_emscripten_talk). Java is
still unusable for large-scale mathy stuff here at Google.
> SequentialAccessSparseVector function assignment is very slow
> -------------------------------------------------------------
>
> Key: MAHOUT-1190
> URL: https://issues.apache.org/jira/browse/MAHOUT-1190
> Project: Mahout
> Issue Type: Bug
> Reporter: Dan Filimon
>
> Currently, when calling .assign() on a SASV (SequentialAccessSparseVector)
> with another vector and a custom function, it iterates through the vector and
> assigns every single entry, accessing each one by index.
> This makes the process *hugely* expensive (on a run of BallKMeans on the 20
> newsgroups data set, profiling reveals that 92% of the runtime was spent
> assigning to the vectors).
> Here's a prototype patch:
> https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
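To illustrate why index-by-index assignment is costly on a sequential-access
sparse layout, here is a rough sketch in Java (8 or later). It is not the
linked patch and not Mahout's SequentialAccessSparseVector: the Sparse class
and the mergeAssign method are hypothetical names, and the sketch assumes the
function satisfies f(0, 0) == 0 so that indices absent from both vectors can
be skipped. With sorted parallel arrays, each random get costs a binary
search, whereas a single merge pass over both index arrays touches every
non-zero exactly once.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Hypothetical illustration; not Mahout's classes and not the linked patch.
public class SparseAssignSketch {

    /** Sparse vector stored as sorted parallel arrays. */
    static final class Sparse {
        final int[] indices;   // strictly increasing
        final double[] values; // values[k] belongs to indices[k]

        Sparse(int[] indices, double[] values) {
            this.indices = indices;
            this.values = values;
        }

        // Random access needs a binary search: O(log nnz) per call.
        double get(int index) {
            int pos = Arrays.binarySearch(indices, index);
            return pos >= 0 ? values[pos] : 0.0;
        }
    }

    // One merge pass over both sorted index arrays: O(nnz(a) + nnz(b)).
    // Assumes f(0, 0) == 0, so indices missing from both inputs are skipped.
    static Sparse mergeAssign(Sparse a, Sparse b, DoubleBinaryOperator f) {
        int[] outIdx = new int[a.indices.length + b.indices.length];
        double[] outVal = new double[outIdx.length];
        int i = 0, j = 0, n = 0;
        while (i < a.indices.length || j < b.indices.length) {
            boolean hasA = i < a.indices.length;
            boolean hasB = j < b.indices.length;
            int index;
            if (hasA && hasB) {
                index = Math.min(a.indices[i], b.indices[j]);
            } else {
                index = hasA ? a.indices[i] : b.indices[j];
            }
            // Take a value from each side only if it sits at the current index.
            double x = (hasA && a.indices[i] == index) ? a.values[i++] : 0.0;
            double y = (hasB && b.indices[j] == index) ? b.values[j++] : 0.0;
            double r = f.applyAsDouble(x, y);
            if (r != 0.0) { // keep the result sparse
                outIdx[n] = index;
                outVal[n] = r;
                n++;
            }
        }
        return new Sparse(Arrays.copyOf(outIdx, n), Arrays.copyOf(outVal, n));
    }

    public static void main(String[] args) {
        Sparse a = new Sparse(new int[] {1, 5}, new double[] {2.0, 3.0});
        Sparse b = new Sparse(new int[] {5, 9}, new double[] {4.0, 1.0});
        Sparse sum = mergeAssign(a, b, Double::sum);
        System.out.println(Arrays.toString(sum.indices) + " " + Arrays.toString(sum.values));
    }
}
```

Functions that do not map (0, 0) to 0 would still need a pass over every
index, so a real implementation would have to check for that case.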