To my mind, the interesting size/shape is a cardinality of 1M-100M with
1K-10K non-zeros.  The non-zeros should be sampled from a non-uniform
distribution.
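
A minimal sketch of how test vectors of that shape might be generated. It
assumes that "non-uniform" refers to the positions of the non-zeros, and it
uses a simple power transform of a uniform draw to favour low indices. The
class name, method name, and skew parameter are all hypothetical; nothing
here is Mahout API.

```java
import java.util.Random;
import java.util.TreeSet;

/** Hypothetical helper, not part of Mahout: draws skewed non-zero indices. */
public class SkewedSparseSample {

  /**
   * Returns numNonZeros distinct indices in [0, cardinality), sorted ascending.
   * Each draw is floor(cardinality * u^skew) with u uniform in [0, 1).
   * skew = 1 is a uniform draw; larger values push the mass toward index 0.
   * Duplicates are rejected, so this is meant for numNonZeros much smaller
   * than cardinality.
   */
  public static int[] sampleIndices(int cardinality, int numNonZeros, double skew, long seed) {
    if (numNonZeros > cardinality) {
      throw new IllegalArgumentException("numNonZeros must not exceed cardinality");
    }
    Random rng = new Random(seed);
    TreeSet<Integer> chosen = new TreeSet<>();
    while (chosen.size() < numNonZeros) {
      int index = (int) (cardinality * Math.pow(rng.nextDouble(), skew));
      // Guard against floating-point rounding up to cardinality itself.
      chosen.add(Math.min(index, cardinality - 1));
    }
    int[] result = new int[numNonZeros];
    int i = 0;
    for (int index : chosen) {
      result[i++] = index;
    }
    return result;
  }

  public static void main(String[] args) {
    // 1M cardinality, 1K non-zeros: the low end of the range discussed above.
    int[] indices = sampleIndices(1_000_000, 1_000, 3.0, 42L);
    int inFirstTenth = 0;
    for (int index : indices) {
      if (index < 100_000) {
        inFirstTenth++;
      }
    }
    System.out.println(indices.length + " non-zeros, " + inFirstTenth
        + " of them in the first 10% of the index range");
  }
}
```

With a uniform draw roughly 10% of the indices would land in the first tenth
of the range; the skewed draw puts a much larger share there, which is the
kind of clustering a benchmark should probably exercise.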

Can you point me at the benchmarks you are using?


On Sun, Apr 14, 2013 at 8:56 AM, Robin Anil (JIRA) <[email protected]> wrote:

>
> [https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631336#comment-13631336]
>
> Robin Anil commented on MAHOUT-1190:
> ------------------------------------
>
> Here is a summary of the improvements across two sparsities (100:1, 10:1).
> At 10K cardinality DenseVectors are about 3x faster on CosineDistance than
> the sparse format.
>
> https://docs.google.com/spreadsheet/ccc?key=0AhewTD_ZgznddGFQbWJCQTZXSnFULUYzdURfWDRJQlE#gid=1
>
> > SequentialAccessSparseVector function assignment is very slow
> > -------------------------------------------------------------
> >
> >                 Key: MAHOUT-1190
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-1190
> >             Project: Mahout
> >          Issue Type: Bug
> >            Reporter: Dan Filimon
> >         Attachments: MAHOUT-1190-1.patch, MAHOUT-1190.patch
> >
> >
> > Currently when calling .assign() on a SASV with another vector and a
> > custom function, it will iterate through it and assign every single entry
> > while also referring to it by index.
> > This makes the process *hugely* expensive. (On a run of BallKMeans on
> > the 20 newsgroups data set, profiling reveals that 92% of the runtime was
> > spent assigning the vectors.)
> > Here's a prototype patch:
> >
> > https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
>
> --
> This message is automatically generated by JIRA.
> If you think it was sent incorrectly, please contact your JIRA
> administrators
> For more information on JIRA, see: http://www.atlassian.com/software/jira
>
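
To make the cost described in the quoted issue concrete, here is a rough
sketch of the two access patterns. SortedSparseVector is a hypothetical
stand-in with sorted parallel arrays, not Mahout's
SequentialAccessSparseVector, and the merge shown is one plausible fix rather
than a description of what the patch does. Both methods assume f(0, 0) == 0,
so that entries absent from both vectors can stay absent.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

/** Hypothetical sorted-array sparse vector; not Mahout's implementation. */
public class SortedSparseVector {
  public final int cardinality;
  public final int[] indices;    // sorted ascending
  public final double[] values;  // values[i] belongs to indices[i]

  public SortedSparseVector(int cardinality, int[] indices, double[] values) {
    this.cardinality = cardinality;
    this.indices = indices;
    this.values = values;
  }

  /** Lookup by index: a binary search over the sorted indices. */
  public double get(int index) {
    int pos = Arrays.binarySearch(indices, index);
    return pos >= 0 ? values[pos] : 0.0;
  }

  /** Slow pattern: visits every index in [0, cardinality) and looks each one up. */
  public SortedSparseVector assignByIndex(SortedSparseVector other, DoubleBinaryOperator f) {
    int[] outIdx = new int[indices.length + other.indices.length];
    double[] outVal = new double[outIdx.length];
    int n = 0;
    for (int i = 0; i < cardinality; i++) {
      double v = f.applyAsDouble(get(i), other.get(i));
      if (v != 0.0) {
        outIdx[n] = i;
        outVal[n++] = v;
      }
    }
    return new SortedSparseVector(cardinality, Arrays.copyOf(outIdx, n), Arrays.copyOf(outVal, n));
  }

  /** Faster pattern: one merge pass over the non-zeros of both vectors. */
  public SortedSparseVector assignByMerge(SortedSparseVector other, DoubleBinaryOperator f) {
    int[] outIdx = new int[indices.length + other.indices.length];
    double[] outVal = new double[outIdx.length];
    int a = 0, b = 0, n = 0;
    while (a < indices.length || b < other.indices.length) {
      // An exhausted side gets a sentinel larger than any valid index.
      int ia = a < indices.length ? indices[a] : Integer.MAX_VALUE;
      int ib = b < other.indices.length ? other.indices[b] : Integer.MAX_VALUE;
      int i = Math.min(ia, ib);
      double x = ia == i ? values[a++] : 0.0;
      double y = ib == i ? other.values[b++] : 0.0;
      double v = f.applyAsDouble(x, y);
      if (v != 0.0) {
        outIdx[n] = i;
        outVal[n++] = v;
      }
    }
    return new SortedSparseVector(cardinality, Arrays.copyOf(outIdx, n), Arrays.copyOf(outVal, n));
  }

  public static void main(String[] args) {
    SortedSparseVector a = new SortedSparseVector(10, new int[] {1, 5}, new double[] {1.0, 2.0});
    SortedSparseVector b = new SortedSparseVector(10, new int[] {5, 7}, new double[] {3.0, 4.0});
    SortedSparseVector slow = a.assignByIndex(b, Double::sum);
    SortedSparseVector fast = a.assignByMerge(b, Double::sum);
    System.out.println(Arrays.equals(slow.indices, fast.indices)
        && Arrays.equals(slow.values, fast.values));
  }
}
```

The by-index version does work proportional to the cardinality, with a
binary search per index, while the merge does work proportional to the number
of non-zeros. At 1M cardinality and 1K non-zeros that gap is about three
orders of magnitude before the searches are even counted.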
