[ 
https://issues.apache.org/jira/browse/MAHOUT-1190?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13631787#comment-13631787
 ] 

Dan Filimon commented on MAHOUT-1190:
-------------------------------------

I'm working on a patch now. I'd also like to improve aggregate() and I'd love 
to see nearly all functions implemented in terms of aggregate() and assign().
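
As a rough illustration of that idea (not the actual Mahout code), the sketch below shows how a few common operations reduce to two primitives. It uses plain double[] arrays and java.util.function types; the class name AggregateAssignSketch and the helper names dot, norm2 and plus are hypothetical and chosen only for this example.

```java
import java.util.function.DoubleBinaryOperator;
import java.util.function.DoubleUnaryOperator;

// Hypothetical sketch: dense arrays stand in for vectors.
final class AggregateAssignSketch {

    // assign: v[i] = f(v[i], other[i]) for every i, in place.
    static void assign(double[] v, double[] other, DoubleBinaryOperator f) {
        for (int i = 0; i < v.length; i++) {
            v[i] = f.applyAsDouble(v[i], other[i]);
        }
    }

    // aggregate (one vector): fold map(v[i]) together with the aggregator.
    static double aggregate(double[] v, DoubleBinaryOperator aggregator,
                            DoubleUnaryOperator map) {
        if (v.length == 0) {
            return 0.0; // assumption: an empty vector aggregates to 0
        }
        double result = map.applyAsDouble(v[0]);
        for (int i = 1; i < v.length; i++) {
            result = aggregator.applyAsDouble(result, map.applyAsDouble(v[i]));
        }
        return result;
    }

    // aggregate (two vectors): fold combiner(a[i], b[i]) with the aggregator.
    static double aggregate(double[] a, double[] b,
                            DoubleBinaryOperator aggregator,
                            DoubleBinaryOperator combiner) {
        if (a.length == 0) {
            return 0.0; // assumption: an empty vector aggregates to 0
        }
        double result = combiner.applyAsDouble(a[0], b[0]);
        for (int i = 1; i < a.length; i++) {
            result = aggregator.applyAsDouble(result,
                    combiner.applyAsDouble(a[i], b[i]));
        }
        return result;
    }

    // Dot product: sum of element-wise products.
    static double dot(double[] a, double[] b) {
        return aggregate(a, b, Double::sum, (x, y) -> x * y);
    }

    // Euclidean norm: square root of the sum of squares.
    static double norm2(double[] v) {
        return Math.sqrt(aggregate(v, Double::sum, x -> x * x));
    }

    // In-place addition: v += other.
    static void plus(double[] v, double[] other) {
        assign(v, other, Double::sum);
    }

    public static void main(String[] args) {
        double[] a = {1, 2, 3};
        double[] b = {4, 5, 6};
        System.out.println("dot = " + dot(a, b));
        plus(a, b);
        System.out.println("a[0] after plus = " + a[0]);
    }
}
```

With this shape, any per-storage optimization only has to be written once, inside assign() and aggregate(), and every function built on top of them picks it up.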

I'm also failing some tests right now, but one thing I noticed I was getting 
wrong is that you should only ever use the iterator when dealing with vectors 
for which isSequentialAccess() == true.
Otherwise, the order of the operations matters.
This doesn't seem like it should be a problem for clustering since from what 
I've seen it's only doing additions when adding a new point to a cluster.

Anyway, I'm also looking at the failing tests and since I'm changing more 
things, you can leave this to me.
I'll ping when I have a patch.

Thanks a lot!
                
> SequentialAccessSparseVector function assignment is very slow
> -------------------------------------------------------------
>
>                 Key: MAHOUT-1190
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1190
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: Dan Filimon
>         Attachments: MAHOUT-1190-1.patch, MAHOUT-1190-iterator-fix.patch, 
> MAHOUT-1190-iterator-fix.patch, MAHOUT-1190-iterator-fix.patch, 
> MAHOUT-1190.patch, MAHOUT-1190-seq-dot-product.patch, 
> MAHOUT-1190-seq-dot-product.patch
>
>
> Currently, when calling .assign() on a SASV with another vector and a custom 
> function, it iterates through the vector and assigns every single entry while 
> also referring to it by index.
> This makes the process *hugely* expensive. (On a run of BallKMeans on the 20 
> newsgroups data set, profiling reveals that 92% of the runtime was spent 
> assigning the vectors.)
> Here's a prototype patch:
> https://github.com/dfilimon/mahout/commit/63998d82bb750150a6ae09052dadf6c326c62d3d
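
To make the cost described in the issue concrete, here is a minimal sketch (not the linked patch, and not Mahout's real class) of a sequential-access sparse vector stored as parallel arrays of sorted indices and values. The class SeqSparseSketch and the methods assignByIndex and assignByMerge are hypothetical. Both methods assume the function satisfies f(x, 0) == x, as addition does, so entries missing from the other vector can be left untouched.

```java
import java.util.Arrays;
import java.util.function.DoubleBinaryOperator;

// Hypothetical stand-in for a sequential-access sparse vector.
final class SeqSparseSketch {
    int[] indices;   // sorted, ascending
    double[] values; // values[k] belongs to indices[k]
    int size;        // number of stored entries

    SeqSparseSketch(int[] indices, double[] values) {
        this.indices = indices.clone();
        this.values = values.clone();
        this.size = indices.length;
    }

    // Lookup by index: a binary search on every call.
    double get(int index) {
        int pos = Arrays.binarySearch(indices, 0, size, index);
        return pos >= 0 ? values[pos] : 0.0;
    }

    // Write by index: binary search, plus an array shift on insertion.
    void set(int index, double value) {
        int pos = Arrays.binarySearch(indices, 0, size, index);
        if (pos >= 0) {
            values[pos] = value;
            return;
        }
        int ins = -pos - 1;
        if (size == indices.length) {
            int cap = Math.max(4, size * 2);
            indices = Arrays.copyOf(indices, cap);
            values = Arrays.copyOf(values, cap);
        }
        System.arraycopy(indices, ins, indices, ins + 1, size - ins);
        System.arraycopy(values, ins, values, ins + 1, size - ins);
        indices[ins] = index;
        values[ins] = value;
        size++;
    }

    // Slow path: every entry of other triggers a get() and a set() by index.
    void assignByIndex(SeqSparseSketch other, DoubleBinaryOperator f) {
        for (int k = 0; k < other.size; k++) {
            int i = other.indices[k];
            set(i, f.applyAsDouble(get(i), other.values[k]));
        }
    }

    // Fast path: one linear merge over both sorted index arrays.
    void assignByMerge(SeqSparseSketch other, DoubleBinaryOperator f) {
        int[] outIdx = new int[size + other.size];
        double[] outVal = new double[size + other.size];
        int a = 0, b = 0, n = 0;
        while (a < size || b < other.size) {
            if (b == other.size || (a < size && indices[a] < other.indices[b])) {
                // Only this vector has the entry; f(x, 0) == x keeps it as-is.
                outIdx[n] = indices[a];
                outVal[n++] = values[a++];
            } else if (a == size || other.indices[b] < indices[a]) {
                // Only the other vector has the entry.
                outIdx[n] = other.indices[b];
                outVal[n++] = f.applyAsDouble(0.0, other.values[b++]);
            } else {
                // Both vectors have the entry.
                outIdx[n] = indices[a];
                outVal[n++] = f.applyAsDouble(values[a++], other.values[b++]);
            }
        }
        indices = outIdx;
        values = outVal;
        size = n;
    }

    public static void main(String[] args) {
        SeqSparseSketch v = new SeqSparseSketch(new int[]{1, 4, 7}, new double[]{1, 2, 3});
        SeqSparseSketch o = new SeqSparseSketch(new int[]{0, 4, 9}, new double[]{10, 20, 30});
        v.assignByMerge(o, Double::sum);
        System.out.println(Arrays.toString(Arrays.copyOf(v.indices, v.size)));
        System.out.println(Arrays.toString(Arrays.copyOf(v.values, v.size)));
    }
}
```

In this sketch the by-index path pays a binary search for each read and a search plus a possible array shift for each write, while the merge path touches each stored entry once. Whether the real patch uses a merge of this kind is not stated in the issue text.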

