[ 
https://issues.apache.org/jira/browse/SPARK-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213826#comment-14213826
 ] 

Apache Spark commented on SPARK-4431:
-------------------------------------

User 'dbtsai' has created a pull request for this issue:
https://github.com/apache/spark/pull/3288

> Implement efficient activeIterator for dense and sparse vector
> --------------------------------------------------------------
>
>                 Key: SPARK-4431
>                 URL: https://issues.apache.org/jira/browse/SPARK-4431
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>
> Previously, we were using Breeze's activeIterator to access the non-zero 
> elements in sparse vector, and explicitly skipping the zero in dense/sparse 
> vector using pattern matching. Due to the overhead, we switched back to 
> native `while loop` in #SPARK-4129.
> However, #SPARK-4129 requires de-reference the dv.values/sv.values in each 
> access to the value, and the zeros in dense vector and sparse vector if exist 
> are skipped in the add function call; the overall penalty will be around 10% 
> compared with de-reference once outside the while block, and checking if zero 
> before calling the add function. The code is branched out for dense and 
> sparse vector, and it's not easy to maintain in the long term.
> Not only this activeIterator implementation increases the performance, but 
> the abstraction of accessing the non-zero elements in different vector type 
> also helps the maintainability of codebase. In this PR, only 
> MultivariateOnlineSummarizer uses new API as example, and others can be 
> migrated to activeIterator later. 
> Benchmarking with mnist8m dataset on single JVM with first 200 samples loaded 
> in memory, and repeating 5000 times. 
> Before change: 
> Sparse Vector - 30.02
> Dense Vector - 38.27
> After this optimization:
> Sparse Vector - 27.54
> Dense Vector - 35.13



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to