[jira] [Updated] (SPARK-4431) Implement efficient activeIterator for dense and sparse vector

DB Tsai (JIRA) Fri, 21 Nov 2014 15:15:06 -0800

     [ 
https://issues.apache.org/jira/browse/SPARK-4431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


DB Tsai updated SPARK-4431:
---------------------------
    Description: 
Previously, we were using Breeze's activeIterator to access the non-zero 
elements 
in dense/sparse vector. Due to the overhead, we switched back to native while 
loop 
in #SPARK-4129.

However, #SPARK-4129 requires de-reference the dv.values/sv.values in 
each access to the value, which is very expensive. Also, in 
MultivariateOnlineSummarizer,
we're using Breeze's dense vector to store the partial stats, and this is very 
expensive compared
with using primitive scala array.

In this PR, efficient foreachActive is implemented to unify the code path for 
dense and sparse
vector operation which makes codebase easier to maintain. Breeze dense vector 
is replaced 
by primitive array to reduce the overhead further.

Benchmarking with mnist8m dataset on single JVM 
with first 200 samples loaded in memory, and repeating 5000 times.

Before change: 
Sparse Vector - 30.02
Dense Vector - 38.27

With this PR:
Sparse Vector - 6.29
Dense Vector - 11.72

  was:
Previously, we were using Breeze's activeIterator to access the non-zero 
elements in sparse vector, and explicitly skipping the zero in dense/sparse 
vector using pattern matching. Due to the overhead, we switched back to native 
`while loop` in #SPARK-4129.

However, #SPARK-4129 requires de-reference the dv.values/sv.values in each 
access to the value, and the zeros in dense vector and sparse vector if exist 
are skipped in the add function call; the overall penalty will be around 10% 
compared with de-reference once outside the while block, and checking if zero 
before calling the add function. The code is branched out for dense and sparse 
vector, and it's not easy to maintain in the long term.

Not only this activeIterator implementation increases the performance, but the 
abstraction of accessing the non-zero elements in different vector type also 
helps the maintainability of codebase. In this PR, only 
MultivariateOnlineSummarizer uses new API as example, and others can be 
migrated to activeIterator later. 

Benchmarking with mnist8m dataset on single JVM with first 200 samples loaded 
in memory, and repeating 5000 times. 

Before change: 
Sparse Vector - 30.02
Dense Vector - 38.27

After this optimization:
Sparse Vector - 27.54
Dense Vector - 35.13


> Implement efficient activeIterator for dense and sparse vector
> --------------------------------------------------------------
>
>                 Key: SPARK-4431
>                 URL: https://issues.apache.org/jira/browse/SPARK-4431
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>            Reporter: DB Tsai
>            Assignee: DB Tsai
>
> Previously, we were using Breeze's activeIterator to access the non-zero 
> elements 
> in dense/sparse vector. Due to the overhead, we switched back to native while 
> loop 
> in #SPARK-4129.
> However, #SPARK-4129 requires de-reference the dv.values/sv.values in 
> each access to the value, which is very expensive. Also, in 
> MultivariateOnlineSummarizer,
> we're using Breeze's dense vector to store the partial stats, and this is 
> very expensive compared
> with using primitive scala array.
> In this PR, efficient foreachActive is implemented to unify the code path for 
> dense and sparse
> vector operation which makes codebase easier to maintain. Breeze dense vector 
> is replaced 
> by primitive array to reduce the overhead further.
> Benchmarking with mnist8m dataset on single JVM 
> with first 200 samples loaded in memory, and repeating 5000 times.
> Before change: 
> Sparse Vector - 30.02
> Dense Vector - 38.27
> With this PR:
> Sparse Vector - 6.29
> Dense Vector - 11.72



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (SPARK-4431) Implement efficient activeIterator for dense and sparse vector

Reply via email to