GitHub user dbtsai opened a pull request:

    https://github.com/apache/spark/pull/3288

    [SPARK-4431][MLlib] Implement efficient activeIterator for dense and sparse vector

    Previously, we used Breeze's activeIterator to access the non-zero
    elements of a sparse vector, and explicitly skipped the zeros in
    dense/sparse vectors using pattern matching. Due to the overhead,
    we switched back to a native `while` loop in SPARK-4129.
    
    However, the SPARK-4129 code de-references dv.values/sv.values on
    every access to a value, and any zeros in the dense or sparse vector
    are skipped inside the add function call. The overall penalty is
    around 10% compared with de-referencing once outside the while block
    and checking for zero before calling the add function. The code is
    also branched for dense and sparse vectors, which makes it hard to
    maintain in the long term.
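
    A minimal sketch of the loop shape described above: the values array is
    read into a local once, outside the while block, and zeros are filtered
    before the add callback is invoked. DenseVec, SparseVec, HoistedLoops and
    the `add` callback are hypothetical stand-ins for illustration, not the
    actual MLlib classes or the code in this PR.

```scala
// Hypothetical stand-ins for dense and sparse vectors (not Spark's classes).
final case class DenseVec(values: Array[Double])
final case class SparseVec(size: Int, indices: Array[Int], values: Array[Double])

object HoistedLoops {
  // Dense case: de-reference `values` once, then skip zeros before calling `add`.
  def foreachNonZeroDense(dv: DenseVec)(add: (Int, Double) => Unit): Unit = {
    val values = dv.values // read once, outside the loop
    val n = values.length
    var i = 0
    while (i < n) {
      val v = values(i)
      if (v != 0.0) add(i, v) // zero check happens before the call
      i += 1
    }
  }

  // Sparse case: same idea, but the index comes from the `indices` array.
  def foreachNonZeroSparse(sv: SparseVec)(add: (Int, Double) => Unit): Unit = {
    val indices = sv.indices // read once, outside the loop
    val values = sv.values
    val n = values.length
    var k = 0
    while (k < n) {
      val v = values(k)
      if (v != 0.0) add(indices(k), v) // explicitly stored zeros are skipped too
      k += 1
    }
  }
}
```

    The two loops differ only in how the index is obtained, which is the
    duplication that a shared abstraction would remove.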
    
    This activeIterator implementation not only improves performance;
    the abstraction for accessing the non-zero elements of different
    vector types also helps the maintainability of the codebase. In this PR,
    only MultivariateOnlineSummarizer uses the new API, as an example;
    other code can be migrated to activeIterator later.
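
    To illustrate the kind of abstraction meant here, the sketch below gives
    dense and sparse vectors a common activeIterator, so a caller no longer
    needs to branch on the vector type. SimpleVector, SimpleDense,
    SimpleSparse and ColumnStats are hypothetical names, and the
    Iterator[(Int, Double)] signature is an assumption made for this sketch;
    the signature used in the PR may differ.

```scala
// Hypothetical vector hierarchy; names and signature are assumptions.
sealed trait SimpleVector {
  def size: Int
  // Yields (index, value) pairs for non-zero entries only.
  def activeIterator: Iterator[(Int, Double)]
}

final class SimpleDense(val values: Array[Double]) extends SimpleVector {
  def size: Int = values.length
  def activeIterator: Iterator[(Int, Double)] = {
    val vs = values // read once, before iterating
    Iterator.range(0, vs.length).filter(i => vs(i) != 0.0).map(i => (i, vs(i)))
  }
}

final class SimpleSparse(val size: Int, val indices: Array[Int], val values: Array[Double])
    extends SimpleVector {
  def activeIterator: Iterator[(Int, Double)] = {
    val is = indices
    val vs = values
    // Explicitly stored zeros are filtered out as well.
    Iterator.range(0, vs.length).filter(k => vs(k) != 0.0).map(k => (is(k), vs(k)))
  }
}

object ColumnStats {
  // A caller in the style of a summarizer: one code path for both vector types.
  def nnzAndSum(v: SimpleVector): (Int, Double) = {
    var nnz = 0
    var sum = 0.0
    v.activeIterator.foreach { case (_, x) =>
      nnz += 1
      sum += x
    }
    (nnz, sum)
  }
}
```

    An iterator of tuples allocates per element on the JVM, so this sketch
    shows the API shape only and is not tuned for speed.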
    
    Benchmarked with the mnist8m dataset on a single JVM, with the first
    200 samples loaded in memory, and repeated 5000 times.
    
    Before change: 
    Sparse Vector - 30.02
    Dense Vector - 38.27
    
    After this optimization:
    Sparse Vector - 27.54
    Dense Vector - 35.13


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/AlpineNow/spark activeIterator

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3288.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #3288
    
----
commit 101c2eafb250b428f1b244e7f8057e63400f8f4e
Author: DB Tsai <[email protected]>
Date:   2014-11-13T07:08:13Z

    Finished SPARK-4431

----


