[GitHub] spark pull request #15593: [SPARK-18060][ML] Avoid unnecessary computation f...

sethah Fri, 21 Oct 2016 16:32:57 -0700

GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/15593


    [SPARK-18060][ML] Avoid unnecessary computation for MLOR

    ## What changes were proposed in this pull request?
    
    Before this patch, the gradient updates for multinomial logistic regression 
were computed by an outer loop over the number of classes and an inner loop 
over the number of features. Inside the inner loop, we standardized the feature 
value (`value / featuresStd(index)`), which means we performed the computation 
`numFeatures * numClasses` times. We only need to perform that computation 
`numFeatures` times, however. Swapping the inner and outer loops would avoid 
the redundant work, but we would then lose sequential memory access. In this 
patch, we instead lay out the coefficients in column-major order while we 
train, so that we can avoid the extra computation and retain sequential memory 
access.
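
The loop re-ordering described above can be sketched in plain Python. This is only an illustration, not the Spark implementation: the function names, the `multipliers` argument (standing in for the per-class gradient multiplier), and the flat-list layout are assumptions made for the example. It counts how many times the standardization `value / featuresStd(index)` is evaluated under each ordering.

```python
def gradient_row_major(features, features_std, multipliers):
    """Outer loop over classes, inner loop over features (pre-patch ordering).

    Gradient layout is row-major: index = class * num_features + feature.
    """
    num_features, num_classes = len(features), len(multipliers)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0
    for k in range(num_classes):
        for j in range(num_features):
            # The standardization is recomputed for every class.
            std_value = features[j] / features_std[j]
            divisions += 1
            grad[k * num_features + j] += multipliers[k] * std_value
    return grad, divisions


def gradient_col_major(features, features_std, multipliers):
    """Outer loop over features, inner loop over classes (ordering in this patch).

    Gradient layout is column-major: index = feature * num_classes + class,
    so the inner loop still walks memory sequentially.
    """
    num_features, num_classes = len(features), len(multipliers)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0
    for j in range(num_features):
        # The standardization is computed once per feature.
        std_value = features[j] / features_std[j]
        divisions += 1
        for k in range(num_classes):
            grad[j * num_classes + k] += multipliers[k] * std_value
    return grad, divisions
```

On a toy input with 3 features and 2 classes, the first variant performs 6 divisions and the second performs 3, and both produce the same gradient entries once the index mapping between the two layouts is taken into account.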
    
    We convert back to row-major order when we create the model, since the 
matrix-vector multiply required by predict will access the coefficients in 
row-major order.
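
A minimal sketch of that conversion, again in plain Python rather than Spark's own code. The helper names `col_major_to_row_major` and `margins` are hypothetical, and the coefficients are assumed to be stored as a flat list with no intercept terms.

```python
def col_major_to_row_major(coef_col_major, num_classes, num_features):
    """Re-pack a flat column-major coefficient array into row-major order."""
    row_major = [0.0] * (num_classes * num_features)
    for j in range(num_features):
        for k in range(num_classes):
            row_major[k * num_features + j] = coef_col_major[j * num_classes + k]
    return row_major


def margins(coef_row_major, features, num_classes):
    """Matrix-vector multiply: one dot product per class.

    Each class reads a contiguous slice of the row-major array.
    """
    num_features = len(features)
    result = []
    for k in range(num_classes):
        start = k * num_features
        result.append(
            sum(coef_row_major[start + j] * features[j] for j in range(num_features))
        )
    return result
```

For a 2 x 3 coefficient matrix with rows `[1, 2, 3]` and `[4, 5, 6]`, the column-major array `[1, 4, 2, 5, 3, 6]` converts to `[1, 2, 3, 4, 5, 6]`, and each class's margin is then computed from a contiguous slice.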
    
    ## How was this patch tested?
    
    This is an implementation detail only, so the original behavior should be 
maintained. All tests pass. I ran some performance tests to verify speedups. 
The results are below, and show training time reduced by 43% to 80%.
    
    ## Performance Tests
    
    **Setup**
    
    3 node bare-metal cluster
    120 cores total
    384 GB RAM total
    
    
    **Results**
    
    |    | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup |
    |----|-----------|-------------|------------|----------|-----------------|-------------------------|---------------------|------------|
    |  0 |     1e+07 |         100 |        500 |     0.5  |               0 |                      90 |                  18 |         80 |
    |  1 |     1e+08 |         100 |         50 |     0.5  |               0 |                      90 |                  19 |         78 |
    |  2 |     1e+08 |         100 |         50 |     0.05 |               1 |                      72 |                  19 |         73 |
    |  3 |     1e+06 |         100 |       5000 |     0.5  |               0 |                      93 |                  53 |         43 |
    |  4 |     1e+07 |         100 |       5000 |     0.5  |               0 |                     900 |                 390 |         56 |
    |  5 |     1e+08 |         100 |        500 |     0.5  |               0 |                     840 |                 174 |         79 |
    |  6 |     1e+08 |         100 |        200 |     0.5  |               0 |                     360 |                  72 |         80 |
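
The `pctSpeedup` column is not defined in the description. The values are consistent with the percentage reduction in wall-clock time, truncated to a whole number; that reading is an inference from the table, not something the patch states. A small sketch under that assumption, with a hypothetical helper name:

```python
def pct_speedup(current_master_sec, this_patch_sec):
    """Percentage reduction in run time, truncated to an integer.

    Assumed formula: (current - patch) / current * 100, using integer
    arithmetic so the truncation is exact.
    """
    return (current_master_sec - this_patch_sec) * 100 // current_master_sec


# (currentMasterTime, thisPatchTime) pairs copied from the results table.
timings = [(90, 18), (90, 19), (72, 19), (93, 53), (900, 390), (840, 174), (360, 72)]
speedups = [pct_speedup(cur, new) for cur, new in timings]
```

Applied to the seven rows of the table, this reproduces the reported column: 80, 78, 73, 43, 56, 79, 80.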

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark MLOR_PERF_COL_MAJOR_COEF

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15593.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15593
    
----
commit 4c19abebe0b78bcd26fc142ef6787517e1e4482d
Author: sethah <[email protected]>
Date:   2016-10-21T17:19:50Z

    tests pass except initial model

commit fcab96a3d608ca49d8a8963f79a277163d87ddce
Author: sethah <[email protected]>
Date:   2016-10-21T19:49:39Z

    initialModel passes

commit 07fd1504136ad7b1ce37f443e26f407b07345991
Author: sethah <[email protected]>
Date:   2016-10-21T23:01:27Z

    clean up and refactoring exp op in log agg

----


