GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/15593
[SPARK-18060][ML] Avoid unnecessary computation for MLOR
## What changes were proposed in this pull request?
Before this patch, the gradient updates for multinomial logistic regression
were computed with an outer loop over the classes and an inner loop over the
features. Inside the inner loop we standardized the feature value
(`value / featuresStd(index)`), so that division was performed
`numFeatures * numClasses` times even though only `numFeatures` distinct
results are needed. Swapping the inner and outer loops would avoid the
repeated work, but with row-major coefficients it would give up sequential
memory access. This patch instead lays out the coefficients in column-major
order during training, which avoids the extra computation and keeps memory
access sequential. We convert back to row-major order when we create the
model, since the matrix-vector multiply required by predict accesses the
coefficients in row-major order.
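
To make the loop reordering concrete, here is a minimal plain-Python sketch (not the Scala code in this patch). The function names, the `multipliers` argument (standing in for the per-class gradient multipliers), and the guard that skips zero values and zero standard deviations are assumptions for illustration. It counts how many standardizations each ordering performs and shows that the column-major result, once converted, matches the row-major one.

```python
def gradient_row_major(features, features_std, multipliers):
    """Outer loop over classes, inner loop over features.
    Coefficient index = class * num_features + feature (row-major)."""
    num_classes, num_features = len(multipliers), len(features)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0
    for k in range(num_classes):
        for j, value in enumerate(features):
            if features_std[j] != 0.0 and value != 0.0:
                std_value = value / features_std[j]  # repeated for every class
                divisions += 1
                grad[k * num_features + j] += multipliers[k] * std_value
    return grad, divisions


def gradient_col_major(features, features_std, multipliers):
    """Outer loop over features, inner loop over classes.
    Coefficient index = feature * num_classes + class (column-major)."""
    num_classes, num_features = len(multipliers), len(features)
    grad = [0.0] * (num_classes * num_features)
    divisions = 0
    for j, value in enumerate(features):
        if features_std[j] != 0.0 and value != 0.0:
            std_value = value / features_std[j]  # computed once per feature
            divisions += 1
            base = j * num_classes
            for k in range(num_classes):
                grad[base + k] += multipliers[k] * std_value  # sequential writes
    return grad, divisions


def col_major_to_row_major(values, num_classes, num_features):
    """Reorder a flat column-major array into row-major order."""
    out = [0.0] * len(values)
    for j in range(num_features):
        for k in range(num_classes):
            out[k * num_features + j] = values[j * num_classes + k]
    return out


# Hypothetical inputs: 3 features (one of them zero) and 2 classes.
features = [2.0, 0.0, 3.0]
features_std = [1.0, 2.0, 0.5]
multipliers = [0.1, -0.4]

row_grad, row_divisions = gradient_row_major(features, features_std, multipliers)
col_grad, col_divisions = gradient_col_major(features, features_std, multipliers)
converted = col_major_to_row_major(col_grad, len(multipliers), len(features))
```

In this sketch both orderings write to consecutive array slots in their inner loop. Only the column-major version does so while dividing once per feature rather than once per feature per class.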
## How was this patch tested?
This is an implementation detail only, so the original behavior should be
maintained. All tests pass. I also ran performance tests to verify the
speedups; the results below show training times reduced by 43% to 80%.
## Performance Tests
**Setup**
3 node bare-metal cluster
120 cores total
384 GB RAM total
**Results**
| | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup |
|---|-----------|-------------|------------|----------|-----------------|-------------------------|---------------------|------------|
| 0 | 1e+07 | 100 | 500 | 0.5 | 0 | 90 | 18 | 80 |
| 1 | 1e+08 | 100 | 50 | 0.5 | 0 | 90 | 19 | 78 |
| 2 | 1e+08 | 100 | 50 | 0.05 | 1 | 72 | 19 | 73 |
| 3 | 1e+06 | 100 | 5000 | 0.5 | 0 | 93 | 53 | 43 |
| 4 | 1e+07 | 100 | 5000 | 0.5 | 0 | 900 | 390 | 56 |
| 5 | 1e+08 | 100 | 500 | 0.5 | 0 | 840 | 174 | 79 |
| 6 | 1e+08 | 100 | 200 | 0.5 | 0 | 360 | 72 | 80 |
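
The `pctSpeedup` column appears to be the relative reduction in wall-clock time, truncated to a whole percent. That reading is inferred from the numbers rather than stated in the PR, and the helper name below is hypothetical. A short Python check under that assumption:

```python
def pct_speedup(master_sec, patch_sec):
    """Relative time reduction in percent, truncated to an integer (assumed formula)."""
    return (master_sec - patch_sec) * 100 // master_sec


# (currentMasterTime, thisPatchTime) pairs copied from the results table, rows 0 to 6.
timings = [(90, 18), (90, 19), (72, 19), (93, 53), (900, 390), (840, 174), (360, 72)]
speedups = [pct_speedup(master, patch) for master, patch in timings]
```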
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark MLOR_PERF_COL_MAJOR_COEF
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15593.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15593
----
commit 4c19abebe0b78bcd26fc142ef6787517e1e4482d
Author: sethah <[email protected]>
Date: 2016-10-21T17:19:50Z
tests pass except initial model
commit fcab96a3d608ca49d8a8963f79a277163d87ddce
Author: sethah <[email protected]>
Date: 2016-10-21T19:49:39Z
initialModel passes
commit 07fd1504136ad7b1ce37f443e26f407b07345991
Author: sethah <[email protected]>
Date: 2016-10-21T23:01:27Z
clean up and refactoring exp op in log agg
----