GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/13729

    [SPARK-16008] Remove unnecessary serialization in logistic regression

    ## What changes were proposed in this pull request?
    `LogisticAggregator` stores references to two arrays of dimension 
`numFeatures`, which are serialized unnecessarily before the combine op. As a 
result, the shuffle write is ~3x larger than it should be (in MLlib, for 
instance, it is 3x smaller), and this factor will go up for multiclass logistic 
regression.
    
    This patch modifies `LogisticAggregator.add` to accept the two arrays as 
method parameters, which avoids the serialization.
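    
    To illustrate the idea outside of Spark, here is a minimal sketch in plain 
Java. It is not the code from this patch: `FieldAggregator`, `ParamAggregator`, 
and `SerializationSketch` are hypothetical names, the gradient update is a 
simplified binary logistic update with no intercept or instance weights, and 
standard Java serialization stands in for whatever serializer Spark is 
configured to use. It only shows why holding the two read-only arrays as fields 
inflates the serialized aggregator, while passing them to `add` does not.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical "before" shape: the two read-only arrays are fields, so they
// are written out every time the aggregator itself is serialized.
class FieldAggregator implements Serializable {
    private final double[] coefficients; // read-only input, length numFeatures
    private final double[] featuresStd;  // read-only input, length numFeatures
    private final double[] gradientSum;  // the only state that must be merged

    FieldAggregator(double[] coefficients, double[] featuresStd) {
        this.coefficients = coefficients;
        this.featuresStd = featuresStd;
        this.gradientSum = new double[coefficients.length];
    }

    void add(double[] features, double label) {
        SerializationSketch.accumulate(gradientSum, features, label, coefficients, featuresStd);
    }

    double[] gradientSum() { return gradientSum; }
}

// Hypothetical "after" shape: the arrays are passed to add(), so the
// aggregator carries only its own accumulated state.
class ParamAggregator implements Serializable {
    private final double[] gradientSum;

    ParamAggregator(int numFeatures) {
        this.gradientSum = new double[numFeatures];
    }

    void add(double[] features, double label, double[] coefficients, double[] featuresStd) {
        SerializationSketch.accumulate(gradientSum, features, label, coefficients, featuresStd);
    }

    double[] gradientSum() { return gradientSum; }
}

class SerializationSketch {
    // Simplified binary logistic gradient update on standardized features
    // (assumption: no intercept, no instance weights).
    static void accumulate(double[] gradientSum, double[] features, double label,
                           double[] coefficients, double[] featuresStd) {
        double margin = 0.0;
        for (int i = 0; i < features.length; i++) {
            if (featuresStd[i] != 0.0) {
                margin += coefficients[i] * features[i] / featuresStd[i];
            }
        }
        double multiplier = 1.0 / (1.0 + Math.exp(-margin)) - label;
        for (int i = 0; i < features.length; i++) {
            if (featuresStd[i] != 0.0) {
                gradientSum[i] += multiplier * features[i] / featuresStd[i];
            }
        }
    }

    // Number of bytes produced by standard Java serialization of obj.
    static int serializedSize(Serializable obj) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        int numFeatures = 10_000;
        double[] coefficients = new double[numFeatures];
        double[] featuresStd = new double[numFeatures];
        java.util.Arrays.fill(featuresStd, 1.0);

        int withFields = serializedSize(new FieldAggregator(coefficients, featuresStd));
        int withParams = serializedSize(new ParamAggregator(numFeatures));
        System.out.println("arrays as fields:     " + withFields + " bytes");
        System.out.println("arrays as parameters: " + withParams + " bytes");
    }
}
```

    In this sketch the field-based aggregator serializes three `double` arrays 
of length `numFeatures` and the parameter-based one serializes a single array, 
so the size ratio comes out close to 3, in line with the ~3x figure above.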
    
    ## How was this patch tested?
    
    I tested this locally and verified the serialization reduction. 
    
    
![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)
    
    Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB 
RAM). A data set of 2M rows and 10k features showed a >2x speedup per iteration. 


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark lr_improvement

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/13729.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #13729
    
----
commit 2a1e632e747620f6fa0f2fcabce9431c6061e742
Author: sethah <[email protected]>
Date:   2016-06-16T23:24:52Z

    remove unnecessary serialization

commit 86505b75cd083c82a3cdefe1221eb6ed4e9750bb
Author: sethah <[email protected]>
Date:   2016-06-17T01:28:10Z

    comments

commit ef8fdea808052846055979c642b5f47255ee9e3d
Author: sethah <[email protected]>
Date:   2016-06-17T03:45:34Z

    dimension corrections

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
