GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/17076

    [SPARK-19745][ML] SVCAggregator captures coefficients in its closure

    ## What changes were proposed in this pull request?
    
    JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)
    
    Reorganize SVCAggregator to avoid serializing coefficients. This patch also 
makes the gradient array a `lazy val`, which avoids materializing a large 
array on the driver before shipping the class to the executors. This 
improvement stems from https://github.com/apache/spark/pull/16037. Most likely 
all of the other ML aggregators can benefit from the same change. 
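
To illustrate the `lazy val` point, here is a minimal sketch using plain Java serialization. `EagerAgg`, `LazyAgg`, and `SerializedSize` are hypothetical stand-ins made up for this example; they are not the classes touched by the patch, and Spark may use a different serializer for the real aggregators.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in for an aggregator with an eagerly allocated buffer.
class EagerAgg(val numFeatures: Int) extends Serializable {
  // Allocated in the constructor, so the full array is serialized
  // whenever the instance is shipped.
  val gradientSum: Array[Double] = new Array[Double](numFeatures)
}

// Hypothetical stand-in for an aggregator with a lazily allocated buffer.
class LazyAgg(val numFeatures: Int) extends Serializable {
  // Allocated on first access. If nothing touches it before the instance
  // is serialized, no array data is written.
  lazy val gradientSum: Array[Double] = new Array[Double](numFeatures)
}

object SerializedSize {
  // Size in bytes of the Java-serialized form of `obj`.
  def of(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }
}
```

With `numFeatures = 10000`, the eager version should serialize to a bit over 80000 bytes, while the untouched lazy version should stay at a few hundred bytes of class metadata. Once `gradientSum` has been accessed, the lazy version serializes the array as well.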
    
    We can either (a) separate the gradient improvement into another patch, 
(b) keep what's here _plus_ add the lazy evaluation to all other aggregators 
in this patch, or (c) keep it as is.
    
    ## How was this patch tested?
    
    This is an interesting question! I don't know of a reasonable way to test 
this right now. Ideally, we could run an optimization, look at the shuffle 
write size for each task, and compare it to what we know it should be: 
`numCoefficients * 8 bytes`. I'm not sure there is a good way to do that right 
now. We could discuss this here or in another JIRA, but I suspect it would be 
a significant undertaking. For now, I verified through the web UI:
    
    **Before**
    
    
![image](https://cloud.githubusercontent.com/assets/7275795/23348865/eeb48382-fc62-11e6-8a97-48e262ee02b8.png)
    
    
    **After**
    
    
![image](https://cloud.githubusercontent.com/assets/7275795/23348872/f933fe14-fc62-11e6-8a1d-b5145775457e.png)
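
As a rough local approximation of the `numCoefficients * 8 bytes` check described above, one could compare Java-serialized sizes outside of Spark. This is only a sketch under assumptions: `CapturingAgg`, `LeanAgg`, and `SizeCheck` are hypothetical names, `@transient` is used here simply to keep the coefficients out of the serialized form, the update in `add` is a placeholder rather than the SVC gradient, and the result says nothing definitive about actual shuffle write sizes.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SizeCheck {
  // Java-serialized size in bytes; Spark may be configured with another serializer.
  def serializedSize(obj: AnyRef): Long = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size().toLong
  }
}

// Hypothetical "before": coefficients are used in a method, so they become
// a field of the class and are serialized along with the gradient buffer.
class CapturingAgg(coefficients: Array[Double]) extends Serializable {
  val gradientSum = new Array[Double](coefficients.length)
  def add(x: Array[Double]): this.type = {
    var i = 0
    // Placeholder update, not the real hinge-loss gradient.
    while (i < x.length) { gradientSum(i) += x(i) * coefficients(i); i += 1 }
    this
  }
}

// Hypothetical "after": coefficients are marked transient, so only the
// gradient buffer contributes to the serialized size.
class LeanAgg(@transient private val coefficients: Array[Double]) extends Serializable {
  val gradientSum = new Array[Double](coefficients.length)
  def add(x: Array[Double]): this.type = {
    var i = 0
    while (i < x.length) { gradientSum(i) += x(i) * coefficients(i); i += 1 }
    this
  }
}
```

For `n` coefficients, `SizeCheck.serializedSize(new LeanAgg(new Array[Double](n)))` should land slightly above `n * 8`, and the capturing version at roughly twice that. A transient field is `null` after deserialization, so a real aggregator would need another way to reach the coefficients on the executors.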
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark svc_agg

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17076.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17076
    
----
commit e87e3347042dbe1a6beb2fed2213da7e10a8abd9
Author: sethah <[email protected]>
Date:   2017-02-27T04:25:14Z

    performance cleanup in svc agg

----


