GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/17076
[SPARK-19745][ML] SVCAggregator captures coefficients in its closure
## What changes were proposed in this pull request?
JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)
Reorganize SVCAggregator so that the coefficients are no longer serialized
along with it. This patch also makes the gradient array a `lazy val`, which
avoids materializing a large array on the driver before shipping the class to
the executors. This improvement stems from
https://github.com/apache/spark/pull/16037. Most likely, all ML aggregators
can benefit from this.
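As a rough illustration of the pattern, here is a minimal sketch in plain Scala with no Spark dependency. The class name `HingeAggregatorSketch` and its members are hypothetical and are not the code in this patch. In Spark the coefficients would typically come from a broadcast variable; here they are simply passed to `add` as an array, to show that the aggregator itself does not need to hold them.

```scala
// Hypothetical sketch; names are illustrative, not Spark's actual API.
final class HingeAggregatorSketch(numFeatures: Int) extends Serializable {

  // Allocated on first access, so constructing the aggregator on the
  // driver does not materialize the array.
  lazy val gradientSum: Array[Double] = new Array[Double](numFeatures)
  private var lossSum = 0.0
  private var count = 0L

  /** Adds one instance. `label` is 0.0 or 1.0; `coefficients` is supplied by the caller. */
  def add(features: Array[Double], label: Double, coefficients: Array[Double]): this.type = {
    require(features.length == numFeatures && coefficients.length == numFeatures)
    var margin = 0.0
    var i = 0
    while (i < numFeatures) { margin += features(i) * coefficients(i); i += 1 }
    val y = 2.0 * label - 1.0 // map {0, 1} to {-1, +1}
    if (1.0 > y * margin) {   // hinge loss is active for this instance
      lossSum += 1.0 - y * margin
      i = 0
      while (i < numFeatures) { gradientSum(i) -= y * features(i); i += 1 }
    }
    count += 1
    this
  }

  /** Combines the partial sums of another aggregator into this one. */
  def merge(other: HingeAggregatorSketch): this.type = {
    if (other.count > 0) {
      var i = 0
      while (i < numFeatures) { gradientSum(i) += other.gradientSum(i); i += 1 }
      lossSum += other.lossSum
      count += other.count
    }
    this
  }

  def loss: Double = if (count == 0) 0.0 else lossSum / count
  def gradient: Array[Double] = gradientSum.map(_ / math.max(count, 1L))
}
```

Because the only array field is the lazily allocated gradient buffer, an instance created on the driver stays small until the first call to `add` on an executor.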
We can either (a) separate the gradient improvement into another patch, (b)
keep what's here _plus_ add the lazy evaluation to all other aggregators in
this patch, or (c) keep it as is.
## How was this patch tested?
This is an interesting question! I don't know of a reasonable way to test
this right now. Ideally, we could run an optimization and look at the
shuffle write data for each task, and compare the size to what we know it
should be: `numCoefficients * 8 bytes`. Not sure if there is a good way
to do that right now? We could discuss this here or in another JIRA, but I
suspect it would be a significant undertaking. For now, I verified through the
web ui:
**Before**

**After**

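The `numCoefficients * 8 bytes` expectation can also be approximated outside Spark. The sketch below is only a stand-in for a real shuffle-size test: it uses plain Java serialization (an assumption, since Spark may be configured with a different serializer), and the classes `CapturingAgg` and `LeanAgg` are hypothetical toy versions of the before and after shapes of an aggregator.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object SerializedSizeSketch {

  // Hypothetical "before": coefficients are a field, so they travel with the object.
  class CapturingAgg(val coefficients: Array[Double]) extends Serializable {
    val gradientSum: Array[Double] = new Array[Double](coefficients.length)
  }

  // Hypothetical "after": only a lazily allocated gradient buffer is held.
  class LeanAgg(numCoefficients: Int) extends Serializable {
    lazy val gradientSum: Array[Double] = new Array[Double](numCoefficients)
  }

  /** Size in bytes of `obj` under plain Java serialization. */
  def serializedSize(obj: AnyRef): Int = {
    val buffer = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(buffer)
    try out.writeObject(obj) finally out.close()
    buffer.size()
  }

  def main(args: Array[String]): Unit = {
    val n = 100000
    val lean = new LeanAgg(n)
    println(s"expected gradient payload: ${n * 8} bytes")
    println(s"lean, before first use:    ${serializedSize(lean)} bytes")
    lean.gradientSum(0) = 1.0 // forces allocation, as a first update on an executor would
    println(s"lean, after first use:     ${serializedSize(lean)} bytes")
    val capturing = new CapturingAgg(new Array[Double](n))
    println(s"capturing coefficients:    ${serializedSize(capturing)} bytes")
  }
}
```

Under these assumptions, the lean aggregator should serialize to roughly `numCoefficients * 8` bytes plus a small class-descriptor overhead once its gradient is allocated, while the capturing version carries about twice that.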
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark svc_agg
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17076.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17076
----
commit e87e3347042dbe1a6beb2fed2213da7e10a8abd9
Author: sethah <[email protected]>
Date: 2017-02-27T04:25:14Z
performance cleanup in svc agg
----