GitHub user sethah opened a pull request:
https://github.com/apache/spark/pull/17094
[SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code
## What changes were proposed in this pull request?
JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
This patch is a WIP.
The larger changes in this patch are:
* Adds a `DifferentiableLossAggregator` trait which is intended to be used
as a common parent trait to all Spark ML aggregator classes. It factors out the
common methods: `merge, gradient, loss, weight` from the aggregator subclasses.
* Adds a `RDDLossFunction` which is intended to be the only implementation
of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other
algorithms. It takes the aggregator type as a type parameter, and maps the
aggregator over an RDD. It additionally takes in an optional regularization loss
function for applying the differentiable part of regularization.
* Factors out the regularization from the data part of the cost function,
and treats regularization as a separate independent cost function which can be
evaluated and added to the data cost function.
* Changes `LinearRegression` to use this new hierarchy as a proof of
concept.
* Adds the following new namespaces: `o.a.s.ml.optim.loss` and
`o.a.s.ml.optim.aggregator`.
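
A minimal sketch of how the pieces described above might fit together. This is not the code from this patch: plain Scala collections stand in for an RDD, a simple `calculate` method stands in for Breeze's `DiffFunction`, and the names `Instance`, `LossFunction`, `DifferentiableRegularization`, `L2Regularization`, and `Example` (along with every signature shown) are assumptions made for illustration.

```scala
// Hypothetical, simplified stand-ins; NOT the classes added by the PR.
final case class Instance(label: Double, weight: Double, features: Array[Double])

// Common parent: subclasses implement only `add`; merge/loss/gradient/weight are shared.
trait DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] { self: Agg =>
  protected val dim: Int
  protected var weightSum: Double = 0.0
  protected var lossSum: Double = 0.0
  protected lazy val gradientSumArray: Array[Double] = Array.ofDim[Double](dim)

  def add(instance: Instance): Agg

  def merge(other: Agg): Agg = {
    require(dim == other.dim, "aggregators must have the same dimension")
    if (other.weightSum != 0.0) {
      weightSum += other.weightSum
      lossSum += other.lossSum
      var i = 0
      while (i < dim) { gradientSumArray(i) += other.gradientSumArray(i); i += 1 }
    }
    this
  }

  def weight: Double = weightSum
  def loss: Double = { require(weightSum > 0.0, "no instances added"); lossSum / weightSum }
  def gradient: Array[Double] = {
    require(weightSum > 0.0, "no instances added")
    gradientSumArray.map(_ / weightSum)
  }
}

// Squared-error data loss with no intercept and no feature scaling (assumed for brevity).
final class LeastSquaresAggregator(coefficients: Array[Double])
    extends DifferentiableLossAggregator[LeastSquaresAggregator] {
  protected val dim: Int = coefficients.length

  def add(instance: Instance): LeastSquaresAggregator = {
    val Instance(label, w, x) = instance
    if (w != 0.0) {
      var pred = 0.0
      var i = 0
      while (i < dim) { pred += coefficients(i) * x(i); i += 1 }
      val diff = pred - label
      i = 0
      while (i < dim) { gradientSumArray(i) += w * diff * x(i); i += 1 }
      lossSum += w * diff * diff / 2.0
      weightSum += w
    }
    this
  }
}

// Regularization as an independent cost function returning (loss, gradient).
trait DifferentiableRegularization {
  def calculate(coefficients: Array[Double]): (Double, Array[Double])
}

final class L2Regularization(regParam: Double) extends DifferentiableRegularization {
  def calculate(c: Array[Double]): (Double, Array[Double]) =
    (0.5 * regParam * c.map(v => v * v).sum, c.map(_ * regParam))
}

// Generic loss function: maps an aggregator over the data, then adds regularization.
// `partitions` is a Seq of Seqs standing in for an RDD's partitions.
final class LossFunction[Agg <: DifferentiableLossAggregator[Agg]](
    partitions: Seq[Seq[Instance]],
    getAggregator: Array[Double] => Agg,
    regularization: Option[DifferentiableRegularization]) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val agg = partitions
      .map(part => part.foldLeft(getAggregator(coefficients))((a, inst) => a.add(inst)))
      .reduce((a, b) => a.merge(b))
    val (regLoss, regGrad) = regularization
      .map(_.calculate(coefficients))
      .getOrElse((0.0, Array.fill(coefficients.length)(0.0)))
    (agg.loss + regLoss, agg.gradient.zip(regGrad).map { case (g, r) => g + r })
  }
}

object Example {
  // Three points on the line y = 2x, split across two "partitions".
  val partitions: Seq[Seq[Instance]] = Seq(
    Seq(Instance(2.0, 1.0, Array(1.0)), Instance(4.0, 1.0, Array(2.0))),
    Seq(Instance(6.0, 1.0, Array(3.0))))
  val plain = new LossFunction[LeastSquaresAggregator](
    partitions, c => new LeastSquaresAggregator(c), None)
  val ridge = new LossFunction[LeastSquaresAggregator](
    partitions, c => new LeastSquaresAggregator(c), Some(new L2Regularization(0.1)))
}
```

In this sketch, `Example.plain.calculate(Array(2.0))` evaluates to a zero loss and zero gradient because the data lie exactly on y = 2x, while `Example.ridge` adds only the L2 term on top of the same data loss. Supporting another algorithm would mean writing a new aggregator with its own `add`; the loss function and regularization classes would stay unchanged.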
**NOTE: The large majority of the "lines added" and "lines deleted" are
simply code moving around or unit tests.**
BTW, I also converted LinearSVC to this framework as a way to prove that
this new hierarchy is flexible enough for the other algorithms, but I backed
those changes out because the PR is large enough as is.
## How was this patch tested?
Test suites are added for the new components, and some test suites are also
added to provide coverage where there wasn't any before.
* DifferentiableLossAggregatorSuite
* LeastSquaresAggregatorSuite
* RDDLossFunctionSuite
* DifferentiableRegularizationSuite
I would additionally like to run some performance/scale tests with linear
regression to ensure that there are no regressions. This patch is WIP until I
can complete the tests. Since the design will likely have some iteration, I'd
like to have it open for review before the scale tests are done.
## Follow ups
If this design is accepted, we will convert the other ML algorithms that
use this aggregator pattern to the new hierarchy in follow-up PRs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sethah/spark ml_aggregators
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17094.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17094
----
commit d6fae000d95284598e41d8bf95eb7067d8970e69
Author: sethah <[email protected]>
Date: 2017-02-27T19:03:03Z
consolidate ml aggregators
commit 86b56001a82f43fe1342bb1c26c6edcce6523865
Author: sethah <[email protected]>
Date: 2017-02-27T20:29:14Z
curried constructors
commit 06e547bdfb38d3b428a4a48c681aea989a11d625
Author: sethah <[email protected]>
Date: 2017-02-27T21:06:59Z
self types and docs
commit c930ced63b5c1faebe8063c1bf90a26cf9fae2be
Author: sethah <[email protected]>
Date: 2017-02-27T22:25:27Z
aggregator test suite
commit 6a596f23c855b2da0d9ba9133dee2f311dceb615
Author: sethah <[email protected]>
Date: 2017-02-27T23:03:16Z
loss function suite
commit 4b36119652173fff30c5869694015e1519753a05
Author: sethah <[email protected]>
Date: 2017-02-27T23:50:24Z
ls agg tests
commit ac55f06238cc9043ac2eaf282c3f8513a1a97076
Author: sethah <[email protected]>
Date: 2017-02-28T00:37:16Z
all tests passing, still need tests for regularization
commit ab5151ea41cde7d898bd65b998f674da3a5975ea
Author: sethah <[email protected]>
Date: 2017-02-28T01:07:59Z
regularization suite
commit 0366a8eefcef39c3251c9a7050944ada03bb4f47
Author: sethah <[email protected]>
Date: 2017-02-28T01:14:50Z
backing out svc changes
commit 28b88e48027959e0574c9d13236daff44fcdf650
Author: sethah <[email protected]>
Date: 2017-02-28T01:50:56Z
style cleanups and documentation
commit 9a04d0bc51bed29bca28a5e34ebc5b614b6560d2
Author: sethah <[email protected]>
Date: 2017-02-28T03:15:11Z
tolerances and imports
----