GitHub user sethah opened a pull request:

    https://github.com/apache/spark/pull/17094

    [SPARK-19762][ML] Hierarchy for consolidating ML aggregator/loss code

    ## What changes were proposed in this pull request?
    
    JIRA: [SPARK-19762](https://issues.apache.org/jira/browse/SPARK-19762)
    
    This patch is a WIP. 
    
    The larger changes in this patch are:
    
    * Adds a `DifferentiableLossAggregator` trait which is intended to be used
as a common parent trait to all Spark ML aggregator classes. It factors out the
common methods `merge`, `gradient`, `loss`, and `weight` from the aggregator
subclasses.
    * Adds a `RDDLossFunction` which is intended to be the only implementation 
of Breeze's `DiffFunction` necessary in Spark ML, and can be used by all other 
algorithms. It takes the aggregator type as a type parameter, and maps the 
aggregator over an RDD. It additionally takes an optional regularization loss
function for applying the differentiable part of regularization.
    * Factors out the regularization from the data part of the cost function, 
and treats regularization as a separate independent cost function which can be 
evaluated and added to the data cost function.
    * Changes `LinearRegression` to use this new hierarchy as a proof of 
concept.
    * Adds the following new namespaces: `o.a.s.ml.optim.loss` and
`o.a.s.ml.optim.aggregator`.
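
    The pieces listed above could fit together roughly as in the sketch below.
This is an illustration only, not the code in this patch: plain Scala
collections stand in for the RDD and for Breeze/Spark vectors, `calculate`
only mirrors the `(value, gradient)` shape of Breeze's `DiffFunction`, and
`Instance`, `LossFunction`, `L2Regularization`, the simplified
`LeastSquaresAggregator` and all signatures are assumptions made for the
example.

```scala
// Illustrative sketch only. Seq[Seq[Instance]] stands in for a partitioned
// RDD and Array[Double] for a vector; none of these signatures are the PR's.

// Hypothetical weighted training example.
case class Instance(label: Double, weight: Double, features: Array[Double])

// Common parent: holds the running sums and the shared methods
// merge, gradient, loss and weight. Subclasses only implement add.
trait DifferentiableLossAggregator[Agg <: DifferentiableLossAggregator[Agg]] {
  self: Agg =>

  def dim: Int
  var weightSum = 0.0
  var lossSum = 0.0
  lazy val gradientSum = new Array[Double](dim)

  // Per-instance update, specific to each loss.
  def add(instance: Instance): Agg

  // Combine the partial sums of two aggregators (e.g. from two partitions).
  def merge(other: Agg): Agg = {
    weightSum += other.weightSum
    lossSum += other.lossSum
    for (i <- 0 until dim) gradientSum(i) += other.gradientSum(i)
    this
  }

  def weight: Double = weightSum
  def loss: Double = lossSum / weightSum
  def gradient: Array[Double] = gradientSum.map(_ / weightSum)
}

// Simplified squared-error aggregator (no standardization or intercept).
class LeastSquaresAggregator(coefficients: Array[Double])
    extends DifferentiableLossAggregator[LeastSquaresAggregator] {

  def dim: Int = coefficients.length

  def add(instance: Instance): LeastSquaresAggregator = {
    val prediction =
      instance.features.zip(coefficients).map { case (x, c) => x * c }.sum
    val diff = prediction - instance.label
    lossSum += 0.5 * instance.weight * diff * diff
    for (i <- 0 until dim) {
      gradientSum(i) += instance.weight * diff * instance.features(i)
    }
    weightSum += instance.weight
    this
  }
}

// Regularization as its own differentiable cost function.
trait DifferentiableRegularization {
  def calculate(coefficients: Array[Double]): (Double, Array[Double])
}

// Hypothetical L2 penalty: 0.5 * regParam * ||coefficients||^2.
class L2Regularization(regParam: Double) extends DifferentiableRegularization {
  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val loss = 0.5 * regParam * coefficients.map(c => c * c).sum
    (loss, coefficients.map(_ * regParam))
  }
}

// Stand-in for the RDD-backed loss function: maps an aggregator over each
// partition, merges the results, then adds the optional regularization term.
class LossFunction[Agg <: DifferentiableLossAggregator[Agg]](
    partitions: Seq[Seq[Instance]],
    getAggregator: Array[Double] => Agg,
    regularization: Option[DifferentiableRegularization]) {

  def calculate(coefficients: Array[Double]): (Double, Array[Double]) = {
    val perPartition = partitions.map { part =>
      part.foldLeft(getAggregator(coefficients))((agg, inst) => agg.add(inst))
    }
    val agg =
      perPartition.foldLeft(getAggregator(coefficients))((a, b) => a.merge(b))
    val (regLoss, regGrad) = regularization
      .map(_.calculate(coefficients))
      .getOrElse((0.0, new Array[Double](coefficients.length)))
    val grad = agg.gradient.zip(regGrad).map { case (g, r) => g + r }
    (agg.loss + regLoss, grad)
  }
}
```

    In this sketch, adding another loss would only mean writing a new
aggregator with its own `add`; the merge logic, the loss function and the
regularization term are reused unchanged.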
    
    **NOTE: The large majority of the "lines added" and "lines deleted" are 
simply code moving around or unit tests.**
    
    BTW, I also converted LinearSVC to this framework as a way to prove that 
this new hierarchy is flexible enough for the other algorithms, but I backed 
those changes out because the PR is large enough as is. 
    
    ## How was this patch tested?
    Test suites are added for the new components, and some test suites are also 
added to provide coverage where there wasn't any before.
    
    * DifferentiableLossAggregatorSuite
    * LeastSquaresAggregatorSuite
    * RDDLossFunctionSuite
    * DifferentiableRegularizationSuite
    
    I would additionally like to run some performance/scale tests with linear 
regression to ensure that there are no regressions. This patch is WIP until I 
can complete the tests. Since the design will likely have some iteration, I'd 
like to have it open for review before the scale tests are done.
    
    ## Follow ups
    
    If this design is accepted, we will convert the other ML algorithms that
use this aggregator pattern to this new hierarchy in follow-up PRs.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/sethah/spark ml_aggregators

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17094.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17094
    
----
commit d6fae000d95284598e41d8bf95eb7067d8970e69
Author: sethah <[email protected]>
Date:   2017-02-27T19:03:03Z

    consolidate ml aggregators

commit 86b56001a82f43fe1342bb1c26c6edcce6523865
Author: sethah <[email protected]>
Date:   2017-02-27T20:29:14Z

    curried constructors

commit 06e547bdfb38d3b428a4a48c681aea989a11d625
Author: sethah <[email protected]>
Date:   2017-02-27T21:06:59Z

    self types and docs

commit c930ced63b5c1faebe8063c1bf90a26cf9fae2be
Author: sethah <[email protected]>
Date:   2017-02-27T22:25:27Z

    aggregator test suite

commit 6a596f23c855b2da0d9ba9133dee2f311dceb615
Author: sethah <[email protected]>
Date:   2017-02-27T23:03:16Z

    loss function suite

commit 4b36119652173fff30c5869694015e1519753a05
Author: sethah <[email protected]>
Date:   2017-02-27T23:50:24Z

    ls agg tests

commit ac55f06238cc9043ac2eaf282c3f8513a1a97076
Author: sethah <[email protected]>
Date:   2017-02-28T00:37:16Z

    all tests passing, still need tests for regularization

commit ab5151ea41cde7d898bd65b998f674da3a5975ea
Author: sethah <[email protected]>
Date:   2017-02-28T01:07:59Z

    regularization suite

commit 0366a8eefcef39c3251c9a7050944ada03bb4f47
Author: sethah <[email protected]>
Date:   2017-02-28T01:14:50Z

    backing out svc changes

commit 28b88e48027959e0574c9d13236daff44fcdf650
Author: sethah <[email protected]>
Date:   2017-02-28T01:50:56Z

    style cleanups and documentation

commit 9a04d0bc51bed29bca28a5e34ebc5b614b6560d2
Author: sethah <[email protected]>
Date:   2017-02-28T03:15:11Z

    tolerances and imports

----


