GitHub user mengxr opened a pull request:

    https://github.com/apache/spark/pull/5820

    [SPARK-5956][MLLIB] Pipeline components should be copyable.

    This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a 
copy of the current instance with a randomly generated uid and some extra param 
values. With this change, we can only implement `fit` and `transform` without 
extra param values given the default implementation of `fit(dataset, extra)`:
    
    ~~~scala
    def fit(dataset: DataFrame, extra: ParamMap): Model = {
      copy(extra).fit(dataset)
    }
    ~~~
    
    Inside `fit` and `transform`, since only the embedded values are used, I 
added `$` as an alias for `getOrDefault` to make the code easier to read. For 
example, in `LinearRegression.fit` we have:
    
    ~~~scala
    val effectiveRegParam = $(regParam) / yStd
    val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
    val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
    ~~~
    
    Meta-algorithm like `Pipeline` implements its own `copy(extra)`. So the 
fitted pipeline model stored all copied stages (no matter whether it is a 
transformer or a model).
    
    Other changes:
    * `Params$.inheritValues` is moved to `Params!.copyValues` and returns the 
target instance.
    * `fittingParamMap` was removed because the `parent` carries this 
information.
    * `validate` was renamed to `validateParams` to be more precise.
    
    TODOs:
    * [ ] add tests for newly added methods
    * [ ] update documentation
    
    @jkbradley

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-5956

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/5820.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5820
    
----
commit f082a310794dcbecf17ec3d44ab770beed84f956
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T08:30:57Z

    make Params copyable and simply handling of extra params in all spark.ml 
components

commit d882afc0956c59da78b5e0172f86e9c1f5432028
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T08:48:04Z

    test compile

commit 9ee004e7c534db7a6ab0434eec97feca86c7daad
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T16:52:58Z

    merge copy and copyWith; rename validate to validateParams

commit 53e097373e9c9707bb7486adc3c20fee384356c1
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T17:34:50Z

    move inheritValues to Params and rename it to copyValues

commit 9286a228c8e141dbcd66eaf21e9cfb29bc1d344a
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T17:40:02Z

    copyValues to trained models

commit 0f4fd64740763792087405f8921d4cd2bf00217a
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T18:07:38Z

    fix some tests

commit c76b4d120da94350efcb4638c8a24e4fbeabb177
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T20:02:24Z

    fix all unit tests

commit 5a67779b20aeb0db7c824a96f5b1cc2e7718f7c5
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T20:16:32Z

    examples compile

commit b642872be33a250ef1cefbbb51c3bd8a3d37b7ba
Author: Xiangrui Meng <[email protected]>
Date:   2015-04-30T20:35:31Z

    example code runs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to