GitHub user gnsiva opened a pull request:

    https://github.com/apache/spark/pull/19716

    [SPARK-18755][WIP][ML] Random search implementation using 
RandomParamGridBuilder

    ## What changes were proposed in this pull request?
    
    Python's `sklearn` implements random search as a complement to grid 
search. Random search often tunes hyperparameters more efficiently than an 
exhaustive grid.
    
    I have developed `RandomParamGridBuilder`, an alternative to 
`ParamGridBuilder` that enables random search in Spark. It uses random 
sampling to produce the same data structure that `ParamGridBuilder.build()` 
returns, so it is compatible with the existing `CrossValidator` class.
    
    The `sklearn` implementation supports the following kinds of random 
sampling, each of which has a counterpart here:
    
    - Sampling from a list of options, e.g. `[0.01, 0.001, 0.0001]`
      - This is handled using `addUniformChoice`
    - Sampling an integer/long/float/double between bounds
      - `RandomParamGridBuilder` has the `addUniformDistribution` method for this
    - Boolean sampling
      - `RandomParamGridBuilder.addUniformDistribution` supports this as well
    - Sampling from a more exotic distribution (e.g. beta or Cauchy)
      - Here the user can implement their own function which, when called, 
returns a value from the intended distribution, and add it using 
`addDistribution`, allowing full flexibility.
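
    As a rough illustration of the sampling modes listed above, here is a 
minimal, Spark-free sketch. It is not the code in this patch: the class name 
`RandomGridSketch`, all method signatures, the no-bounds overload used for 
Boolean sampling, the parameter names, and the use of `Map[String, Any]` in 
place of Spark's `ParamMap` are assumptions made for illustration only.

```scala
import scala.util.Random

// Hypothetical stand-in for the builder described above (NOT the patch's code).
// A "param map" is modelled as Map[String, Any] instead of Spark's ParamMap.
final class RandomGridSketch(seed: Long) {
  private val rng = new Random(seed)
  // Each entry pairs a parameter name with a function that draws one value.
  private var samplers = Vector.empty[(String, () => Any)]

  private def register(name: String, draw: () => Any): this.type = {
    samplers = samplers :+ (name -> draw)
    this
  }

  // Pick one value uniformly at random from a fixed list of options.
  def addUniformChoice[T](name: String, options: Seq[T]): this.type = {
    require(options.nonEmpty, "options must not be empty")
    val opts = options.toVector
    register(name, () => opts(rng.nextInt(opts.length)))
  }

  // Draw a Double uniformly from [low, high).
  def addUniformDistribution(name: String, low: Double, high: Double): this.type = {
    require(low < high, "low must be less than high")
    register(name, () => low + rng.nextDouble() * (high - low))
  }

  // Assumed overload: with no bounds, draw a Boolean with equal probability.
  def addUniformDistribution(name: String): this.type =
    register(name, () => rng.nextBoolean())

  // Use any caller-supplied sampler, e.g. for a beta or Cauchy distribution.
  def addDistribution(name: String, draw: () => Any): this.type =
    register(name, draw)

  // Produce numSamples independent param maps, one draw per parameter each.
  def build(numSamples: Int): Seq[Map[String, Any]] =
    Seq.fill(numSamples)(samplers.map { case (name, draw) => name -> draw() }.toMap)
}

object RandomGridSketchDemo {
  def main(args: Array[String]): Unit = {
    val cauchyRng = new Random(7L)
    val maps = new RandomGridSketch(seed = 42L)
      .addUniformChoice("regParam", Seq(0.01, 0.001, 0.0001))
      .addUniformDistribution("elasticNetParam", 0.0, 1.0)
      .addUniformDistribution("fitIntercept")
      // Standard Cauchy sample via the inverse CDF: tan(pi * (u - 0.5)).
      .addDistribution("exoticParam", () => math.tan(math.Pi * (cauchyRng.nextDouble() - 0.5)))
      .build(5)
    maps.foreach(println)
  }
}
```

    Because every call to `build` returns a plain sequence of param maps, a 
structure like this could be handed to anything that already accepts a grid, 
which is the compatibility property the description above relies on.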
    
    ## How was this patch tested?
    
    Several unit tests have been created for the `RandomParamGridBuilder` in 
`RandomParamGridBuilderSuite`. 
    In `CrossValidatorSuite`, two tests were changed to run with param maps 
created by `RandomParamGridBuilder` as well as those from `ParamGridBuilder`. 
One additional test was also added there (an altered version of a 
`ParamGridBuilder` test).

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/gnsiva/spark RandomParamGridBuilder

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19716.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19716
    
----
commit 12d194674218f5a6a076c2248faa81317e62ec82
Author: Ganesh N. Sivalingam <[email protected]>
Date:   2017-11-09T15:37:20Z

    Added RandomParamGridBuilder implementation

commit df47063052ae58dc4ca01398726ebbae7b727786
Author: Ganesh N. Sivalingam <[email protected]>
Date:   2017-11-09T15:37:50Z

    Added RandomParamGridBuilder unittests (which don't interact with CV)

commit ba58e83a0ad81ca1c98708e34daaae62b75dcc9f
Author: Ganesh N. Sivalingam <[email protected]>
Date:   2017-11-09T16:47:34Z

    Added 3 tests in the CrossValidatorSuite that use random search

commit f6fd40adcf4240c0d6675809ad4eddc72c23b85a
Author: Ganesh N. Sivalingam <[email protected]>
Date:   2017-11-09T16:55:39Z

    Simplified logistic regression test

commit cbbd0487525ddbbd7cdaaaa1e9e2fceae10f24a1
Author: Ganesh N. Sivalingam <[email protected]>
Date:   2017-11-09T17:02:29Z

    Style guide changes

----

