[ 
https://issues.apache.org/jira/browse/SYSTEMML-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niketan Pansare updated SYSTEMML-1962:
--------------------------------------
    Description: 
The end goal of this JIRA is to support a model selection facility similar to 
[http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection].

Currently, we support model selection using MLPipeline's cross-validator. For 
example, replace `from pyspark.ml.classification import LogisticRegression` 
with `from systemml.mllearn import LogisticRegression` in the example at 
http://spark.apache.org/docs/2.1.1/ml-tuning.html#example-model-selection-via-cross-validation

However, this invokes k separate and independent MLContext calls. This JIRA 
proposes adding new classes `GridSearchCV`, `RandomizedSearchCV` and possibly 
a Bayesian optimization variant which, like the existing mllearn estimators, 
expose the methods `fit` and `predict`. When `fit` is called, these classes 
internally generate a script that wraps the external script in a `parfor`. 
For example:

{code}
from sklearn import datasets
from systemml.mllearn import GridSearchClassifierCV, SVM
iris = datasets.load_iris()
parameters = {'C':[1, 10]}
svm = SVM()
clf = GridSearchClassifierCV(svm, parameters)
clf.fit(iris.data, iris.target)
{code}

would execute the script:
{code}
CVals = matrix("1; 10", rows=2, cols=1)
parfor(i in seq(1, nrow(CVals))) {
   # as.scalar converts the 1x1 indexing result into a scalar
   C = as.scalar(CVals[i, 1])
   # SVM script
}
{code}
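
The wrapping step can be sketched in plain Python as follows. This is only an illustration of the idea, not the proposed implementation: `build_parfor_script` is a hypothetical helper that does not exist in `systemml.mllearn`, the generated variable name `grid` is made up, and the sketch assumes that a DML string-initialized matrix takes whitespace-separated values filled row by row. With more than one parameter it emits one row per combination of values.

```python
from itertools import product


def build_parfor_script(param_grid, body):
    """Return a DML script string that runs `body` once per grid point.

    param_grid maps a DML variable name to a list of numeric values.
    Hypothetical helper for illustration; not part of systemml.mllearn.
    """
    names = sorted(param_grid)
    # One row per parameter combination, one column per parameter.
    rows = list(product(*(param_grid[name] for name in names)))
    cells = " ".join(str(value) for row in rows for value in row)
    lines = [
        'grid = matrix("%s", rows=%d, cols=%d)' % (cells, len(rows), len(names)),
        "parfor(i in 1:nrow(grid)) {",
    ]
    for col, name in enumerate(names, start=1):
        # Bind each parameter to the value in row i of the grid.
        lines.append("   %s = as.scalar(grid[i, %d])" % (name, col))
    # Indent the wrapped script so it sits inside the parfor body.
    lines.extend("   " + line for line in body.splitlines())
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_parfor_script({"C": [1, 10]}, "# SVM script"))
```

For the single-parameter grid `{'C': [1, 10]}` this produces a script of the same shape as the one above: a 2x1 matrix of candidate values and a `parfor` over its rows.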

This will require:
1. Functionization of the script (for example: L2SVM)
{code}
svm = function(matrix[double] X, matrix[double] Y, double icpt, double tol, 
double reg, double maxiter) returns (matrix[double] w) {
   if(nrow(X) < 2)
      stop("Stopping due to invalid inputs: Not possible to learn a binary class classifier without at least 2 rows")
   check_min = min(Y)
   ....

   w = t(cbind(t(w), t(extra_model_params)))
}
{code}

2. Adding two new Scala classes in the package `org.apache.sysml.api.ml`: 
`GridSearchClassifierCV`, which extends `Estimator[GridSearchClassifierCVModel]`, 
and `GridSearchClassifierCVModel`, which `extends 
Model[GridSearchClassifierCVModel] with BaseSystemMLClassifierModel`. You will 
then have to implement their abstract methods, `fit` and `transform` 
respectively.

3. Adding a Python class `GridSearchClassifierCV` that invokes the above JVM 
classes.

For more details on steps 2 and 3, please read the design documentation of 
the mllearn API: 
https://github.com/apache/systemml/blob/master/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala#L42

[~dusenberrymw] maybe this can be part of 
https://issues.apache.org/jira/browse/SYSTEMML-1159



> Support model-selection via mllearn APIs
> ----------------------------------------
>
>                 Key: SYSTEMML-1962
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1962
>             Project: SystemML
>          Issue Type: New Feature
>            Reporter: Niketan Pansare



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
