[ https://issues.apache.org/jira/browse/SYSTEMML-1962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Niketan Pansare updated SYSTEMML-1962:
--------------------------------------
    Description:
The end goal of this JIRA is to support a model selection facility similar to [http://scikit-learn.org/stable/modules/classes.html#module-sklearn.model_selection].

Currently, we support model selection using MLPipeline's cross-validator. For example, replace `from pyspark.ml.classification import LogisticRegression` with `from systemml.mllearn import LogisticRegression` in the example at http://spark.apache.org/docs/2.1.1/ml-tuning.html#example-model-selection-via-cross-validation.

However, this invokes k separate and independent MLContext calls. This JIRA proposes to add new classes `GridSearchCV`, `RandomizedSearchCV`, and possibly Bayesian optimization, which, like the mllearn estimators, have `fit` and `predict` methods. When `fit` is called, these methods internally generate a script that wraps the external script in a `parfor`. For example:
{code}
from sklearn import datasets
from systemml.mllearn import GridSearchClassifierCV, SVM
iris = datasets.load_iris()
parameters = {'C':[1, 10]}
svm = SVM()
clf = GridSearchClassifierCV(svm, parameters)
clf.fit(iris.data, iris.target)
{code}
would execute the script:
{code}
CVals = matrix("1; 10", rows=2, cols=1)
parfor(i in seq(1, nrow(CVals))) {
  C = CVals[i, 1]
  # SVM script
}
{code}
This will require:
1. Functionization of the script (for example: L2SVM)
{code}
svm = function(matrix[double] X, matrix[double] Y, double icpt, double tol, double reg, double maxiter) returns (matrix[double] w) {
  if(nrow(X) < 2)
    stop("Stopping due to invalid inputs: Not possible to learn a binary class classifier without at least 2 rows")
  check_min = min(Y)
  ....
  w = t(cbind(t(w), t(extra_model_params)))
}
{code}
2. Adding two new Scala classes in the package `org.apache.sysml.api.ml`: `GridSearchClassifierCV`, which extends `Estimator[GridSearchClassifierCVModel]`, and `GridSearchClassifierCVModel`, which `extends Model[GridSearchClassifierCVModel] with BaseSystemMLClassifierModel`. You will then have to implement the abstract methods `fit` and `transform`, respectively.
3. Adding a Python class `GridSearchClassifierCV` that invokes the above classes.

For more details on steps 2 and 3, please read the design documentation of the mllearn API: https://github.com/apache/systemml/blob/master/src/main/scala/org/apache/sysml/api/ml/BaseSystemMLClassifier.scala#L42

[~dusenberrymw] maybe this can be part of https://issues.apache.org/jira/browse/SYSTEMML-1159


> Support model-selection via mllearn APIs
> ----------------------------------------
>
>                 Key: SYSTEMML-1962
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1962
>             Project: SystemML
>          Issue Type: New Feature
>            Reporter: Niketan Pansare
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
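
The script-generation step described in the issue (wrapping the training script in a `parfor` over the candidate values) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the proposed implementation: `build_parfor_script` is a hypothetical helper that does not exist in `systemml.mllearn`, it only builds a string and never calls SystemML, it writes the matrix literal with space-separated cell values, and it uses DML's `as.scalar` to read each cell as a scalar (the issue's example assigns the 1x1 slice directly).

```python
from itertools import product


def build_parfor_script(param_grid, body="# SVM script"):
    """Return a DML script string that runs `body` once per parameter combination.

    Hypothetical helper for illustration only; it is not part of systemml.mllearn.
    """
    names = sorted(param_grid)
    if not names:
        raise ValueError("param_grid must name at least one hyperparameter")
    # One row per combination, one column per hyperparameter.
    combos = list(product(*(param_grid[name] for name in names)))
    if not combos:
        raise ValueError("every hyperparameter needs at least one candidate value")
    # Assumption: cell values are written space-separated, in row-major order.
    cells = " ".join(str(value) for row in combos for value in row)
    lines = [
        'params = matrix("%s", rows=%d, cols=%d)' % (cells, len(combos), len(names)),
        "parfor(i in seq(1, nrow(params))) {",
    ]
    for col, name in enumerate(names, start=1):
        # Bind each hyperparameter of the current combination to a scalar variable.
        lines.append("  %s = as.scalar(params[i, %d])" % (name, col))
    lines.append("  " + body)
    lines.append("}")
    return "\n".join(lines)


script = build_parfor_script({"C": [1, 10]})
print(script)
```

With more than one hyperparameter, the sketch emits one matrix row per point of the Cartesian grid, so a single `parfor` covers the whole search. A real implementation would also need to collect each iteration's model and score into result variables, which this sketch leaves to the `body` placeholder.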