GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3099
[WIP][SPARK-3530][MLLIB] pipeline and parameters

This is a WIP, so please do not spend time on individual lines. This PR adds the package `org.apache.spark.ml` with pipeline and parameters, as discussed on the JIRA. This is joint work with @jkbradley @etrain @shivaram and many others who helped with the design, with additional help from @marmbrus and @liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

Again, though it compiles, this is a WIP. So please do not spend time on individual lines.

The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms one dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters can be supplied at fit/transform time or embedded in the component itself:

1. Param: a strongly typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
4. OwnParamMap: trait for components with an embedded parameter map

(A rough sketch of how these abstractions fit together appears after the list of missing items below.)

For any component that implements `Params`, users can easily check the docs by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in `LogisticRegressionSuite.scala`, where I put three examples:**

1. Run a simple logistic regression job:

~~~
val lr = new LogisticRegression
lr.set(lr.maxIter, 10)
  .set(lr.regParam, 1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. Run logistic regression with cross-validation and grid search, using areaUnderROC as the metric:

~~~
val lr = new LogisticRegression
val cv = new CrossValidator
val eval = new BinaryClassificationEvaluator
val lrParamMaps = new ParamGridBuilder()
  .addMulti(lr.regParam, Array(0.1, 100.0))
  .addMulti(lr.maxIter, Array(0, 5))
  .build()
cv.set(cv.estimator, lr.asInstanceOf[Estimator[_]])
  .set(cv.estimatorParamMaps, lrParamMaps)
  .set(cv.evaluator, eval)
  .set(cv.numFolds, 3)
val bestModel = cv.fit(dataset)
~~~

3. Run a pipeline consisting of a standard scaler and a logistic regression component:

~~~
val scaler = new StandardScaler
scaler
  .set(scaler.inputCol, "features")
  .set(scaler.outputCol, "scaledFeatures")
val lr = new LogisticRegression
lr.set(lr.featuresCol, "scaledFeatures")
val pipeline = new Pipeline
val model = pipeline.fit(dataset, pipeline.stages -> Array(scaler, lr))
val predictions = model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
predictions.foreach(println)
~~~

**What is missing now and will be added soon:**

1. Runtime check of schemas, so that before we touch the data we go through the schema and make sure column names and types match the input parameters.
2. Java examples.
3. Serialization.
4. Storing training parameters in trained models.
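For readers who want a quick mental model of how the components above relate before diving into the patch, here is a rough, simplified sketch (Pipeline and OwnParamMap omitted for brevity). All names, signatures, and behaviors in this sketch are assumptions based on the description above, not the actual code in this PR:

~~~
// Rough sketch only -- names and signatures are assumptions, not the PR's API.
import org.apache.spark.sql.SchemaRDD

// A strongly typed parameter key with self-contained documentation.
case class Param[T](name: String, doc: String, default: Option[T] = None)

// A param -> value map, falling back to each param's default when unset.
class ParamMap {
  private val map = scala.collection.mutable.Map.empty[Param[_], Any]
  def put[T](param: Param[T], value: T): this.type = { map(param) = value; this }
  def get[T](param: Param[T]): T =
    map.getOrElse(param, param.default.get).asInstanceOf[T]
}

// Trait for components that carry parameters and can explain them.
trait Params {
  def params: Seq[Param[_]]
  def explainParams(): String =
    params.map(p => s"${p.name}: ${p.doc} (default: ${p.default.getOrElse("none")})")
      .mkString("\n")
}

// Transforms one dataset into another.
abstract class Transformer extends Params {
  def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD
}

// Fits models to data; a fitted model is itself a Transformer.
abstract class Estimator[M <: Transformer] extends Params {
  def fit(dataset: SchemaRDD, paramMap: ParamMap): M
  // Multi-model training: fitting one model per param map is the naive
  // fallback; an implementation is free to share work across param maps.
  def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] =
    paramMaps.toList.map(m => fit(dataset, m))
}

// Evaluates model output and returns a scalar metric.
abstract class Evaluator extends Params {
  def evaluate(dataset: SchemaRDD, paramMap: ParamMap): Double
}
~~~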
**What I'm not very confident about and definitely need feedback on:**

The usage of parameters is a little messy. The interface I do want to keep is the one that supports multi-model training:

~~~
def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Array[Model]
~~~

which leaves room for the algorithm to optimize the training phase. This is different from scikit-learn's design, where `fit` returns the estimator itself. But for large-scale datasets, we do want to train in parallel and return multiple models.

Now there are two places you can specify parameters:

1. using the embedded paramMap:

~~~
lr.set(lr.maxIter, 50)
~~~

2. at fit/transform time:

~~~
lr.fit(dataset, lr.maxIter -> 50)
~~~

The latter overwrites the former. I can change the former to a builder pattern:

~~~
lr.setMaxIter(50)
~~~

which may look better. But there may be better solutions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-3530

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3099.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3099

----

commit 376db0add8ec05252a1e8ffc2f622f3dfab1b0cd
Author: Xiangrui Meng <m...@databricks.com>
Date: 2014-11-05T01:11:25Z

    pipeline and parameters

----