GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3099
[WIP][SPARK-3530][MLLIB] pipeline and parameters

This is a WIP, so please do not spend time on individual lines. This PR adds the package `org.apache.spark.ml` with pipeline and parameters, as discussed on the JIRA. This is joint work with @jkbradley @etrain @shivaram and many others who helped with the design, with additional help from @marmbrus and @liancheng on the Spark SQL side. The design doc can be found at: https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o/edit?usp=sharing

Again, though it compiles, this is a WIP. So please do not spend time on individual lines.

The new set of APIs, inspired by the MLI project from AMPLab and scikit-learn, leverages Spark SQL's schema support and execution plan optimization. It introduces the following components that help build a practical pipeline:

1. Transformer, which transforms one dataset into another
2. Estimator, which fits models to data, where models are transformers
3. Evaluator, which evaluates model output and returns a scalar metric
4. Pipeline, a simple pipeline that consists of transformers and estimators

Parameters can be supplied at fit/transform time or embedded in the component itself:

1. Param: a strongly typed parameter key with self-contained doc
2. ParamMap: a param -> value map
3. Params: trait for components with parameters
4. OwnParamMap: trait for components with an embedded parameter map

(A rough sketch of how these abstractions fit together appears after the list of missing items below.)

For any component that implements `Params`, users can easily check the docs by calling `explainParams`:

~~~
> val lr = new LogisticRegression
> lr.explainParams
maxIter: max number of iterations (default: 100)
regParam: regularization constant (default: 0.1)
labelCol: label column name (default: label)
featuresCol: features column name (default: features)
~~~

or check an individual param:

~~~
> lr.maxIter
maxIter: max number of iterations (default: 100)
~~~

**Please start with the example code in `LogisticRegressionSuite.scala`, where I put three examples:**

1. Run a simple logistic regression job:

~~~
val lr = new LogisticRegression
lr.set(lr.maxIter, 10)
  .set(lr.regParam, 1.0)
val model = lr.fit(dataset)
model.transform(dataset, model.threshold -> 0.8) // overwrite threshold
  .select('label, 'score, 'prediction).collect()
  .foreach(println)
~~~

2. Run logistic regression with cross-validation and grid search, using areaUnderROC as the metric:

~~~
val lr = new LogisticRegression
val cv = new CrossValidator
val eval = new BinaryClassificationEvaluator
val lrParamMaps = new ParamGridBuilder()
  .addMulti(lr.regParam, Array(0.1, 100.0))
  .addMulti(lr.maxIter, Array(0, 5))
  .build()
cv.set(cv.estimator, lr.asInstanceOf[Estimator[_]])
  .set(cv.estimatorParamMaps, lrParamMaps)
  .set(cv.evaluator, eval)
  .set(cv.numFolds, 3)
val bestModel = cv.fit(dataset)
~~~

3. Run a pipeline consisting of a standard scaler and a logistic regression component:

~~~
val scaler = new StandardScaler
scaler
  .set(scaler.inputCol, "features")
  .set(scaler.outputCol, "scaledFeatures")
val lr = new LogisticRegression
lr.set(lr.featuresCol, "scaledFeatures")
val pipeline = new Pipeline
val model = pipeline.fit(dataset, pipeline.stages -> Array(scaler, lr))
val predictions = model.transform(dataset)
  .select('label, 'score, 'prediction)
  .collect()
predictions.foreach(println)
~~~

**What is missing now and will be added soon:**

1. Runtime check of schemas, so that before we touch the data we go through the schema and make sure column names and types match the input parameters.
2. Java examples.
3. Serialization.
4. Storing training parameters in trained models.
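For readers who want a quick mental model of how the components above relate before diving into the patch, here is a rough, simplified sketch (Pipeline and OwnParamMap omitted for brevity). All names, signatures, and behaviors in this sketch are assumptions based on the description above, not the actual code in this PR:

~~~
// Rough sketch only -- names and signatures are assumptions, not the PR's API.
import org.apache.spark.sql.SchemaRDD

// A strongly typed parameter key with self-contained documentation.
case class Param[T](name: String, doc: String, default: Option[T] = None)

// A param -> value map, falling back to each param's default when unset.
class ParamMap {
  private val map = scala.collection.mutable.Map.empty[Param[_], Any]
  def put[T](param: Param[T], value: T): this.type = { map(param) = value; this }
  def get[T](param: Param[T]): T =
    map.getOrElse(param, param.default.get).asInstanceOf[T]
}

// Trait for components that carry parameters and can explain them.
trait Params {
  def params: Seq[Param[_]]
  def explainParams(): String =
    params.map(p => s"${p.name}: ${p.doc} (default: ${p.default.getOrElse("none")})")
      .mkString("\n")
}

// Transforms one dataset into another.
abstract class Transformer extends Params {
  def transform(dataset: SchemaRDD, paramMap: ParamMap): SchemaRDD
}

// Fits models to data; a fitted model is itself a Transformer.
abstract class Estimator[M <: Transformer] extends Params {
  def fit(dataset: SchemaRDD, paramMap: ParamMap): M
  // Multi-model training: fitting one model per param map is the naive
  // fallback; an implementation is free to share work across param maps.
  def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Seq[M] =
    paramMaps.toList.map(m => fit(dataset, m))
}

// Evaluates model output and returns a scalar metric.
abstract class Evaluator extends Params {
  def evaluate(dataset: SchemaRDD, paramMap: ParamMap): Double
}
~~~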
**What I'm not very confident about and definitely need feedback on:**

The usage of parameters is a little messy. The interface I do want to keep is the one that supports multi-model training:

~~~
def fit(dataset: SchemaRDD, paramMaps: Array[ParamMap]): Array[Model]
~~~

which leaves room for the algorithm to optimize the training phase. This is different from scikit-learn's design, where `fit` returns the estimator itself. But for large-scale datasets, we do want to train in parallel and return multiple models.

Now there are two places you can specify parameters:

1. using the embedded paramMap:

~~~
lr.set(lr.maxIter, 50)
~~~

2. at fit/transform time:

~~~
lr.fit(dataset, lr.maxIter -> 50)
~~~

The latter overwrites the former. I can change the former to a builder pattern:

~~~
lr.setMaxIter(50)
~~~

which may look better. But there may be better solutions.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mengxr/spark SPARK-3530

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/3099.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #3099

----

commit 376db0add8ec05252a1e8ffc2f622f3dfab1b0cd
Author: Xiangrui Meng <m...@databricks.com>
Date: 2014-11-05T01:11:25Z

    pipeline and parameters

----