GitHub user mengxr commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-62196531
  
    > @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today; the code for this is at https://github.com/shivaram/spark-ml/blob/master/src/main/scala/ml/MnistRandomFFTPipeline.scala.
    
    Thanks for testing with a real pipeline!
    
    > Note that I tried to write this as an application layer on top of Spark 
as I think our primary goal is to enable custom pipelines, not just those using 
our provided components.
    
    Exactly.
    
    > But as the SQL component itself is pretty new, I think my skill set may be representative of the average Spark application developer.
    
    We are still learning Spark SQL ourselves. Note that this component will be marked as alpha.
    
    > Because we are using SchemaRDDs directly, I found that there are 5-10 lines of boilerplate code in every transform or fit function. This is usually selecting one or more input columns, getting out an RDD which can be processed, and then adding an output column at the end. It would be good if we could have wrappers which do this automatically (I have one proposal below).
    
    The Scala DSL is new. @liancheng helped add the implicit `.call` method to UDFs. I think once we fully understand the operations we need, we can work with @marmbrus to add those wrappers.
    
    > The UDF interface (Row => Row) feels pretty restrictive. For numerical computations I often want to do things like mapPartitions or use broadcast variables, etc. In those cases I am not sure how to directly use UDFs.
    
    We are not restricted to UDFs. As you commented, we can create a new RDD and zip it back, as in the sketch below.
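    A minimal sketch of the extract-compute-zip pattern (the row handling here is illustrative: `rows` stands in for the rows of a SchemaRDD, and the dot product for whatever partition-level computation you need):

    ~~~
    import org.apache.spark.broadcast.Broadcast
    import org.apache.spark.rdd.RDD

    def transformWithBroadcast(
        rows: RDD[Seq[Any]],          // original rows, one Seq per Row
        features: RDD[Array[Double]], // the input column extracted from the rows
        weights: Broadcast[Array[Double]]): RDD[Seq[Any]] = {
      // mapPartitions and broadcast variables are available here, unlike in a UDF
      val outputs: RDD[Double] = features.mapPartitions { iter =>
        val w = weights.value
        iter.map(v => v.zip(w).map { case (x, wi) => x * wi }.sum) // dot product
      }
      // zip assumes both RDDs kept the same partitioning and row order
      rows.zip(outputs).map { case (row, out) => row :+ out }
    }
    ~~~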
    
    > And the main reason I used UDFs was to handle the semantics of appending an output column. Is there any API other than using `select(Star(None), ...)`? It'd be great if we had something like `dataset.zip()` which is easier / more flexible.
    
    I think the only boilerplate code here is `Star(None)`. The input columns 
and output column are required. I'm sure it is easy to add operations like
    
    ~~~
    dataset.appendCol(predict('features) as 'prediction)
    ~~~
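    A minimal sketch of such a wrapper, assuming the catalyst DSL (`RichSchemaRDD` is a hypothetical name, and the exact imports depend on the Spark SQL version):

    ~~~
    import org.apache.spark.sql.SchemaRDD
    import org.apache.spark.sql.catalyst.analysis.Star
    import org.apache.spark.sql.catalyst.expressions.NamedExpression

    implicit class RichSchemaRDD(dataset: SchemaRDD) {
      // select all existing columns plus the new named expression
      def appendCol(col: NamedExpression): SchemaRDD = dataset.select(Star(None), col)
    }
    ~~~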
    
    > One proposal I had for simplifying things is to define another 
transformer API which is less general but easier to use (we can keep the 
existing ones as base classes).
    
    I like this idea. We will also keep the `predict` and `predictValues` methods that operate on normal RDDs. Note that the main focus of this PR is the user-facing API: how users create pipelines and specify parameters. @jkbradley is working on the developer-side APIs.
    
    Pipelines API
    
    > Multiple batches of transformers -- One of the main problems I ran into 
was that I couldn't see an easy way to have loops inside a pipeline.
    
    We can create a transformer like the following:
    
    ~~~
    val featurePipeline = ... // the feature pipeline you created
    val par = new ParallelPipeline()
      .setPipeline(featurePipeline)
      .setParamMaps(Array(
        ParamMap(softMax.outputCol -> "f0"),
        ParamMap(softMax.outputCol -> "f1"),
        ParamMap(softMax.outputCol -> "f2")))
    val fvAssembler = new FeatureVectorAssembler()
      .setInputCols(Array("f0", "f1", "f2"))
    val pipeline = new Pipeline()
      .setStages(Array(par, fvAssembler, linearSolver))
    ~~~
    
    > Passing parameters in Estimators -- I finally understood the problem you were describing originally! However, I solved it slightly differently than having 2 sets of parameters (this is in MultiClassLinearRegressionEstimator.scala). I made the parent Estimator class take in as params all the parameters required for the child Transformer class as well, and then passed them along. I thought this was cleaner than having modelParams and params as two different members.
    
    Yes, this is cleaner. So we separate the Estimator from the output Model, but both share the same set of params. I like this idea, and then we don't need `trainingParams` :)
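    As a rough sketch of the shared-params idea (the names below are illustrative, not this PR's actual API):

    ~~~
    // The estimator and the model it produces mix in the same params trait,
    // so there is no separate `trainingParams`.
    trait HasRegParam {
      var regParam: Double = 0.1
      def setRegParam(value: Double): this.type = { regParam = value; this }
    }

    class LinearRegressionModel(val weights: Array[Double]) extends HasRegParam {
      def predict(features: Array[Double]): Double =
        features.zip(weights).map { case (x, w) => x * w }.sum
    }

    class LinearRegression extends HasRegParam {
      def fit(data: Seq[(Array[Double], Double)]): LinearRegressionModel = {
        // ... training elided; the shared params carry over to the model
        val model = new LinearRegressionModel(Array.fill(data.head._1.length)(0.0))
        model.setRegParam(regParam)
      }
    }
    ~~~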
    
    > Parameters vs. Constructors -- While I agree with your comment that constructor arguments make binary compatibility tricky, I ran into a couple of cases where creating a transformer without a particular argument doesn't make sense. Will we have a guideline that things which won't vary should be constructor arguments? I think it'll be good to come up with some distinction that makes it clear for the programmer.
    
    Could you list a few examples? Having many parameters is common for ML components, and I feel it is hard to decide which parameters won't vary. My understanding is that `LogisticRegression` is just a placeholder with parameters to be configured, and validation will happen at runtime (not added yet).
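    For example, a rough sketch of the "placeholder" view (`maxIter` and `validate` are illustrative names):

    ~~~
    class LogisticRegression {
      private var maxIter: Int = 100
      def setMaxIter(value: Int): this.type = { maxIter = value; this }

      // validation happens when the estimator runs, not in a constructor
      private def validate(): Unit =
        require(maxIter > 0, s"maxIter must be positive but got $maxIter")

      def fit(/* dataset */): Unit = {
        validate()
        // ... training elided
      }
    }
    ~~~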
    
    > Chaining evaluators to a Pipeline -- I wrote a simple evaluator that computed test error at the end but wasn't sure how to chain it to a pipeline. Is this something we don't intend to support?
    
    You can wrap it with a model selector, like the `CrossValidator`, or maybe just a train/validation split.
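    For example, something along these lines (the evaluator and the param grid are placeholders):

    ~~~
    val cv = new CrossValidator()
      .setEstimator(pipeline)       // the pipeline ending in your model
      .setEvaluator(yourEvaluator)  // e.g. the test-error evaluator you wrote
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)
    val cvModel = cv.fit(dataset)
    ~~~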
    
    > Strings, Vectors etc. -- There were a couple of minor things that weren't ideal, but that I think we will have to live with. With every node having an input and output column, there are lots of strings floating around which the user needs to get right. Also, going from Spark's Vector / Matrix classes to Breeze and back to Spark classes gets tedious. We could add commonly used operators like add, multiply, etc. to our public APIs, and that might help a bit.
    
    1. If you are only dealing with one features column, you can keep "features" as the output column name (replacing the input column).
    2. If you are dealing with multiple columns, we have to deal with column names. We can avoid having strings floating around by reusing the getters:
    ~~~
    val randomSignNode = RandomSignNode.create(d, randomSignSource)
      .setInputCol("features")
      .setOutputCol("randomFeatures")

    val fftTransform = new FFTransform()
      .setInputCol(randomSignNode.getOutputCol) // changed this line: reuse the previous node's column name
      .setOutputCol("fftFeatures")
    ~~~
    3. Another solution is to let the Pipeline generate the input and output column names; see the sketch below.
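    A rough sketch of that idea (`HasIOCol` and `wireColumns` are hypothetical names, not part of this PR):

    ~~~
    trait HasIOCol {
      var inputCol: String = _
      var outputCol: String = _
    }

    // Generate intermediate column names so the user never writes the strings:
    // each stage reads the previous stage's output column.
    def wireColumns(stages: Seq[HasIOCol], firstInputCol: String): Unit = {
      var current = firstInputCol
      for ((stage, i) <- stages.zipWithIndex) {
        stage.inputCol = current
        stage.outputCol = s"gen_col_$i"
        current = stage.outputCol
      }
    }
    ~~~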

