GitHub user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-61913282
  
    @mengxr Thanks for putting this together! I had some high-level comments from looking over the code.
    
    1. Could we have constructors with named arguments in addition to the getters/setters? This is the same style we use in MLlib, and it would make users' code much cleaner.
    ```
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")
    val lr = new LogisticRegression()
      .setFeaturesCol("scaledFeatures")
    ```
    
    will become one-liners like
    
    ```
    val scaler = new StandardScaler(inputCol = "features", outputCol = "scaledFeatures")
    val lr = new LogisticRegression(featuresCol = "scaledFeatures")
    ```
    
    We would also still have an empty constructor that just sets the default values that exist for Params right now.
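
    A minimal sketch of the combined shape (param names here are illustrative, not the PR's actual code):

    ```
    // Sketch only: named constructor params with defaults, plus the existing
    // setter style on top.
    class StandardScaler(
        var inputCol: String = "input",
        var outputCol: String = "output") {

      // new StandardScaler() already covers the "empty constructor" case
      // via the default values above.
      def setInputCol(value: String): this.type = { inputCol = value; this }
      def setOutputCol(value: String): this.type = { outputCol = value; this }
    }

    // Both call styles then work:
    val a = new StandardScaler(inputCol = "features", outputCol = "scaledFeatures")
    val b = new StandardScaler().setInputCol("features").setOutputCol("scaledFeatures")
    ```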
    
    Also, FWIW, I am not so sure about having the setters/getters come from traits like HasRegParam etc. Won't this create a lot of traits as we add more classes, each having 2-3 params?
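
    To make that concrete, the trait-per-param pattern I mean looks roughly like this (signatures are my guess, not the PR's exact code):

    ```
    // Rough sketch: one small trait per shared param, mixed into every
    // estimator that needs it.
    trait HasRegParam {
      protected var regParam: Double = 0.0
      def setRegParam(value: Double): this.type = { regParam = value; this }
      def getRegParam: Double = regParam
    }

    trait HasMaxIter {
      protected var maxIter: Int = 100
      def setMaxIter(value: Int): this.type = { maxIter = value; this }
      def getMaxIter: Int = maxIter
    }

    // Each new estimator mixes in a handful of these, so the trait count
    // grows with the number of distinct params across the library.
    class LogisticRegression extends HasRegParam with HasMaxIter
    ```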
    
    2. Passing params across modules: The code to pass params from Estimator to 
Transformer is a little clunky right now. I think this can also be solved with 
the constructor interface above.
    
    For example, in LogisticRegression.scala we could have
    
    ```
    val lrm = new LogisticRegressionModel(lr.run(instances).weights)
    if (!lrm.paramMap.contains(lrm.featuresCol) && map.contains(lrm.featuresCol)) {
      lrm.setFeaturesCol(featuresCol)
    }
    ```
    can become
    ```
    val lrm = new LogisticRegressionModel(lr.run(instances).weights, featuresCol = featuresCol)
    ```
    
    3. In addition to the SchemaRDD-based transformer API, it would be good to have a single-Row transformation API too. If you want to reuse the same transformers for training and for online prediction, you would want to run the same Pipeline, but one element at a time. There are some folks in the AMPLab (cc @jegonzal @dcrankshaw) who are working on model serving, and this would be useful for integration with such systems. I can also see the single-element API being very useful for unit testing. A rough sketch of what I mean follows.
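
    The single-Row method name below (`transformRow`) is made up for illustration:

    ```
    import org.apache.spark.sql.{Row, SchemaRDD}

    // Hypothetical single-element path next to the existing batch path.
    trait RowTransformer {
      def transform(dataset: SchemaRDD): SchemaRDD // existing SchemaRDD API
      def transformRow(row: Row): Row              // proposed one-at-a-time API
    }

    // A serving system (or a unit test) could then score one record at a
    // time through a fitted pipeline:
    // val scored = pipelineModel.transformRow(incomingRow)
    ```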
    
    4. Other than the API, the biggest worry I have is memory management. Because most of the transformations can be lazy operations, it is very tricky to figure out when the programmer should call persist/unpersist. For example, in StandardScaler.scala we now have `input.cache()`, but AFAIK there is no simple way to unpersist this RDD if the Scaler is being run lazily.
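
    A simplified sketch of the timing problem (not the actual StandardScaler code):

    ```
    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.stat.Statistics
    import org.apache.spark.rdd.RDD

    def fitSketch(dataset: RDD[Vector]): Unit = {
      val input = dataset.map(identity) // lazy: nothing has executed yet
      input.cache()                     // only marks the RDD for caching
      val summary = Statistics.colStats(input) // first action fills the cache
      // Unpersisting is only safe once no later, still-lazy stage will
      // recompute through `input`; with lazy pipelines there is no obvious
      // point at which we know that.
      // input.unpersist()
    }
    ```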
    
    Also, if you don't mind, I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out which features are easy to use, etc.

