Github user shivaram commented on the pull request:

    https://github.com/apache/spark/pull/3099#issuecomment-61913282

@mengxr Thanks for putting this together! I have some high-level comments from looking at the code.

1. Could we also have constructors in addition to the getters/setters? This is the same style we use in MLlib and it will make users' code much cleaner. For example,

    ```
    val scaler = new StandardScaler()
      .setInputCol("features")
      .setOutputCol("scaledFeatures")

    val lr = new LogisticRegression()
      .setFeaturesCol("scaledFeatures")
    ```

    will become one-liners like

    ```
    val scaler = new StandardScaler(inputCol = "features", outputCol = "scaledFeatures")
    val lr = new LogisticRegression(featuresCol = "scaledFeatures")
    ```

    We will still have an empty constructor that just sets the default values that exist for Params right now. Also, FWIW, I am not very sure about wiring up setters/getters through traits like HasRegParam etc. Won't this create a lot of traits as we add more classes, each having 2-3 params?

2. Passing params across modules: the code that passes params from an Estimator to a Transformer is a little clunky right now. I think this can also be solved with the constructor interface above. For example, in LogisticRegression.scala,

    ```
    val lrm = new LogisticRegressionModel(lr.run(instances).weights)
    if (!lrm.paramMap.contains(lrm.featuresCol) && map.contains(lrm.featuresCol)) {
      lrm.setFeaturesCol(featuresCol)
    }
    ```

    can become

    ```
    val lrm = new LogisticRegressionModel(lr.run(instances).weights, featuresCol = featuresCol)
    ```

3. In addition to the SchemaRDD-based transformer API, it would be good to have a single-Row transformation API too (see the sketch at the end of this comment). If you want to reuse the same transformers for training and for online prediction, you would want to use the same Pipeline, but with one element at a time or something like that. There are some folks in the AMPLab (cc @jegonzal @dcrankshaw) who are working on model serving, and this would be useful for integrating with such systems. I can also see the single-element API being very useful for unit testing.

4. Other than the API, my biggest worry is memory management. Because most of the transformations can be lazy operations, it is very tricky to figure out when the programmer should call persist/unpersist. For example, StandardScaler.scala now has `input.cache()`, but AFAIK there is no simple way to unpersist this RDD if the Scaler is being run lazily.

Also, if you don't mind, I'll try to port some of the image processing pipelines we have to this class structure by forking your branch. I feel that'll help me figure out what features are easy to use etc.
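To make point 3 concrete, here is a rough sketch of the kind of single-Row hook I have in mind. The `RowTransformer` trait and `transformRow` method are hypothetical names I made up, not anything in this PR:

```
import org.apache.spark.sql.Row

// Hypothetical mix-in for Transformers that can also process one element
// at a time, so the same fitted Pipeline can be reused for online serving.
trait RowTransformer {
  // Synchronously transform a single Row, without going through a SchemaRDD.
  def transformRow(row: Row): Row
}
```

A Pipeline whose stages all implement something like this could expose its own `transformRow` that threads a single element through every stage, which would also make unit testing individual stages much easier.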
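And to illustrate point 4: bracketing with `cache()`/`unpersist()` only works when the consumer is eager, as in the sketch below (`fitScaler` is a made-up helper wrapping the existing MLlib StandardScaler). When `transform()` merely builds a lazy RDD, there is no equivalent point at which to call `unpersist()`:

```
import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// Eager case: fit() runs jobs over the input, so we can unpersist right after.
def fitScaler(input: RDD[Vector]): StandardScalerModel = {
  input.cache()                               // reused by the passes inside fit()
  val model = new StandardScaler().fit(input) // eager: triggers Spark jobs
  input.unpersist()                           // safe: the statistics are computed
  model
}
```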