[ https://issues.apache.org/jira/browse/SPARK-19797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892189#comment-15892189 ]
Zhe Sun commented on SPARK-19797:
---------------------------------

Hi Sean, thanks for your quick reply.

bq. If the Pipeline had more stages, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

Let's use IDF as an example. Suppose the pipeline is:
bq. Tokenizer -> HashingTF -> IDF -> LogisticRegression

When we fit this pipeline, *IDF* first calls _fit_, then calls _transform_ and passes the IDF result to LogisticRegression, because LogisticRegression is an Estimator and its _fit_ needs the output of *IDF*'s _transform_.

However, suppose the last stage of the pipeline is a Normalizer (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer):
bq. Tokenizer -> HashingTF -> IDF -> Normalizer

When fitting this pipeline, *IDF* only calls _fit_ and does not need to call _transform_.

That's why I think it is better to correct the description to:
bq. If the Pipeline had more Estimators, it would call the LogisticRegressionModel’s transform() method on the DataFrame before passing the DataFrame to the next stage.

> ML pipelines document error
> ---------------------------
>
>                 Key: SPARK-19797
>                 URL: https://issues.apache.org/jira/browse/SPARK-19797
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Zhe Sun
>            Priority: Trivial
>              Labels: documentation
>   Original Estimate: 5m
>  Remaining Estimate: 5m
>
> Description about pipeline in this paragraph is incorrect
> https://spark.apache.org/docs/latest/ml-pipeline.html#how-it-works, which
> misleads the user
> bq. If the Pipeline had more *stages*, it would call the
> LogisticRegressionModel’s transform() method on the DataFrame before passing
> the DataFrame to the next stage.
> The description is not accurate, because *Transformer* could also be a stage.
> But only another Estimator will invoke an extra transform call.
> So, the description should be corrected as: *If the Pipeline had more
> _Estimators_*.
> The code to prove it is here
> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/Pipeline.scala#L160
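
The fitting behaviour discussed in this thread can be illustrated with a toy model in plain Python. This is only a sketch of the logic as I understand it from the linked Pipeline.scala, not Spark code: the classes Transformer and Estimator and the helpers fit_pipeline and calls are hypothetical stand-ins that just record which methods get invoked.

```python
# Toy model of the Pipeline fitting loop. NOT Spark code: every class and
# function here is a hypothetical stand-in that only logs method calls.

class Transformer:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def transform(self, data):
        self.log.append(f"{self.name}.transform")
        return data


class Estimator:
    def __init__(self, name, log):
        self.name, self.log = name, log

    def fit(self, data):
        self.log.append(f"{self.name}.fit")
        # Fitting yields a model, which is itself a Transformer.
        return Transformer(f"{self.name}Model", self.log)


def fit_pipeline(stages, data):
    # Index of the last Estimator; stages after it never touch data in fit.
    last_est = max(
        (i for i, s in enumerate(stages) if isinstance(s, Estimator)),
        default=-1,
    )
    fitted = []
    for i, stage in enumerate(stages):
        if i <= last_est:
            model = stage.fit(data) if isinstance(stage, Estimator) else stage
            if i < last_est:
                # Needed only because a later Estimator still has to be fitted.
                data = model.transform(data)
            fitted.append(model)
        else:
            fitted.append(stage)
    return fitted


def calls(stage_specs):
    """Build the stages, fit the toy pipeline, and return the call log."""
    log = []
    stages = [cls(name, log) for cls, name in stage_specs]
    fit_pipeline(stages, data=None)
    return log


prefix = [(Transformer, "Tokenizer"), (Transformer, "HashingTF"), (Estimator, "IDF")]
with_lr = calls(prefix + [(Estimator, "LogisticRegression")])
with_norm = calls(prefix + [(Transformer, "Normalizer")])

print(with_lr)    # includes "IDFModel.transform"
print(with_norm)  # ends at "IDF.fit"; no "IDFModel.transform"
```

In this sketch the extra transform call on the fitted IDF model appears only when an Estimator follows it, which is the distinction between "more stages" and "more Estimators" that the issue raises.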