Hi Maciej, FYI, the PR is at https://github.com/apache/spark/pull/16775.
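
The earlier message quoted below attributes the slowdown to RDD lineages and query plans that grow with every step unless they are cached or checkpointed. Here is a minimal sketch of that idea, not taken from the PR and not necessarily what the PR does. The helper name `addColumns`, the checkpoint interval of 10, and the temporary checkpoint directory are assumptions made for illustration. `Dataset.checkpoint()` is available from Spark 2.1 on and needs a checkpoint directory to be set first.

```scala
import java.nio.file.Files

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object LineageTruncationSketch {
  // Hypothetical helper: add `steps` copies of column "x0" one at a time,
  // checkpointing every `interval` steps so the logical plan and RDD lineage
  // are truncated instead of growing with each iteration.
  def addColumns(df: DataFrame, steps: Int, interval: Int): DataFrame =
    (1 to steps).foldLeft(df) { (acc, i) =>
      val next = acc.withColumn(s"x$i", col("x0"))
      // checkpoint() is eager by default: it materializes the data and
      // returns a Dataset whose plan no longer refers to the earlier steps.
      if (i % interval == 0) next.checkpoint() else next
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lineage-truncation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: a throwaway local directory is fine for checkpoint files here.
    spark.sparkContext.setCheckpointDir(
      Files.createTempDirectory("spark-checkpoint").toString)

    val df = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
    val widened = addColumns(df, steps = 40, interval = 10)
    widened.show()

    spark.stop()
  }
}
```

Checkpointing trades extra I/O for a shorter plan, so the interval is a tuning knob rather than a fixed recommendation.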

Liang-Chi Hsieh wrote
> Hi Maciej,
>
> Basically, the fitting algorithm in Pipeline is an iterative operation.
> Running an iterative algorithm on a Dataset produces RDD lineages and
> query plans that grow fast. Without caching and checkpointing, it gets
> slower as the number of iterations increases.
>
> I think that is why, when you run a Pipeline with long stages, it takes
> much longer to finish. As I think it is not uncommon to have long stages
> in a Pipeline, we should improve this. I will submit a PR for this.
>
> zero323 wrote
>> Hi everyone,
>>
>> While experimenting with ML pipelines I experienced a significant
>> performance regression when switching from 1.6.x to 2.x.
>>
>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>>
>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>   .setInputCol(c)
>>   .setOutputCol(s"${c}_indexed")
>>   .setHandleInvalid("skip"))
>>
>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>   .setInputCol(indexer.getOutputCol)
>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>   .setDropLast(true))
>>
>> val assembler = new VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>
>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>
>> Task execution time is comparable and executors are idle most of the
>> time, so it looks like a problem with the optimizer. Is it a known
>> issue? Are there any changes I've missed that could lead to this
>> behavior?
>>
>> --
>> Best,
>> Maciej

-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/