Hi everyone,

While experimenting with ML pipelines I've run into a significant performance regression when switching from 1.6.x to 2.x. A minimal reproducer:
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import spark.implicits._  // needed for toDF and $ outside spark-shell

val df = (1 to 40).foldLeft(
    Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
  (df, i) => df.withColumn(s"x$i", $"x0"))

val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val assembler = new VectorAssembler()
  .setInputCols(encoders.map(_.getOutputCol))

val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler

new Pipeline().setStages(stages).fit(df).transform(df).show()

Task execution times are comparable between the two versions, and the executors sit idle most of the time, so it looks like the problem is in the optimizer. Is this a known issue? Are there any changes I've missed that could lead to this behavior?

-- 
Best,
Maciej
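In case it helps with triage, here is a small timing sketch I used to separate driver-side work (planning and optimization) from executor-side execution. The `time` helper and the labels are my own, not part of any Spark API:

```scala
// Hypothetical helper, not part of Spark: wall-clock timing for a block of code.
// Wrapped around fit/transform/show it separates driver-side work (analysis,
// Catalyst optimization) from executor-side work (the actual job).
def time[T](label: String)(block: => T): T = {
  val start  = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.3f s")
  result
}

// Usage against the reproducer above (assumes `stages` and `df` are in scope):
//   val model       = time("fit")(new Pipeline().setStages(stages).fit(df))
//   val transformed = time("transform (plan only)")(model.transform(df))
//   time("show (execution)")(transformed.show())
```

If "fit" and "transform" dominate while "show" is fast, the time is being spent on the driver, which would be consistent with an optimizer/analyzer issue rather than task execution.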