Hi everyone,

While experimenting with ML pipelines I ran into a significant
performance regression when switching from Spark 1.6.x to 2.x.

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
import spark.implicits._  // for toDF and $ outside spark-shell

val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz"))
  .toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
val indexers = df.columns.tail.map(c => new StringIndexer()
  .setInputCol(c)
  .setOutputCol(s"${c}_indexed")
  .setHandleInvalid("skip"))

val encoders = indexers.map(indexer => new OneHotEncoder()
  .setInputCol(indexer.getOutputCol)
  .setOutputCol(s"${indexer.getOutputCol}_encoded")
  .setDropLast(true))

val assembler = new VectorAssembler()
  .setInputCols(encoders.map(_.getOutputCol))
val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler

new Pipeline().setStages(stages).fit(df).transform(df).show

Task execution time is comparable between versions, and the executors
are idle most of the time, so it looks like a problem on the driver
side, in the optimizer. Is this a known issue? Are there any changes
I've missed that could lead to this behavior?
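For reference, here is a rough sketch of how I'd separate driver-side
planning cost from actual execution cost (assumes `df` and `stages`
from the snippet above are in scope; the timing helper is just
`System.nanoTime`, nothing Spark-specific):

```scala
val out = new Pipeline().setStages(stages).fit(df).transform(df)

// Forcing the optimized plan runs analysis + optimization on the
// driver without launching any job:
val t0 = System.nanoTime()
out.queryExecution.optimizedPlan
println(s"planning:  ${(System.nanoTime() - t0) / 1e9} s")

// Executing the plan without collecting results to the driver:
val t1 = System.nanoTime()
out.foreach(_ => ())
println(s"execution: ${(System.nanoTime() - t1) / 1e9} s")
```

On my data the first number dominates, which is why I suspect the
optimizer rather than the tasks themselves.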

-- 
Best,
Maciej


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
