Hi Maciej,
FYI, the PR is at https://github.com/apache/spark/pull/16775.
Liang-Chi Hsieh wrote
> Hi Maciej,
>
> Basically the fitting algorithm in Pipeline is an iterative operation.
> Running iterative algorithm on Dataset would have RDD lineages and query
> plans that grow fast. Without cache and checkpoint, it gets slower when
> the iteration number increases.
>
> I think it is why when you run a Pipeline with long stages, it gets much
> longer time to finish. As I think it is not uncommon to have long stages
> in a Pipeline, we should improve this. I will submit a PR for this.
> zero323 wrote
>> Hi everyone,
>>
>> While experimenting with ML pipelines I experience a significant
>> performance regression when switching from 1.6.x to 2.x.
>>
>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>> VectorAssembler}
>>
>> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3,
>> "baz")).toDF("id", "x0"))((df, i) => df.withColumn(s"x$i", $"x0"))
>> val indexers = df.columns.tail.map(c => new StringIndexer()
>> .setInputCol(c)
>> .setOutputCol(s"${c}_indexed")
>> .setHandleInvalid("skip"))
>>
>> val encoders = indexers.map(indexer => new OneHotEncoder()
>> .setInputCol(indexer.getOutputCol)
>> .setOutputCol(s"${indexer.getOutputCol}_encoded")
>> .setDropLast(true))
>>
>> val assembler = new
>> VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>
>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>
>> Task execution time is comparable and executors are most of the time
>> idle so it looks like it is a problem with the optimizer. Is it a known
>> issue? Are there any changes I've missed, that could lead to this
>> behavior?
>>
>> --
>> Best,
>> Maciej
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail:
>> [email protected]
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context:
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]