Hi Maciej,

FYI, the PR is at https://github.com/apache/spark/pull/16775.
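
To illustrate the lineage growth described in the quoted message below, here is a minimal sketch. It is not necessarily what the PR does. It assumes Spark 2.1 or later (for `Dataset.checkpoint()`), a local master, and a writable scratch directory; the object name, the checkpoint path, and the `checkpointEvery` interval are hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object LineageTruncationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("lineage-truncation-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumed scratch location; any writable directory works.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    val base = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")

    // Hypothetical interval: how many iterations to run between checkpoints.
    val checkpointEvery = 10

    // Each iteration stacks another projection on top of the previous plan,
    // so the plan grows with the iteration count. An eager checkpoint
    // materializes the data and returns a Dataset whose plan starts over
    // from the checkpointed data.
    val result = (1 to 40).foldLeft(base) { (df, i) =>
      val next = df.withColumn(s"x$i", col("x0"))
      if (i % checkpointEvery == 0) next.checkpoint() else next
    }

    result.show()
    spark.stop()
  }
}
```

Checkpointing writes data to disk, so the interval is a trade-off between plan size and extra I/O.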


Liang-Chi Hsieh wrote
> Hi Maciej,
> 
> Basically, the fitting algorithm in Pipeline is an iterative operation.
> Running an iterative algorithm on a Dataset produces RDD lineages and
> query plans that grow quickly. Without caching and checkpointing, each
> iteration gets slower as the iteration count increases.
> 
> I think that is why a Pipeline with many stages takes much longer to
> finish. Since it is not uncommon to have many stages in a Pipeline, we
> should improve this. I will submit a PR for this.
> 
> zero323 wrote
>> Hi everyone,
>> 
>> While experimenting with ML pipelines, I noticed a significant
>> performance regression when switching from 1.6.x to 2.x.
>> 
>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>> 
>> val df = (1 to 40).foldLeft(
>>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
>> )((df, i) => df.withColumn(s"x$i", $"x0"))
>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>   .setInputCol(c)
>>   .setOutputCol(s"${c}_indexed")
>>   .setHandleInvalid("skip"))
>> 
>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>   .setInputCol(indexer.getOutputCol)
>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>   .setDropLast(true))
>> 
>> val assembler = new VectorAssembler()
>>   .setInputCols(encoders.map(_.getOutputCol))
>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>> 
>> new Pipeline().setStages(stages).fit(df).transform(df).show
>> 
>> Task execution time is comparable, and the executors are idle most of
>> the time, so it looks like a problem with the optimizer. Is this a
>> known issue? Are there any changes I've missed that could lead to this
>> behavior?
>> 
>> -- 
>> Best,
>> Maciej
>> 
>> 

-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20822.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
