Re: [SQL][ML] Pipeline performance regression between 1.6 and 2.x

Liang-Chi Hsieh Thu, 02 Feb 2017 00:32:32 -0800

Thanks, Nick, for pointing that out. I totally agree.

In the 1.6 codebase, Pipeline actually uses DataFrame rather than Dataset,
because the two had not yet been merged in 1.6.


In the current code, StringIndexer and OneHotEncoder call ".rdd" on the
Dataset, which deserializes the rows.

In 1.6, because they use DataFrame, there is no extra deserialization cost.
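
A minimal sketch of the difference, assuming Spark 2.x and a local
SparkSession (the session setup, app name and variable names are illustrative
and not taken from the thread). ".rdd" hands back external Row objects, while
"queryExecution.toRdd", an internal developer API, exposes the rows in
Catalyst's internal format without that conversion:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

// Assumption: a throwaway local session, only for illustration.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rdd-deserialization-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")

// Public API: each row is converted from the internal format to a Row object.
val external: RDD[Row] = df.rdd

// Internal API: rows stay in the internal format, so no conversion step is added.
val internal: RDD[InternalRow] = df.queryExecution.toRdd

// Compare the two lineages; the first should show an extra mapping step.
println(external.toDebugString)
println(internal.toDebugString)
```

Whether this conversion accounts for the slowdown reported here is still an
open question; the sketch only shows where the extra step comes from.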

I think this could cause some regression. Since Maciej didn't show how large
the observed performance regression is, I can't judge whether this is the root
cause. But this is my initial thought after comparing the 1.6 and current
Pipeline code.
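
To put a number on the regression, something along these lines could be run
on both versions. This is only a sketch: "timed" is a hypothetical helper
that is not part of Spark or of this thread, and the summation is a stand-in
workload so the snippet runs on its own:

```scala
// Hypothetical helper: evaluates `body` once and reports wall-clock seconds.
def timed[T](label: String)(body: => T): (T, Double) = {
  val start = System.nanoTime()
  val result = body
  val seconds = (System.nanoTime() - start) / 1e9
  println(f"$label%s took $seconds%.3f s")
  (result, seconds)
}

// Stand-in workload so the snippet runs without Spark.
val (sum, elapsed) = timed("sum")((1 to 1000000).map(_.toLong).sum)
```

With the reproduction quoted below, the call to wrap would be
timed("fit")(new Pipeline().setStages(stages).fit(df)), run once on 1.6 and
once on 2.x so the two timings can be compared directly.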



Nick Pentreath wrote
> Hi Maciej
> 
> If you're seeing a regression from 1.6 -> 2.0 *both using DataFrames* then
> that seems to point to some other underlying issue as the root cause.
> 
> Even though adding checkpointing should help, we should understand why
> it's different between 1.6 and 2.0?
> 
> 
> On Thu, 2 Feb 2017 at 08:22 Liang-Chi Hsieh <viirya@> wrote:
> 
>>
>> Hi Maciej,
>>
>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>
>>
>> Liang-Chi Hsieh wrote
>> > Hi Maciej,
>> >
>> > Basically, the fitting algorithm in Pipeline is an iterative operation.
>> > Running an iterative algorithm on a Dataset produces RDD lineages and
>> > query plans that grow quickly. Without caching and checkpointing, it
>> > gets slower as the number of iterations increases.
>> >
>> > I think this is why a Pipeline with a long sequence of stages takes
>> > much longer to finish. Since I think long sequences of stages are not
>> > uncommon in a Pipeline, we should improve this. I will submit a PR for
>> > this.
>> >
>> > zero323 wrote
>> >> Hi everyone,
>> >>
>> >> While experimenting with ML pipelines I experience a significant
>> >> performance regression when switching from 1.6.x to 2.x.
>> >>
>> >> import org.apache.spark.ml.{Pipeline, PipelineStage}
>> >> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer, VectorAssembler}
>> >>
>> >> val df = (1 to 40).foldLeft(Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0"))(
>> >>   (df, i) => df.withColumn(s"x$i", $"x0"))
>> >> val indexers = df.columns.tail.map(c => new StringIndexer()
>> >>   .setInputCol(c)
>> >>   .setOutputCol(s"${c}_indexed")
>> >>   .setHandleInvalid("skip"))
>> >>
>> >> val encoders = indexers.map(indexer => new OneHotEncoder()
>> >>   .setInputCol(indexer.getOutputCol)
>> >>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>> >>   .setDropLast(true))
>> >>
>> >> val assembler = new VectorAssembler().setInputCols(encoders.map(_.getOutputCol))
>> >> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>> >>
>> >> new Pipeline().setStages(stages).fit(df).transform(df).show
>> >>
>> >> Task execution time is comparable and executors are most of the time
>> >> idle so it looks like it is a problem with the optimizer. Is it a
>> >> known issue? Are there any changes I've missed, that could lead to
>> >> this behavior?
>> >>
>> >> --
>> >> Best,
>> >> Maciej
>> >>





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 
--
View this message in context: 
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-tp20803p20825.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
