Hi Maciej,

FYI, the fix has been submitted at https://github.com/apache/spark/pull/16785.
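Until that is merged, the growth can also be worked around from user code
by cutting the plan lineage between iterations with cache/checkpoint, as
noted further down the thread. A minimal sketch, assuming Spark 2.1+ (for
the eager Dataset.checkpoint) and a SparkSession named `spark`;
truncateLineage is just an illustrative name, not anything from the PR:

    import org.apache.spark.sql.DataFrame

    // Checkpointing needs somewhere to materialize the data.
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

    // Illustrative helper: materialize df and return a DataFrame whose
    // plan is a plain scan over the checkpointed data, so the optimizer
    // no longer re-analyzes the whole accumulated plan on every fit.
    def truncateLineage(df: DataFrame): DataFrame =
      df.checkpoint()

On versions without Dataset.checkpoint, rebuilding the DataFrame from its
RDD has a similar lineage-cutting effect:
spark.createDataFrame(df.rdd, df.schema).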
Liang-Chi Hsieh wrote:
> Hi Maciej,
>
> After looking into the details of the time spent on preparing the
> executed plan, the cause of the significant difference between 1.6 and
> the current codebase when running the example is the optimization step
> that generates constraints.
>
> A few of the operations involved in generating constraints are not
> optimized. Combined with the fact that the query plan keeps growing,
> the time spent on generating constraints increases more and more.
>
> I am trying to reduce the time cost. It cannot get as low as 1.6,
> because we can't remove constraint generation altogether, but it is
> significantly lower than the current codebase (74294 ms -> 2573 ms).
> Per-iteration times with the patch (ms):
>
> 385, 107, 46, 58, 64, 105, 86, 122, 115, 114, 100, 109, 169, 196,
> 174, 212, 290, 254, 318, 405, 347, 443, 432, 500, 544, 619, 697, 683,
> 807, 802, 960, 1010, 1155, 1251, 1298, 1388, 1503, 1613, 2279, 2349,
> 2573
>
> Liang-Chi Hsieh wrote:
>> Hi Maciej,
>>
>> Thanks for the info you provided.
>>
>> I ran the same example on 1.6 and on the current branch and recorded
>> the time spent on preparing the executed plan in each iteration.
>>
>> Current branch (ms):
>>
>> 292, 95, 57, 34, 128, 120, 63, 106, 179, 159, 235, 260, 334, 464,
>> 547, 719, 942, 1130, 1928, 1751, 2159, 2767, 3333, 4175, 5106, 6269,
>> 7683, 9210, 10931, 13237, 15651, 19222, 23841, 26135, 31299, 38437,
>> 47392, 51420, 60285, 69840, 74294
>>
>> 1.6 (ms):
>>
>> 3, 4, 10, 4, 17, 8, 12, 21, 15, 15, 19, 23, 28, 28, 58, 39, 43, 61,
>> 56, 60, 81, 73, 100, 91, 96, 116, 111, 140, 127, 142, 148, 165, 171,
>> 198, 200, 233, 237, 253, 256, 271, 292, 452
>>
>> Both take more time after each iteration because the query plan keeps
>> growing, but the current branch clearly takes far more time than the
>> 1.6 branch. The optimizer and query planning on the current branch
>> are much more complicated than in 1.6.
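For reference, the cost measured above is purely driver-side: the time to
turn a DataFrame's logical plan into an executed plan, before any job
runs. One way to reproduce this kind of number (a sketch, not the exact
harness used for the timings above; timePlanning is an illustrative name):

    import org.apache.spark.sql.DataFrame

    def timePlanning(df: DataFrame): Long = {
      val start = System.nanoTime()
      // Forces analysis, optimization and physical planning. Note that
      // executedPlan is a lazy val, so a second call on the same
      // DataFrame returns immediately.
      df.queryExecution.executedPlan
      (System.nanoTime() - start) / 1000000 // milliseconds
    }

Calling println(s"${timePlanning(df)} ms") after each added stage produces
the same growth pattern as the lists above.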
>> zero323 wrote:
>>> Hi Liang-Chi,
>>>
>>> Thank you for your answer and the PR, but I think I wasn't specific
>>> enough. In hindsight, I should have illustrated this better. What
>>> really troubles me here is the pattern of growing delays. Compare
>>> 1.6.3 (roughly 20s of runtime since the first job):
>>>
>>> [image: 1.6 timeline]
>>>
>>> vs 2.1.0 (45 minutes or so in a bad case):
>>>
>>> [image: 2.1.0 timeline]
>>>
>>> The code is just an example and it is intentionally dumb. You can
>>> easily mask this with caching, or by using significantly larger data
>>> sets. So the question I am really interested in is: what changed
>>> between 1.6.3 and 2.x (the behavior is more or less consistent
>>> across 2.0, 2.1 and current master) to cause this, and more
>>> importantly, is it a feature or is it a bug? I admit I chose a lazy
>>> path here and haven't (yet) spent much time trying to dig deeper.
>>>
>>> I can see somewhat higher memory usage and somewhat more intensive
>>> GC activity, but nothing I would really blame for this behavior, and
>>> the duration of individual jobs is comparable, slightly in favor of
>>> 2.x. Neither StringIndexer nor OneHotEncoder changed much in 2.x.
>>> They used RDDs for fitting in 1.6 and, as far as I can tell, they
>>> still do in 2.x. And the problem doesn't look related to the data
>>> processing part in the first place.
>>>
>>> On 02/02/2017 07:22 AM, Liang-Chi Hsieh wrote:
>>>> Hi Maciej,
>>>>
>>>> FYI, the PR is at https://github.com/apache/spark/pull/16775.
>>>>
>>>> Liang-Chi Hsieh wrote:
>>>>> Hi Maciej,
>>>>>
>>>>> Basically, the fitting algorithm in Pipeline is an iterative
>>>>> operation. Running an iterative algorithm on a Dataset produces
>>>>> RDD lineages and query plans that grow fast. Without cache and
>>>>> checkpoint, it gets slower as the number of iterations increases.
>>>>>
>>>>> I think that is why a Pipeline with a long chain of stages takes
>>>>> much longer to finish. Since it is not uncommon to have long
>>>>> chains of stages in a Pipeline, we should improve this. I will
>>>>> submit a PR for it.
>>>>>
>>>>> zero323 wrote:
>>>>>> Hi everyone,
>>>>>>
>>>>>> While experimenting with ML pipelines I am experiencing a
>>>>>> significant performance regression when switching from 1.6.x to
>>>>>> 2.x:
>>>>>>
>>>>>> import org.apache.spark.ml.{Pipeline, PipelineStage}
>>>>>> import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer,
>>>>>>   VectorAssembler}
>>>>>>
>>>>>> val df = (1 to 40).foldLeft(
>>>>>>   Seq((1, "foo"), (2, "bar"), (3, "baz")).toDF("id", "x0")
>>>>>> )((df, i) => df.withColumn(s"x$i", $"x0"))
>>>>>>
>>>>>> val indexers = df.columns.tail.map(c => new StringIndexer()
>>>>>>   .setInputCol(c)
>>>>>>   .setOutputCol(s"${c}_indexed")
>>>>>>   .setHandleInvalid("skip"))
>>>>>>
>>>>>> val encoders = indexers.map(indexer => new OneHotEncoder()
>>>>>>   .setInputCol(indexer.getOutputCol)
>>>>>>   .setOutputCol(s"${indexer.getOutputCol}_encoded")
>>>>>>   .setDropLast(true))
>>>>>>
>>>>>> val assembler = new VectorAssembler()
>>>>>>   .setInputCols(encoders.map(_.getOutputCol))
>>>>>> val stages: Array[PipelineStage] = indexers ++ encoders :+ assembler
>>>>>>
>>>>>> new Pipeline().setStages(stages).fit(df).transform(df).show
>>>>>>
>>>>>> Task execution time is comparable and the executors are idle most
>>>>>> of the time, so it looks like a problem with the optimizer. Is
>>>>>> this a known issue? Are there any changes I've missed that could
>>>>>> lead to this behavior?
>>>>>>
>>>>>> --
>>>>>> Best,
>>>>>> Maciej
>>>
>>> --
>>> Maciej Szymkiewicz
>>>
>>> nM15AWH.png (19K)
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/0/nM15AWH.png>
>>> KHZa7hL.png (26K)
>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/attachment/20827/1/KHZa7hL.png>
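To see why the number of iterations matters so much, it helps to watch
the plan itself grow: each withColumn wraps the previous plan in another
Project, so every round of analysis and constraint propagation has a
bigger tree to walk. A small self-contained sketch of this (the local
master and column names are only for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("plan-growth")
      .getOrCreate()
    import spark.implicits._

    var df = Seq((1, "foo"), (2, "bar")).toDF("id", "x0")
    for (i <- 1 to 20) {
      df = df.withColumn(s"x$i", $"x0")
      // treeString renders the plan as text; its line count is a rough
      // proxy for the size of the analyzed plan.
      val lines = df.queryExecution.analyzed.treeString.count(_ == '\n')
      println(s"iteration $i: analyzed plan spans $lines lines")
    }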
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/