[ 
https://issues.apache.org/jira/browse/SPARK-47836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tanel Kiis updated SPARK-47836:
-------------------------------
    Description: 
SPARK-29336 caused a severe performance regression.
In practice, a partial aggregate with several approx_percentile calls ran in 
less than an hour, while the final aggregation after the exchange would have 
taken over a week.
A simple percentile ran in about the same time in the first part, and its 
final aggregate completed very quickly.

I made a benchmark, and it reveals that the merge operation is very slow: 
https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e
From my experiments it looks like the cost is O(n^2) in the number of 
partitions (the number of partial aggregations to merge).
When I reverted the changes made in that PR, the "Only insert" and "Insert 
& merge" benchmarks ran in very similar times.

The cause seems to be that compressImmut barely reduces the number of samples 
after merges, so it just keeps iterating over an ever-growing list.
I was not able to figure out how to fix the issue without simply reverting the 
PR.
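The quadratic blow-up described above can be illustrated with a small 
simulation (a hedged sketch in plain Python, not Spark's actual 
QuantileSummaries code; the function name and sizes are made up for 
illustration):

```python
# Hypothetical simulation of the reported behavior: if the compress step
# after each merge removes almost no samples, every merge re-iterates an
# ever-growing sample list, so total work is O(n^2) in the number of
# partitions.

def merge_without_compression(acc, partition):
    # The merge keeps every sample; a compress step that removes (almost)
    # nothing leaves the accumulated list at essentially this size.
    return sorted(acc + partition)

SAMPLES_PER_PARTITION = 100
partitions = [[float(i)] * SAMPLES_PER_PARTITION for i in range(50)]

work = 0   # elements touched across all merges
acc = []
for part in partitions:
    work += len(acc) + len(part)   # cost of one merge pass
    acc = merge_without_compression(acc, part)

# With n partitions the accumulated list grows linearly, so the summed
# merge cost grows quadratically: here 50 partitions touch 127,500
# elements, versus the bounded cost a size-limited sketch would pay.
print(work, len(acc))
```

A compress step that actually bounds the sample list (as before the change) 
would keep each merge pass roughly constant-cost, which matches the "Only 
insert" and "Insert & merge" benchmarks agreeing after the revert.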

I also added a benchmark with KllDoublesSketch from the Apache DataSketches 
project, and it performed even better than this class did before the PR.
The only downside is that it is non-deterministic. 

> Performance problem with QuantileSummaries
> ------------------------------------------
>
>                 Key: SPARK-47836
>                 URL: https://issues.apache.org/jira/browse/SPARK-47836
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Tanel Kiis
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
