[ https://issues.apache.org/jira/browse/SPARK-47836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tanel Kiis updated SPARK-47836:
-------------------------------
Description:
SPARK-29336 caused a severe performance regression. In practice, a partial aggregate with several approx_percentile calls ran in less than an hour, while the final aggregation after the exchange would have taken over a week. A simple percentile ran in about the same time in the first part, and its final aggregate completed very quickly.

I made a benchmark, and it reveals that the merge operation is very slow: https://github.com/tanelk/spark/commit/3b16f429a77b10003572295f42361fbfb2f3c63e

From my experiments it looks like the cost is n^2 in the number of partitions (the number of partial aggregations to merge). When I reverted the changes made in that PR, the "Only insert" and "Insert & merge" benchmarks were very similar.

The cause seems to be that compressImmut barely reduces the number of samples at all after merges and just keeps iterating over an ever-growing list. I was not able to figure out how to fix the issue without simply reverting the PR.

I also added a benchmark with KllDoublesSketch from the Apache DataSketches project, and it performed even better than this class did before the PR. The only downside is that it is non-deterministic.

> Performance problem with QuantileSummaries
> ------------------------------------------
>
>                 Key: SPARK-47836
>                 URL: https://issues.apache.org/jira/browse/SPARK-47836
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.5.1
>            Reporter: Tanel Kiis
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
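The quadratic behavior described above can be illustrated with a toy model. This is not Spark's actual QuantileSummaries code — `merge_no_compress`, `merge_with_compress`, and the sample buffers are hypothetical simplifications — but it shows why a compress step that never shrinks the retained samples makes the total merge work grow as n^2 in the number of partitions, while a bounded buffer keeps it linear:

```python
# Toy model (NOT Spark's implementation) of merging partial quantile
# summaries across partitions.

def merge_no_compress(acc, part):
    # Each merge walks every retained sample; if compression never drops
    # anything, the buffer grows by one partition's worth per merge.
    return sorted(acc + part)

def merge_with_compress(acc, part, max_samples=100):
    merged = sorted(acc + part)
    if len(merged) > max_samples:
        # Crude compression: keep an evenly spaced subset so the buffer
        # stays bounded regardless of how many partitions are merged.
        step = len(merged) / max_samples
        merged = [merged[int(i * step)] for i in range(max_samples)]
    return merged

def total_merge_work(n_partitions, samples_per_partition, merge_fn):
    # Count the elements touched by every merge pass.
    work, acc = 0, []
    for _ in range(n_partitions):
        part = [0.0] * samples_per_partition
        work += len(acc) + len(part)
        acc = merge_fn(acc, part)
    return work

# Without compression: work ~ n^2 * s / 2 (acc grows by s every merge).
# With a bounded buffer: work ~ n * (max_samples + s).
uncompressed = total_merge_work(100, 10, merge_no_compress)
bounded = total_merge_work(100, 10, merge_with_compress)
```

With 100 partitions of 10 samples each, the uncompressed variant touches 50,500 elements versus 10,450 for the bounded one, and the gap widens quadratically as partitions are added — matching the observation that the final aggregate over many partial aggregates dominates the runtime.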