[
https://issues.apache.org/jira/browse/SPARK-40499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17683640#comment-17683640
]
xuanzhiang commented on SPARK-40499:
------------------------------------
now when we use PERCENTILE_APPROX, we need disable objHashAggregate. Or we
choose use PERCENTILE,it use HashAggregate and shuffle normal.
> Spark 3.2.1 percentlie_approx query much slower than Spark 2.4.0
> ----------------------------------------------------------------
>
> Key: SPARK-40499
> URL: https://issues.apache.org/jira/browse/SPARK-40499
> Project: Spark
> Issue Type: Bug
> Components: Shuffle
> Affects Versions: 3.2.1
> Environment: hadoop: 3.0.0
> spark: 2.4.0 / 3.2.1
> shuffle:spark 2.4.0
> Reporter: xuanzhiang
> Priority: Major
> Attachments: spark2.4-shuffle-data.png, spark3.2-shuffle-data.png
>
>
> spark.sql(
> s"""
> |SELECT
> | Info ,
> | PERCENTILE_APPROX(cost,0.5) cost_p50,
> | PERCENTILE_APPROX(cost,0.9) cost_p90,
> | PERCENTILE_APPROX(cost,0.95) cost_p95,
> | PERCENTILE_APPROX(cost,0.99) cost_p99,
> | PERCENTILE_APPROX(cost,0.999) cost_p999
> |FROM
> | textData
> |""".stripMargin)
> * When we used spark 2.4.0, aggregation adopted objHashAggregator, stage 2
> pull shuffle data very quick . but , when we use spark 3.2.1 and use old
> shuffle , 140M shuffle data cost 3 hours.
> * If we upgrade the Shuffle, will we get performance regression?
> *
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]