c21 commented on pull request #31708: URL: https://github.com/apache/spark/pull/31708#issuecomment-789378350
After checking, the plan change in TPCDS q61 and q90 are valid. The plan change is the outermost shuffle no longer existed. These two queries are doing a broadcast nested loop join, and then doing a `ORDER BY` (global sort). The shuffle with range partitioning is avoided because we are preserving join streamed side partitioning. Regenerated the golden files for these two queries' plans with ``` SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q61)" SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- -z (tpcds-v1.4/q90)" SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite -- -z (tpcds-v1.4/q61)" SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilityWithStatsSuite -- -z (tpcds-v1.4/q90)" ``` Did a micro benchmark on laptop (TPCDS with scale factor 10). Verified there is no performance regression for these two queries. Runtime reduction: q61 (4386ms to 4257ms), and q90 (764ms to 680ms). Given the eliminated shuffle only processing very small amount of data, the runtime reduction here might just be noise. But anyway we get a better query plan here. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
