c21 commented on pull request #31708:
URL: https://github.com/apache/spark/pull/31708#issuecomment-789378350


   After checking, the plan change in TPCDS q61 and q90 are valid. The plan 
change is the outermost shuffle no longer existed.
   These two queries are doing a broadcast nested loop join, and then doing a 
`ORDER BY` (global sort). The shuffle with range partitioning is avoided 
because we are preserving join streamed side partitioning.
   
   Regenerated the golden files for these two queries' plans with
   
   ```
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- 
-z (tpcds-v1.4/q61)"
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly *PlanStabilitySuite -- 
-z (tpcds-v1.4/q90)"
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
*PlanStabilityWithStatsSuite -- -z (tpcds-v1.4/q61)"
   SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly 
*PlanStabilityWithStatsSuite -- -z (tpcds-v1.4/q90)"
   ```
   
   Did a micro benchmark on laptop (TPCDS with scale factor 10). Verified there 
is no performance regression for these two queries. Runtime reduction: q61 
(4386ms to 4257ms), and q90 (764ms to 680ms). Given the eliminated shuffle only 
processing very small amount of data, the runtime reduction here might just be 
noise. But anyway we get a better query plan here.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to