[GitHub] [spark] zhongyu09 opened a new pull request #31084: [SPARK-33933][SQL][3.0] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE

GitBox Thu, 07 Jan 2021 06:48:39 -0800


zhongyu09 opened a new pull request #31084:
URL: https://github.com/apache/spark/pull/31084



   This PR is the same change as https://github.com/apache/spark/pull/30998, 
targeted at branch 3.0.
   
   ### What changes were proposed in this pull request?
   In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, 
sort the new stages by class type so that every BroadcastQueryStage precedes 
the others.
   This ensures that broadcast jobs are submitted before map jobs, so they do 
not sit waiting for job scheduling and hit the broadcast timeout.
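
   The ordering step can be illustrated with a minimal, self-contained sketch. 
This is not the Spark code (which is Scala); the classes below are hypothetical 
stand-ins that only borrow the stage names from this PR, and the helper name 
broadcast_first is an assumption made for illustration. It relies on Python's 
sort being stable, so stages of the same kind keep their original relative 
order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryStage:
    """Hypothetical stand-in for a query stage; only carries an id."""
    stage_id: int


class ShuffleQueryStage(QueryStage):
    """Stand-in for a stage whose materialization submits a map job."""


class BroadcastQueryStage(QueryStage):
    """Stand-in for a stage whose materialization submits a broadcast job."""


def broadcast_first(stages):
    """Return the stages with all broadcast stages ahead of the others.

    sorted() is stable, so the relative order inside each group is kept.
    """
    return sorted(
        stages,
        key=lambda s: 0 if isinstance(s, BroadcastQueryStage) else 1,
    )


new_stages = [
    ShuffleQueryStage(0),
    BroadcastQueryStage(1),
    ShuffleQueryStage(2),
    BroadcastQueryStage(3),
]
ordered = broadcast_first(new_stages)
print([type(s).__name__ for s in ordered])
```

   Materializing the stages in the returned order would submit the broadcast 
jobs before any map job.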
   
   ### Why are the changes needed?
   With AQE enabled, getFinalPhysicalPlan traverses the physical plan bottom-up, 
creates query stages for the materialized parts via createQueryStages, and 
materializes the newly created query stages, which submits map stages or 
broadcasts. When a ShuffleQueryStage is materialized before a 
BroadcastQueryStage, the map job and the broadcast job are submitted at almost 
the same time, but the map job holds all the computing resources. If the map 
job runs slowly (a lot of data to process with limited resources), the 
broadcast job cannot start (and finish) within spark.sql.broadcastTimeout, 
which causes the whole job to fail (introduced in SPARK-31475).
   Increasing spark.sql.broadcastTimeout as a workaround is neither sensible 
nor graceful, because the data to broadcast is very small.
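
   The effect of submission order can be seen in a toy model. The sketch below 
is an assumption-laden simplification, not Spark's scheduler: it assumes strict 
FIFO scheduling over a fixed number of task slots, uniform task durations per 
job, and made-up job sizes. The function name finish_times and all numbers are 
hypothetical; the 300-second timeout matches the default of 
spark.sql.broadcastTimeout.

```python
import heapq


def finish_times(jobs, slots):
    """Finish time of each job under strict FIFO scheduling.

    jobs:  list of (name, num_tasks, task_seconds) in submission order.
    slots: number of tasks that can run at the same time.
    Every task of an earlier job is placed before any task of a later job.
    """
    free_at = [0.0] * slots  # time at which each slot becomes free
    heapq.heapify(free_at)
    done = {}
    for name, num_tasks, task_seconds in jobs:
        last_end = 0.0
        for _ in range(num_tasks):
            start = heapq.heappop(free_at)  # earliest available slot
            end = start + task_seconds
            heapq.heappush(free_at, end)
            last_end = max(last_end, end)
        done[name] = last_end
    return done


BROADCAST_TIMEOUT = 300.0        # seconds
MAP_JOB = ("map", 40, 60.0)      # hypothetical: 40 tasks of 60 s each
BROADCAST_JOB = ("broadcast", 1, 5.0)  # hypothetical: one 5 s task

map_first = finish_times([MAP_JOB, BROADCAST_JOB], slots=4)
broadcast_first = finish_times([BROADCAST_JOB, MAP_JOB], slots=4)

for label, result in (("map first", map_first),
                      ("broadcast first", broadcast_first)):
    status = "ok" if result["broadcast"] <= BROADCAST_TIMEOUT else "timeout"
    print(f"{label}: broadcast done at {result['broadcast']} s ({status})")
```

   In this model the small broadcast finishes after 605 s when the map job is 
submitted first, well past the timeout, and after 5 s when it is submitted 
first, while the map job ends at 605 s in both orderings.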
   
   
   ### Does this PR introduce _any_ user-facing change?
   NO
   
   
   ### How was this patch tested?
   1. Added a unit test.
   2. Tested the code in a dev environment; see 
https://issues.apache.org/jira/browse/SPARK-33933


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



