weixiuli commented on pull request #35425:
URL: https://github.com/apache/spark/pull/35425#issuecomment-1035945019


   
   
   
   
   > This reverts 
[SPARK-36414](https://issues.apache.org/jira/browse/SPARK-36414), right? 
   
   This does not revert 
[SPARK-36414](https://issues.apache.org/jira/browse/SPARK-36414); it keeps the 
timeout disabled for broadcast stages that are converted from shuffle in AQE.
   > One idea is to make the broadcast itself dynamic: it should cancel the job 
if it has already collected much data at the driver side.
   
   This is a good idea. In fact, JD's production deployment already does this 
in non-AQE mode by checking whether a broadcast stage has any running tasks. If 
a broadcast stage times out with no tasks running, it has not been scheduled 
yet and should wait again (we use spark.sql.broadcastMaxRetries for that). If a 
broadcast stage times out with some tasks running, we should cancel the job.
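
   A minimal sketch of the decision described above, written as plain Python 
rather than actual Spark code. The names `TimeoutAction` and 
`on_broadcast_timeout` are hypothetical, `max_retries` stands in for the value 
of spark.sql.broadcastMaxRetries, and failing once the retries are exhausted is 
an assumption, since that case is not spelled out above.

```python
from enum import Enum


class TimeoutAction(Enum):
    RETRY_WAIT = "retry_wait"  # stage not scheduled yet: wait for another timeout period
    CANCEL_JOB = "cancel_job"  # stage is running but too slow: cancel the job
    FAIL = "fail"              # assumption: give up once the retries are exhausted


def on_broadcast_timeout(running_tasks: int, retries_used: int, max_retries: int) -> TimeoutAction:
    """Decide what to do when a broadcast stage hits its timeout.

    running_tasks: number of tasks of the broadcast stage currently running.
    retries_used:  how many times we have already waited again.
    max_retries:   stand-in for spark.sql.broadcastMaxRetries (hypothetical here).
    """
    if running_tasks > 0:
        # Some tasks are running, so the stage was scheduled and is genuinely slow.
        return TimeoutAction.CANCEL_JOB
    if retries_used < max_retries:
        # No task is running, so the stage is still waiting for resources.
        return TimeoutAction.RETRY_WAIT
    # Assumption: an unscheduled stage that exhausted its retries fails the query.
    return TimeoutAction.FAIL
```

   For example, `on_broadcast_timeout(0, 0, 2)` asks the caller to wait again, 
while `on_broadcast_timeout(3, 0, 2)` asks it to cancel the job.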
   
   What do you think about using the mechanism above for AQE broadcast 
stages (those not converted from shuffle)? @cloud-fan
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


