weixiuli edited a comment on pull request #35425: URL: https://github.com/apache/spark/pull/35425#issuecomment-1035945019
> This reverts [SPARK-36414](https://issues.apache.org/jira/browse/SPARK-36414), right?

This does not revert [SPARK-36414](https://issues.apache.org/jira/browse/SPARK-36414); it keeps the timeout disabled for broadcast stages that are converted from shuffle in AQE.

> One idea is to make the broadcast itself dynamic: it should cancel the job if it has already collected much data at the driver side.

This is a good idea. In fact, JD's production setup already does this in non-AQE mode by checking whether broadcast stages have tasks running. If a broadcast stage times out with no tasks running, it means the stage has not been scheduled yet, so we should retry and keep waiting (we use `spark.sql.broadcastMaxRetries` for that); if it times out with some tasks running, we should cancel the job.

What do you think about using the mechanism above for AQE broadcast stages (those not converted from shuffle)? @cloud-fan
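The retry-or-cancel policy described above can be sketched as follows. This is a minimal illustration, not Spark code: the function name is hypothetical, and `max_retries` stands in for the `spark.sql.broadcastMaxRetries` setting mentioned in the comment.

```python
def on_broadcast_timeout(num_running_tasks: int, retries_so_far: int,
                         max_retries: int) -> str:
    """Decide what to do when a broadcast stage hits its timeout.

    Hypothetical sketch of the policy described above:
    - no tasks running  -> the stage was likely never scheduled, so retry
      and keep waiting (bounded by max_retries, standing in for
      spark.sql.broadcastMaxRetries)
    - tasks running     -> the broadcast is genuinely slow, so cancel the job
    """
    if num_running_tasks == 0 and retries_so_far < max_retries:
        return "retry"
    return "cancel"
```

Under this sketch, a stage stuck behind scheduling delays is retried up to the configured limit, while a stage that is actively running but still exceeds the timeout is treated as oversized and cancelled.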
