cloud-fan commented on pull request #35425:
URL: https://github.com/apache/spark/pull/35425#issuecomment-1035950763


   > it keeps the timeout disabled for broadcast stages that are converted from 
shuffle in AQE.
   
   This is not sufficient. The timeout can never be accurate because the 
scheduler is a black box to the SQL engine. We don't know what happens there 
and can't tell whether we should keep waiting or give up.
   
   The same problem applies to your proposal as well. Maybe the broadcast job 
has been waiting for 10 minutes before it submits its first task, then hits the 
timeout and is killed.
   
   I think the broadcast job should be bounded by data size, not time. The 
scheduler is a black box and time doesn't tell us much, but data size is 
accurate. The broadcast job knows when each task completes and gets that 
task's result. If the collected data size is too large, we can fail this 
broadcast job (or fall back to a shuffle join?).
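   To make the idea concrete, here is a minimal sketch of size-based 
instead of time-based failure. All names here (`SizeBoundedBroadcastCollector`, 
`maxBroadcastSize`, `onTaskResult`) are hypothetical, not existing Spark APIs; 
it only illustrates accumulating result sizes as tasks complete and aborting 
once a byte limit is crossed:

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical helper: collects broadcast task results and gives up once
// the accumulated size exceeds a configured limit, independent of how long
// the scheduler took to run the tasks.
class SizeBoundedBroadcastCollector(maxBroadcastSize: Long) {
  private var collectedBytes = 0L
  private val results = ArrayBuffer.empty[Array[Byte]]

  // Called each time a broadcast task finishes with its serialized result.
  // Returns false when the accumulated size exceeds the limit, signalling
  // the caller to fail the broadcast (or fall back to a shuffle join).
  def onTaskResult(result: Array[Byte]): Boolean = {
    collectedBytes += result.length
    if (collectedBytes > maxBroadcastSize) {
      false // too large: abort instead of waiting on a wall-clock timeout
    } else {
      results += result
      true
    }
  }
}
```

   The point of the sketch is that the decision is driven by information the 
broadcast job actually has (bytes collected so far), not by guessing at 
scheduler behavior through a timeout.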


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
