gaoyajun02 commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1300369485

   > For cases like this, it might actually be better to fail the task (and 
recompute the parent stage) - and leverage deny list to prevent tasks from 
running on the problematic node ?
   
   I think it is not necessary to recompute the parent stage. This situation 
is similar to chunk corruption: we can fall back to fetching the original 
shuffle blocks. The reasons are:
   1. The original shuffle blocks are still available to fetch.
   2. Recomputing the parent stage is very expensive for large jobs and 
would lengthen the application's execution time.
   3. We observed that the chance of these bad nodes losing data is very 
low; it may only happen once every few days.
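
   A minimal sketch of the fallback idea described above (all class and 
method names here are illustrative placeholders, not Spark's actual API): 
when the fetched merged chunk turns out to be corrupt or lost, fetch the 
original unmerged blocks instead of throwing a fetch failure that would 
trigger a parent-stage recompute.

   ```java
   public class FallbackSketch {
       // Simulated fetch of a merged (push-based) chunk; a corrupt or
       // lost chunk is modeled here as null.
       static String fetchMergedChunk(boolean nodeHealthy) {
           return nodeHealthy ? "merged-chunk" : null;
       }

       // Simulated fetch of the original per-map shuffle blocks, which
       // remain available on the map-output nodes.
       static String fetchOriginalBlocks() {
           return "original-blocks";
       }

       // Fall back to the original blocks rather than failing the task
       // (which would force an expensive parent-stage recompute).
       static String fetchWithFallback(boolean nodeHealthy) {
           String chunk = fetchMergedChunk(nodeHealthy);
           return chunk != null ? chunk : fetchOriginalBlocks();
       }

       public static void main(String[] args) {
           System.out.println(fetchWithFallback(false)); // prints "original-blocks"
       }
   }
   ```

   Since the bad-node case is rare (point 3 above), the fallback path 
would only add fetch latency for the occasional corrupt chunk, while the 
common healthy path is unaffected.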


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
