gaoyajun02 commented on PR #38333: URL: https://github.com/apache/spark/pull/38333#issuecomment-1300369485
> For cases like this, it might actually be better to fail the task (and recompute the parent stage) - and leverage deny list to prevent tasks from running on the problematic node ?

I think it is not necessary to recompute the parent stage. This situation is similar to chunk corruption, so we can fall back to re-fetching the shuffle block. The reasons are:
1. The shuffle blocks are still available to fetch.
2. Recomputing the parent stage is very expensive in large jobs and would make the application's execution time longer.
3. We observed that the chance of these bad nodes losing data is very low; it may occur only once every few days.
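The fallback argued for above can be sketched generically. This is a hypothetical illustration of the retry-before-recompute idea, not Spark's actual shuffle-fetch code; all names (`fetch_block`, `FetchFailedError`, `flaky_fetch`) are invented for the example:

```python
# Hypothetical sketch of the fallback idea: when a fetched shuffle block is
# corrupt or lost, re-fetch it a few times before failing the task, since
# failing the task forces the expensive parent-stage recomputation.

class FetchFailedError(Exception):
    """Raised when a shuffle block cannot be fetched after all retries."""

def fetch_block(block_id, fetch_fn, max_retries=3):
    """Try to fetch block_id; on corruption/loss, re-fetch instead of
    immediately failing the task."""
    for _attempt in range(max_retries):
        data = fetch_fn(block_id)
        if data is not None:  # block fetched and passed integrity check
            return data
    # Only after exhausting retries do we fail the task, letting the
    # scheduler recompute the parent stage and deny-list the bad node.
    raise FetchFailedError(block_id)

# Usage: a flaky source that succeeds on the second attempt.
attempts = {"n": 0}
def flaky_fetch(block_id):
    attempts["n"] += 1
    return b"shuffle-data" if attempts["n"] >= 2 else None

print(fetch_block("shuffle_0_1_2", flaky_fetch))
```

Since data loss on these nodes is rare, the retry path is cheap in the common case, while the recompute path remains available as the last resort.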
