peter-toth commented on PR #54558:
URL: https://github.com/apache/spark/pull/54558#issuecomment-3977154855

   
   > > Yes, I agree that recovering from an OOM is a huge win.
   > > My question is mainly about subsequent stages. If there is no resource 
profile set for them, will/might those stages use the 1-core executor? If yes, 
then I still consider the PR a nice improvement, but we probably need to call 
out this behaviour in our documentation or in the config description so that 
users can decide whether they want their jobs to fail fast or to complete with 
possibly increased runtime.
   > 
   > I understand your requirement of fail-fast. Technically, you want to give 
the users the right to disable the whole feature, right?
   
   I believe the new `spark.kubernetes.allocation.recoveryMode.enabled` config 
is a good way to enable/disable the feature, but its description, `If 
true, enables the recovery mode during executor allocation.`, might need a 
better explanation of what exactly it means, and maybe some additional notes 
that the 1-core executor created in recovery mode can stick around for a while 
and might affect the remaining stages of a job.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

