peter-toth commented on PR #54558:
URL: https://github.com/apache/spark/pull/54558#issuecomment-3976717458

   > > * What happens after a successful recovery? Will the remaining 
stages/tasks use the 1 core executor?
   > 
   > 
   > 
   > Here, a successful recovery should be considered a job completion. 
If OOM kills executors one by one due to retries, the job 
fails eventually without moving to the next stage. And if we consider only a 
single stage, yes: the set of executors will not change further if there is no 
executor loss.
   
   Yes, I agree that recovering from an OOM is a huge win.
   My question is mainly about subsequent stages. If no resource 
profile is set for them, will (or might) those stages use the 1-core executor? If yes, 
I still consider the PR a nice improvement, but we probably need to call 
out this behaviour in our documentation so that users can decide whether they 
want their jobs to fail fast or to complete with a possibly longer runtime.
   
   > > * What happens if `spark.task.cpus` is set to >1?
   > 
   > 
   > 
   > Ya, that's the valid corner case. Let me disable this feature for that 
configuration.
   
   Would it make sense to request `spark.task.cpus` CPUs in recovery mode, 
so that the feature doesn't have to be disabled when the value is >1?
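   To illustrate the corner case: an executor runs at most 
`executor cores / spark.task.cpus` tasks concurrently, so a 1-core executor 
has no usable slot once `spark.task.cpus` is 2 or more. Below is a minimal 
sketch of the arithmetic only; the helper names `task_slots` and 
`recovery_executor_cores` are hypothetical and are not taken from the PR.

```python
def task_slots(executor_cores: int, task_cpus: int) -> int:
    """Concurrent task slots on one executor (integer division)."""
    if executor_cores < 1 or task_cpus < 1:
        raise ValueError("executor_cores and task_cpus must be >= 1")
    return executor_cores // task_cpus


def recovery_executor_cores(task_cpus: int) -> int:
    """Hypothetical sizing rule: give a recovery executor exactly enough
    cores to run a single task, i.e. spark.task.cpus cores."""
    if task_cpus < 1:
        raise ValueError("task_cpus must be >= 1")
    return task_cpus


if __name__ == "__main__":
    for cpus in (1, 2, 4):
        fixed = task_slots(1, cpus)  # recovery executor pinned to 1 core
        sized = task_slots(recovery_executor_cores(cpus), cpus)
        print(f"spark.task.cpus={cpus}: 1-core slots={fixed}, sized slots={sized}")
```

Under this assumed rule the recovery executor always has exactly one task 
slot, which keeps the one-task-at-a-time behaviour without disabling the feature.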
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

