peter-toth commented on PR #54558: URL: https://github.com/apache/spark/pull/54558#issuecomment-3976717458
> > * What happens after a successful recovery? Will the remaining stages/tasks use the 1-core executor?
>
> Here, a successful recovery should be considered a job completion. If OOM kills executors one by one consecutively due to the retry, the job eventually fails without moving to the next stage. And, if we consider only a single stage, yes: the set of executors will not change further if there is no executor loss.

Yes, I agree that recovering from an OOM is a huge win. My question is mainly about subsequent stages. If there is no resource profile set for them, will/might those stages use the 1-core executor? If yes, then I still consider the PR a nice improvement; we probably just need to call out this behaviour in our documentation so that users can decide whether they want their jobs to fail fast or complete with a possibly increased runtime.

> > * What happens if `spark.task.cpus` is set to >1?
>
> Ya, that's a valid corner case. Let me disable this feature for that configuration.

Would it make sense to set `spark.task.cpus` number of CPUs in recovery mode, and so not disable it if >1?
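To make the suggestion concrete, here is a minimal sketch (not the PR's actual implementation; `recoveryExecutorCores` is a hypothetical helper) of sizing the recovery executor from `spark.task.cpus` instead of disabling the feature when it is >1:

```scala
// Hypothetical sketch of the suggestion above: rather than disabling
// OOM-recovery mode when spark.task.cpus > 1, size the recovery
// executor so it can still run exactly one task.
object RecoverySizing {
  def recoveryExecutorCores(taskCpus: Int): Int = {
    require(taskCpus >= 1, "spark.task.cpus must be >= 1")
    // One task's worth of CPUs: a 1-core executor when taskCpus == 1,
    // otherwise taskCpus cores so that a single task still fits.
    math.max(1, taskCpus)
  }
}
```

For example, `recoveryExecutorCores(4)` would request a 4-core recovery executor for `spark.task.cpus=4`, preserving the "one task at a time" recovery semantics without failing fast.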
