tgravescs commented on PR #43746: URL: https://github.com/apache/spark/pull/43746#issuecomment-1808766570
> spark.executor.maxFailures

Oh, I just realized you added this config, i.e. ported the yarn feature to k8s, and I think you mean spark.executor.maxNumFailures. I had missed this go by.

> It failed because it hit the max executor failures while the root cause was one of the shared UDF jar changed by a developer, who turned out not to be the app owner. Yarn failed to bring up new executors, so the 20 failures were collected within 10 secs.

If users change a jar mid-application, this is really bad IMHO. How do you know your application doesn't get different results on different executors? Say that had actually worked but the logic changed in the UDF. To me this is a process issue: Spark did the right thing in failing, and it should have failed. Would you have known as quickly that someone pushed a bad jar if it hadn't failed? I assume maybe on the next application run sometime later, but it still would have caused some app to fail.

> The probability of apps failing with executor max failures is low for the total amount apps. But it turns out to be a daily issue

I'm not sure I follow this statement. You see this kind of issue daily; is that because users push bad jars that often, or why do you see it daily? I'm trying to understand how much this is really a problem that Spark should be solving. Do you see failures where having the feature on actually helps you? I kind of assume so, since you ported it to k8s, but if not, just turn it off.

I can see a reliability aspect here: if you have a sufficient number of executors already allocated and running, then just keep running instead of killing the entire application. How you achieve that, though, versus this proposal, I'm not sure I agree with. If the user set a minimum number of executors, why isn't this just that number? As one of the other comments stated, this approach is useless for normal users with dynamic allocation, so why doesn't it apply to that case?
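For illustration only, the reliability idea raised in the comment (keep running while enough executors are alive, and use the user's minimum executor count as the threshold) could be sketched roughly as follows. This is not Spark's implementation: `ExecutorState`, `should_fail_application`, and their parameters are hypothetical names, and treating "reached the limit" as `failed >= max_failures` is an assumption made for the sketch. Under dynamic allocation, `min_executors` might plausibly come from spark.dynamicAllocation.minExecutors, but that mapping is also an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorState:
    """Hypothetical snapshot of an application's executors (not a Spark class)."""
    failed: int   # executor failures counted so far
    running: int  # executors currently allocated and running


def should_fail_application(state, max_failures, min_executors):
    """Return True if the application should be killed.

    Illustrative rule only: reaching the failure limit is tolerated as long
    as at least `min_executors` executors are still running.
    """
    if state.failed < max_failures:
        return False  # failure limit not reached yet
    # Limit reached: fail only if too few executors remain to make progress.
    return state.running < min_executors


# Limit of 20 reached, but 8 executors are running and the minimum is 4.
print(should_fail_application(ExecutorState(failed=20, running=8), 20, 4))
# Same failure count, but only 2 executors are left.
print(should_fail_application(ExecutorState(failed=20, running=2), 20, 4))
```

With a rule like this, the application in the first call would keep running and the one in the second would be failed. Whether the threshold should be the configured minimum or something else is exactly the open question in the comment.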
