Re: [PR] [SPARK-45873][CORE][YARN][K8S] Make ExecutorFailureTracker more tolerant when app remains sufficient resources [spark]

via GitHub Mon, 13 Nov 2023 10:31:09 -0800


tgravescs commented on PR #43746:
URL: https://github.com/apache/spark/pull/43746#issuecomment-1808766570


   > spark.executor.maxFailures
   
   Oh, I just realized you added this config, i.e. ported the YARN feature to K8s, 
and I think you mean spark.executor.maxNumFailures. I had missed that change when it went by.
   
   >  It failed because it hit the max executor failures while the root cause 
was one of the shared UDF jar changed by a developer, who turned out not to be 
the app owner. Yarn failed to bring up new executors, so the 20 failures were 
collected within 10 secs.
   
   If a user changes a jar mid-application, that is really bad IMHO. How do you 
know your application doesn't get different results on different executors? 
Say it had actually worked but the logic in the UDF had changed. To me this is 
a process problem; Spark did the right thing in failing, and it should 
have failed. Would you have known as quickly that someone pushed a bad jar if 
it hadn't failed? I assume you would have found out on the next application run 
some time later, but it still would have caused some app to fail.
   
   
   > The probability of apps failing with executor max failures is low for the 
total amount apps. But it turns out to be a daily issue
   
   I'm not sure I follow this statement. You see this kind of issue daily: is 
it because users push bad jars that often, or is there some other reason? I'm 
trying to understand how much this is really a problem that Spark should be 
solving. Do you see failures where having the feature on actually helps you? 
I assume so, since you ported it to K8s, but if not, just turn it off.
   
   I can see a reliability aspect here: if you have a sufficient number of 
executors already allocated and running, then just keep running instead of 
killing the entire application. I'm not sure I agree with how this proposal 
achieves that, though. If the user set a minimum number of executors, why isn't 
the threshold just that number? As one of the other comments stated, this 
approach is useless for normal users with dynamic allocation, so why doesn't it 
apply to that case?
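
   To make the alternative in the last paragraph concrete, here is a minimal 
sketch in Python of a check that aborts only when the failure limit is hit and 
the running executors have dropped below the user's configured minimum. This is 
not Spark's ExecutorFailureTracker and not what the PR implements. Every name 
here (ExecutorState, should_fail_app, the field names) is hypothetical. It is 
also an assumption that the minimum would come from a setting such as 
spark.dynamicAllocation.minExecutors, and that the limit is treated as reached 
once the failure count equals it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorState:
    """Snapshot of the values the decision needs (all names hypothetical)."""
    recent_failures: int    # executor failures counted in the current window
    max_failures: int       # e.g. the value of spark.executor.maxNumFailures
    running_executors: int  # executors currently allocated and running
    min_executors: int      # user-configured minimum number of executors


def should_fail_app(state: ExecutorState) -> bool:
    """Return True when the application should be aborted."""
    # Assumption: the limit counts as reached when failures equal it.
    if state.recent_failures < state.max_failures:
        return False
    # Limit reached: keep running only if the user's own minimum is still met.
    return state.running_executors < state.min_executors
```

   With these assumptions, an app that hits 20 failures while still holding its 
minimum of 5 executors keeps running, and the same app with only 2 executors 
left is failed. The point of using the user's own minimum is that it applies 
equally with and without dynamic allocation.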


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

