Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15249
Sorry, I haven't followed this PR since it was split off the main one. My response may be a bit fragmented since it addresses several different points; if something doesn't make sense, let me know.
> b) A few seconds to 10's of seconds is usually enough if the problem is
due to memory or disk pressures.
Seems this would vary a lot. I've seen resource pressures last for hours
rather than seconds, e.g. another application pegging all the disks on the node
while it's doing a huge shuffle. I'm not really sure on the memory side; I
assume it was a skewed task, and in this case other tasks finished and you just
happened to have enough memory to finish now? If that's the case, why not just
run it on another executor or node anyway? The odds seem about the same. I
guess if you had the locality wait high enough it might not try another
executor first, or if you had a small enough number of executors it could be an
issue.
Really the temporary-resource case falls into what I was talking about
in the design with allowing more than 1 task attempt failure per executor
(which is why I wanted it configurable). We have seen this on MapReduce too,
but on MR you generally get a few seconds of grace anyway because it has to
relaunch an entire JVM. So one option that seems equivalent to the prior
blacklisting would be to set spark.blacklist.task.maxTaskAttemptsPerExecutor > 1
and add an additional timeout between attempts, which would be basically the
same as spark.scheduler.executorTaskBlacklistTime. Thoughts?
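As a rough sketch of the combination I'm suggesting (the first property exists in this PR; the retry-interval property is hypothetical, just illustrating how the old executorTaskBlacklistTime behavior could come back):

```
# Allow up to 2 failed attempts of the same task on one executor
# before blacklisting that executor for the task (setting from this PR).
spark.blacklist.task.maxTaskAttemptsPerExecutor   2

# Hypothetical new setting: minimum delay before the same task may be
# retried on an executor where it already failed, mirroring the old
# spark.scheduler.executorTaskBlacklistTime semantics.
spark.blacklist.task.retryIntervalMs              15000
```

That way a transient resource pressure has time to clear before the second attempt lands on the same executor.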
I think blacklisting the executor is OK, especially in cases where you have a
bad disk, because YARN should handle some of these cases for you: if you
create another executor on that node, it could get a different list of
disks that leaves out the bad one. It could also have been a transient type of
issue, or the resource one mridul mentioned.