Github user squito commented on the issue:

    https://github.com/apache/spark/pull/22288
  
    OK, I looked at the JIRAs, and this looks like it also covers SPARK-15815, 
right? You could add that to the summary too. 
    
    You mention some future improvements:
    > Taking into account static allocation
    
    I mentioned this in an inline comment too, but now that I'm thinking about 
it, it seems like this will be fine with static allocation as well.  The problem 
just seems worst with dynamic allocation (DA), since you can end up with one 
executor left for the straggler task, and then that executor gets blacklisted.  
But with static allocation, maybe you only requested a small number of executors 
on a large cluster, and by chance they all land on a host with bad disks, so 
everything starts failing.  You could still just kill those executors and 
request new ones to keep things going.  Anything I'm missing?
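    
    To make the static-allocation case concrete, here is a minimal sketch of 
the "kill and replace" idea.  It is a toy model in Python, not Spark's scheduler 
code: `replace_blacklisted` and `request_executor` are hypothetical names, and it 
assumes the cluster manager can always grant a replacement.

```python
from typing import Callable, Set


def replace_blacklisted(
    executors: Set[str],
    blacklisted: Set[str],
    request_executor: Callable[[], str],
) -> Set[str]:
    """Drop blacklisted executors and request one replacement for each.

    Hypothetical helper: the executor count stays at the statically
    requested size, assuming every replacement request succeeds.
    """
    healthy = executors - blacklisted
    killed = executors & blacklisted
    # One replacement per killed executor keeps the total unchanged.
    replacements = {request_executor() for _ in killed}
    return healthy | replacements


if __name__ == "__main__":
    # Both statically allocated executors landed on a bad host.
    new_ids = iter(["exec-3", "exec-4"])
    result = replace_blacklisted(
        {"exec-1", "exec-2"}, {"exec-1", "exec-2"}, lambda: next(new_ids)
    )
    print(sorted(result))  # ['exec-3', 'exec-4']
```

    Even when every original executor is blacklisted, the application ends up 
with the same number of executors rather than hanging with none usable.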
    
    > Querying the RM to figure out if its a small cluster, then try to wait 
some more or abort immediately.
    
    What's the concern here -- that if you're on a small cluster, there is very 
little chance of getting a good replacement, so you should go back to failing 
fast?  I guess that would be nice, but it's much less important in my opinion.
    
    > Try to distinguish between waiting for time while you acquire an executor 
and time for being unable to schedule a task.
    
    I don't understand this part -- do you mean for locality preferences?


---
