Github user kayousterhout commented on the pull request:

    https://github.com/apache/spark/pull/8760#issuecomment-161070513
  
    Before fixing style issues on this change, I think it's worth considering 
whether this is the right approach for blacklisting.  Based on discussion here 
and in #3541 and #10045, it seems like there are a few high-level goals here:
    
    (1) Minimize the required user configuration (ideally this should just work 
without any configuration)
    (2) If an executor/host isn't working well, stop using it for the 
particular job
    (3) If an executor/host is failing for multiple jobs, stop using it across 
all jobs
    (4) Eventually re-try the executor/host, since the failure may be transient
    (5) Don't overcorrect for bad tasks / jobs (that fail regardless of where 
they're run)
    
    My concern with the approach in this PR is that it requires a lot of user 
configuration, and as has been discussed in the various PRs, the appropriate 
blacklist timeout is very workload-dependent and type-of-failure-dependent.  I 
wonder if it would make sense to do something like an exponentially increasing 
timeout (where each consecutive failure triggers a longer timeout) to make this 
have a lower configuration overhead.  I pinged @mateiz to see if he has any 
other ideas about how to do this gracefully.
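    
    To make the exponentially increasing timeout concrete, here is a minimal 
sketch in Python (not Spark code). The class name, the method names, the base 
timeout, and the cap are all hypothetical, chosen only for illustration; the 
doubling factor is likewise an assumption, since the proposal above only says 
that each consecutive failure should trigger a longer timeout.

```python
class BackoffBlacklist:
    """Hypothetical tracker: each consecutive failure doubles the timeout."""

    def __init__(self, base_timeout_s=60.0, max_timeout_s=3600.0):
        # Both defaults are illustrative assumptions, not Spark settings.
        self.base_timeout_s = base_timeout_s
        self.max_timeout_s = max_timeout_s
        self._failures = {}  # executor id -> consecutive failure count
        self._expiry = {}    # executor id -> time when it may be retried

    def record_failure(self, executor_id, now):
        """Blacklist the executor and return the timeout that was applied."""
        count = self._failures.get(executor_id, 0) + 1
        self._failures[executor_id] = count
        # 1st failure -> base, 2nd -> 2 * base, 3rd -> 4 * base, capped at max.
        timeout = min(self.base_timeout_s * 2 ** (count - 1), self.max_timeout_s)
        self._expiry[executor_id] = now + timeout
        return timeout

    def record_success(self, executor_id):
        """A success ends the streak, so the next failure starts from base."""
        self._failures.pop(executor_id, None)
        self._expiry.pop(executor_id, None)

    def is_blacklisted(self, executor_id, now):
        """True while the executor's current timeout has not yet expired."""
        return now < self._expiry.get(executor_id, float("-inf"))
```

    In this sketch the executor becomes schedulable again once its timeout 
expires, which covers goal (4), and a transient failure costs only the short 
base timeout, while a persistently bad executor is backed off for longer 
without the user having to tune a per-workload value.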
    
    The other issue is (5). One way to handle that is to only "count" a task 
failure if the task fails on the executor *and* succeeds elsewhere.
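    
    A rough sketch of that counting rule, again in Python rather than Spark 
code, with hypothetical names throughout. It assumes the scheduler reports 
each task failure and success along with the executor it ran on, and it counts 
a given task at most once per executor.

```python
from collections import defaultdict


class ConfirmedFailureCounter:
    """Hypothetical counter: a failure counts only after a success elsewhere."""

    def __init__(self):
        self._pending = defaultdict(set)    # task id -> executors it failed on
        self._confirmed = defaultdict(int)  # executor id -> confirmed failures

    def task_failed(self, task_id, executor_id):
        # Hold the failure as pending; it may turn out to be the task's fault.
        self._pending[task_id].add(executor_id)

    def task_succeeded(self, task_id, executor_id):
        # The task is evidently runnable, so earlier failures on *other*
        # executors are now attributed to those executors.
        for failed_on in self._pending.pop(task_id, set()):
            if failed_on != executor_id:
                self._confirmed[failed_on] += 1

    def confirmed_failures(self, executor_id):
        return self._confirmed.get(executor_id, 0)
```

    A task that fails everywhere never reaches `task_succeeded`, so its 
failures stay pending and no executor is penalized for it, which is the 
behavior goal (5) asks for.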
    
    It would be great if we could make blacklisting work well out-of-the-box, 
so I think it's worth putting some thought into the right approach here.  It 
would be useful to get other folks' feedback about whether these are the right 
goals and if there are better ideas for how to achieve them.

