GitHub user mridulm commented on the pull request:

    https://github.com/apache/spark/pull/3541#issuecomment-65309912
  
    Note: I am ignoring deterministic failure reasons here (those which will fail on 
any host and usually point to a bug in user code or the Spark codebase).
    Task failure could be due to a variety of transient reasons - which could 
be directly related to the task in question, indirectly related to it, or even 
completely unrelated to it.
    For example: 
    - What other tasks are running on the executor and how they interact with 
the failed task.
    - What data is currently cached on the executor and the impact it has on 
resource utilization (rdd, broadcast, buffers, gc, etc).
    - What the current state of the executor is (for example, in the process of 
shutting down, but has not yet informed the driver about it).
    ... among others.
    
    Also note that when resource constraints are enforced (particularly in 
YARN, where memory limits are aggressively enforced), one or more of the 
above can interact with them to cause further non-deterministic failures, 
which is why we have hacks like memory overheads to help alleviate (though 
not eliminate) them.
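
    To illustrate the memory overhead point, here is a minimal sketch of how a 
container limit with headroom might be computed and checked. The function names, 
the 10% fraction and the 384 MB floor are illustrative assumptions; the real 
values differ between Spark versions and are not taken from this PR.

```python
# Hypothetical helpers; not Spark or YARN APIs.

def container_limit_mb(executor_memory_mb, overhead_fraction=0.10, min_overhead_mb=384):
    """Memory requested from the resource manager: heap plus headroom.

    The fraction and floor are placeholder defaults (assumptions).
    """
    overhead = max(min_overhead_mb, int(executor_memory_mb * overhead_fraction))
    return executor_memory_mb + overhead


def exceeds_limit(heap_mb, off_heap_mb, limit_mb):
    """True if total process memory is over the enforced container limit."""
    return heap_mb + off_heap_mb > limit_mb


limit = container_limit_mb(4096)
# Off-heap usage (buffers, etc.) varies with whatever else runs on the executor,
# so the same task can pass on one attempt and be killed on another.
print(exceeds_limit(4096, 300, limit))
print(exceeds_limit(4096, 500, limit))
```

    The headroom only reduces how often the limit is hit; a large enough spike 
in off-heap usage still gets the container killed.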
    
    Since we have limits on the number of times an application can have executor 
failures, the number of times a task can fail before the application is failed, etc., 
we need an executor-level blacklist.
    Note, this does not mean we do not need a host-level blacklist! I can 
definitely see value in that if the issues above are host level: as pointed 
out, lack of HDD space, bad memory or CPU, thermal issues, etc.
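
    A minimal sketch of keeping both levels side by side. All names and 
thresholds are hypothetical, and the escalation rule (a host is blacklisted 
once enough of its executors are) is an assumption for illustration, not what 
Spark's scheduler does.

```python
from collections import defaultdict


class FailureTracker:
    """Hypothetical tracker; not Spark's scheduler API."""

    def __init__(self, max_failures_per_executor=2, max_bad_executors_per_host=2):
        self.max_failures_per_executor = max_failures_per_executor
        self.max_bad_executors_per_host = max_bad_executors_per_host
        self.failures = defaultdict(int)  # executor id -> task failure count
        self.host_of = {}                 # executor id -> host name

    def record_failure(self, executor_id, host):
        self.host_of[executor_id] = host
        self.failures[executor_id] += 1

    def blacklisted_executors(self):
        return {e for e, n in self.failures.items()
                if n >= self.max_failures_per_executor}

    def blacklisted_hosts(self):
        # Assumed escalation: several bad executors on one host suggest a
        # host-level problem (disk, memory, CPU, thermal).
        bad_per_host = defaultdict(int)
        for e in self.blacklisted_executors():
            bad_per_host[self.host_of[e]] += 1
        return {h for h, n in bad_per_host.items()
                if n >= self.max_bad_executors_per_host}

    def can_schedule(self, executor_id, host):
        return (executor_id not in self.blacklisted_executors()
                and host not in self.blacklisted_hosts())


tracker = FailureTracker()
tracker.record_failure("exec1", "hostA")
tracker.record_failure("exec1", "hostA")
# exec1 is out, but another executor on the same host is still usable.
print(tracker.can_schedule("exec1", "hostA"), tracker.can_schedule("exec2", "hostA"))
```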
    
    Ideally, as I mentioned in the past, we need a better way to identify and 
blacklist executors/hosts/racks.
    What we currently have is a stop-gap hack, and upgrading it from 
executor level to host level does not solve the problem (it actually causes 
regressions in our workloads, since we are now missing a replica completely for 
DFS data and the other replicas might not be on our allocated hosts/executors).
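
    A minimal sketch of that locality regression, using made-up host and 
executor names. The helper is hypothetical and only models which executors can 
still read a block from a local replica.

```python
def usable_local_executors(replica_hosts, executors, blacklisted_executor, host_level):
    """Executors that can still read the block locally.

    replica_hosts: hosts holding a replica of the DFS block
    executors: dict of executor id -> host allocated to the application
    host_level: if True, blacklist every executor on the failed executor's host
    """
    bad_host = executors[blacklisted_executor]
    usable = []
    for executor_id, host in sorted(executors.items()):
        if host not in replica_hosts:
            continue  # no local replica on this host
        if executor_id == blacklisted_executor:
            continue
        if host_level and host == bad_host:
            continue
        usable.append(executor_id)
    return usable


# The block has replicas on hostA, hostX and hostY, but only hostA is allocated to us.
replicas = {"hostA", "hostX", "hostY"}
allocated = {"exec1": "hostA", "exec2": "hostA", "exec3": "hostB"}

print(usable_local_executors(replicas, allocated, "exec1", host_level=False))  # ['exec2']
print(usable_local_executors(replicas, allocated, "exec1", host_level=True))   # []
```

    With the executor-level blacklist, exec2 can still read the block locally; 
with the host-level one, no local option is left and the read has to go remote.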

