On Fri, Mar 29, 2019 at 6:18 PM Reynold Xin <r...@databricks.com> wrote:

> We tried enabling blacklisting for some customers and in the cloud, very
> quickly they end up having 0 executors due to various transient errors. So
> unfortunately I think the current implementation is terrible for cloud
> deployments, and shouldn't be on by default. The heart of the issue is that
> the current implementation is not great at dealing with transient errors vs
> catastrophic errors.
>

+1.

It contains the assumption that "Blacklisting is the solution", when really
"reporting to something which can opt to destroy the node and request a new
one" is better

Having some way to report those failures to a monitor like that can combine
app-level failure detection with cloud-infra reaction.

it's also interesting to look at more complex failure evalators, where
the Φ Accrual Failure Detector is an interesting option

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.80.7427&rep=rep1&type=pdf

Apparently you can use this with Akka:
https://manuel.bernhardt.io/2017/07/26/a-new-adaptive-accrual-failure-detector-for-akka/

again, making this something where people can experiment with algorithms is
a nice way to let interested parties explore the options in different
environments

Reply via email to