Thomas Tauber-Marshall created IMPALA-9425:
----------------------------------------------
Summary: Statestore may fail to report when an impalad has failed
Key: IMPALA-9425
URL: https://issues.apache.org/jira/browse/IMPALA-9425
Project: IMPALA
Issue Type: Bug
Components: Distributed Exec
Affects Versions: Impala 3.4.0
Reporter: Thomas Tauber-Marshall
Assignee: Thomas Tauber-Marshall
If an impalad fails and another is restarted at the same host:port combination
quickly, the statestore may fail to report to the coordinators that the impalad
went down.
The reason for this is that in the cluster membership topic, impalads are keyed
by their statestore subscriber id, which is "impalad@host:port". If the new
impalad registers itself before a topic update has been generated for a
particular coordinator, the statestore has no way of knowing that the
particular key was deleted and then re-added since the last update.
The result is that queries that were running on the impalad that failed may not
be cancelled by the coordinator until they pass the unresponsive backend
timeout, which by default is ~12 minutes.
I propose as a solution that we add a concept of uuids for impalads, where each
impalad will generate its own uuid on startup. This allows us to differentiate
between different impalads running at the same host:port combination.
It can also be used to simplify some logic in the scheduler and
ExecutorGroup/ExecutorBlacklist etc. where we currently have data structures
containing info about impalads that are keyed off host/port combinations.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]