Tim Armstrong created IMPALA-7305:
-------------------------------------
Summary: membership entry for failed impalad gets stuck in
statestore due to race between failure detection and update processing
Key: IMPALA-7305
URL: https://issues.apache.org/jira/browse/IMPALA-7305
Project: IMPALA
Issue Type: Bug
Components: Distributed Exec
Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0,
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
Reporter: Tim Armstrong
Assignee: Tim Armstrong
Attachments: 0001-Repro-CDH-70703.patch
I was able to reproduce this bug on a pre-IMPALA-4953 version of Impala with
the attached patch, which adds a sleep. The patch is a hack and only works on my
system (it has a name hardcoded). The trick is to kill the third impalad
manually while the cluster is starting up.
The system then gets stuck in a state where all impalads think 22002 is alive
even though that process was actually killed. Queries fail because they keep
getting scheduled on the dead impalad.
{noformat}
Known backend(s): 3
Address               Coordinator  Executor
tarmstrong-box:22002  true         true
tarmstrong-box:22001  true         true
tarmstrong-box:22000  true         true
{noformat}
The race seems quite exotic but may be possible if there are intermittent
transport errors (causing heartbeats to fail) or if there are delays processing
topics, e.g. contending for locks.
IMPALA-4953 fixes the problem by deleting newly-added transient entries if the
subscriber got unregistered while the statestore was processing an update.
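
To make the interleaving concrete, here is a minimal toy model of the race and of the cleanup described above. It is only a sketch and not Impala's actual statestore code: the class ToyStatestore, its methods, the interleave hook (standing in for the sleep in the repro patch) and the cleanup_after_update flag are all hypothetical names made up for illustration.

```python
class ToyStatestore:
    """Toy model of a statestore membership topic (not Impala's real code)."""

    def __init__(self, cleanup_after_update):
        # Hypothetical flag: True models the IMPALA-4953 style cleanup.
        self.cleanup_after_update = cleanup_after_update
        self.registered = set()   # subscriber ids currently registered
        self.membership = {}      # transient topic entry -> owning subscriber

    def register(self, sub_id):
        self.registered.add(sub_id)

    def unregister(self, sub_id):
        """Failure-detection path: drop the subscriber and its transient entries."""
        self.registered.discard(sub_id)
        self.membership = {
            key: owner for key, owner in self.membership.items() if owner != sub_id
        }

    def process_update(self, sub_id, entries, interleave=None):
        """Update-processing path: apply topic entries sent by a subscriber.

        `interleave` is a hypothetical hook that runs between accepting the
        update and applying it, which is where the race window sits.
        """
        if sub_id not in self.registered:
            return
        if interleave is not None:
            interleave()
        added = []
        for key in entries:
            self.membership[key] = sub_id
            added.append(key)
        # Cleanup: if the subscriber was unregistered while the update was
        # being processed, delete the transient entries that were just added.
        if self.cleanup_after_update and sub_id not in self.registered:
            for key in added:
                self.membership.pop(key, None)


def run(cleanup_after_update):
    store = ToyStatestore(cleanup_after_update)
    sub = "impalad@tarmstrong-box:22002"
    store.register(sub)
    # The subscriber is declared failed while its update is still in flight.
    store.process_update(
        sub,
        ["tarmstrong-box:22002"],
        interleave=lambda: store.unregister(sub),
    )
    return sorted(store.membership)


print(run(cleanup_after_update=False))  # ['tarmstrong-box:22002'] (stale entry stuck)
print(run(cleanup_after_update=True))   # [] (stale entry removed)
```

Without the cleanup, the unregister step runs before the entry is added, so nothing ever deletes the entry for the dead impalad. With the cleanup, the update path notices the subscriber is gone and removes what it just added.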