Tim Armstrong created IMPALA-7305:
-------------------------------------

             Summary: membership entry for failed impalad gets stuck in 
statestore due to race between failure detection and update processing
                 Key: IMPALA-7305
                 URL: https://issues.apache.org/jira/browse/IMPALA-7305
             Project: IMPALA
          Issue Type: Bug
          Components: Distributed Exec
    Affects Versions: Impala 2.11.0, Impala 2.10.0, Impala 2.9.0, Impala 2.8.0, 
Impala 2.7.0, Impala 2.6.0, Impala 2.5.0
            Reporter: Tim Armstrong
            Assignee: Tim Armstrong
         Attachments: 0001-Repro-CDH-70703.patch

I was able to reproduce this bug on a version of Impala pre-IMPALA-4953 with 
the attached patch, which adds a sleep. The patch is a hack and only works on 
my system (it has a hostname hardcoded). The trick is to kill the third impalad 
manually while the cluster is starting up.

The system then gets stuck in a state where all impalads think the impalad on 
port 22002 is alive, even though the process was actually killed. Queries fail 
because they keep getting scheduled on the dead impalad.
{noformat}
Known backend(s): 3
Address                 Coordinator     Executor
tarmstrong-box:22002    true            true
tarmstrong-box:22001    true            true
tarmstrong-box:22000    true            true
{noformat}

The race seems quite exotic but may be possible if there are intermittent 
transport errors (causing heartbeats to fail) or if there are delays processing 
topics, e.g. contending for locks.

IMPALA-4953 fixes the problem by deleting newly-added transient entries if the 
subscriber got unregistered while the statestore was processing an update.
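
To make the interleaving concrete, here is a minimal toy model in Python. It is not Impala's statestore code: the class, the method names, and the `interleave` hook (which stands in for the injected sleep or a slow lock) are all hypothetical. It only illustrates the race described above and the kind of post-update check that IMPALA-4953 is described as adding.

```python
class ToyStatestore:
    """Toy model of a membership topic; not Impala's actual statestore."""

    def __init__(self, apply_fix):
        self.apply_fix = apply_fix
        self.registered = set()  # subscriber ids currently considered alive
        self.topic = {}          # transient entry key -> owning subscriber id

    def register(self, subscriber_id):
        self.registered.add(subscriber_id)

    def unregister(self, subscriber_id):
        # Failure-detection path: drop the subscriber and its transient entries.
        self.registered.discard(subscriber_id)
        self.topic = {key: owner for key, owner in self.topic.items()
                      if owner != subscriber_id}

    def process_update(self, subscriber_id, keys, interleave=None):
        # The registration check passes here, before the update is applied.
        if subscriber_id not in self.registered:
            return
        # Hypothetical hook: whatever runs between the check and the apply.
        if interleave is not None:
            interleave()
        for key in keys:
            self.topic[key] = subscriber_id
        # Assumed shape of the fix: if the subscriber was unregistered while
        # the update was in flight, delete the entries that were just added.
        if self.apply_fix and subscriber_id not in self.registered:
            for key in keys:
                if self.topic.get(key) == subscriber_id:
                    del self.topic[key]


def run(apply_fix):
    store = ToyStatestore(apply_fix)
    ids = ["impalad-22000", "impalad-22001", "impalad-22002"]
    for sid in ids:
        store.register(sid)
    store.process_update(ids[0], [ids[0]])
    store.process_update(ids[1], [ids[1]])
    # The third impalad is declared failed while its update is being processed.
    store.process_update(ids[2], [ids[2]],
                         interleave=lambda: store.unregister(ids[2]))
    return sorted(store.topic)


print(run(apply_fix=False))  # stale entry for impalad-22002 remains
print(run(apply_fix=True))   # stale entry is cleaned up
```

Without the check, the unregister path cleans up before the entry is added, so nothing ever removes it afterwards. That matches the stuck membership entry shown above.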


