[ 
https://issues.apache.org/jira/browse/IMPALA-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16545764#comment-16545764
 ] 

Tim Armstrong commented on IMPALA-7305:
---------------------------------------

Note that the only workaround is to restart the statestore.

> membership entry for failed impalad gets stuck in statestore due to race 
> between failure detection and update processing
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7305
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7305
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, 
> Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Major
>             Fix For: Impala 2.12.0, Impala 3.1.0
>
>         Attachments: 0001-Repro-CDH-70703.patch
>
>
> I was able to reproduce this bug on a version of Impala pre-IMPALA-4953 with 
> the attached patch that adds a sleep. The patch is a hack and only works on 
> my system (it has a name hardcoded). The trick is to kill the third impalad 
> manually while the cluster is starting up.
> Then the system gets stuck in a state where all impalads think 22002 is alive 
> but the process was actually killed. Running queries fails because they keep 
> getting scheduled on the dead impalad.
> {noformat}
> Known backend(s): 3
> Address               Coordinator  Executor
> tarmstrong-box:22002  true         true
> tarmstrong-box:22001  true         true
> tarmstrong-box:22000  true         true
> {noformat}
> The race seems quite exotic but may be possible if there are intermittent 
> transport errors (causing heartbeats to fail) or if there are delays 
> processing topics, e.g. contending for locks.
> IMPALA-4953 fixes the problem by deleting newly-added transient entries if 
> the subscriber got unregistered while the statestore was processing an update.
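
The race and the cleanup described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Impala's statestore code: the class ToyStatestore, the during_update hook, and the names "impalad-3" and "host:22002" are all hypothetical. The hook stands in for the window in which failure detection runs while an update is still being processed.

```python
class ToyStatestore:
    """Toy model of a membership topic; not Impala's actual statestore code."""

    def __init__(self, apply_fix):
        self.apply_fix = apply_fix
        self.registered = set()   # subscriber ids currently considered alive
        self.membership = {}      # entry key -> owning subscriber id

    def register(self, subscriber_id):
        self.registered.add(subscriber_id)

    def unregister(self, subscriber_id):
        # Failure-detection path: drop the subscriber and its transient entries.
        self.registered.discard(subscriber_id)
        stale = [k for k, owner in self.membership.items() if owner == subscriber_id]
        for key in stale:
            del self.membership[key]

    def process_update(self, subscriber_id, new_keys, during_update=None):
        # `during_update` is a hypothetical hook that simulates failure
        # detection firing in the middle of update processing.
        if during_update is not None:
            during_update()
        for key in new_keys:
            self.membership[key] = subscriber_id
        # Cleanup in the spirit of the fix: if the subscriber was unregistered
        # while the update was in flight, delete the entries just added.
        if self.apply_fix and subscriber_id not in self.registered:
            for key in new_keys:
                self.membership.pop(key, None)


def run(apply_fix):
    store = ToyStatestore(apply_fix)
    store.register("impalad-3")
    store.process_update(
        "impalad-3",
        ["host:22002"],
        during_update=lambda: store.unregister("impalad-3"),
    )
    # Return the entries that remain in the membership topic.
    return sorted(store.membership)
```

In this model, run(False) leaves the entry for the dead subscriber in the membership topic with nothing left to remove it, which mirrors the stuck state above. run(True) returns an empty membership because the newly added entry is deleted once the subscriber is found to be unregistered.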



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
