[jira] [Commented] (IMPALA-7305) membership entry for failed impalad gets stuck in statestore due to race between failure detection and update processing

ASF subversion and git services (JIRA) Mon, 24 Sep 2018 21:55:09 -0700


    [ https://issues.apache.org/jira/browse/IMPALA-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16626769#comment-16626769 ]


ASF subversion and git services commented on IMPALA-7305:
---------------------------------------------------------

Commit e38715e25297cc3643482be04e3b1b273e339b54 in impala's branch 
refs/heads/master from [[email protected]]
[ https://git-wip-us.apache.org/repos/asf?p=impala.git;h=e38715e ]

IMPALA-7306: regression test for non-removed transient updates

Adds a test for IMPALA-7305 that reproduces the bug by delaying
heartbeats and updates.

Increased some timeouts in the test because they were hit
once after looping for ~12 hours.

Testing:
Manually reintroduced the bug by commenting out the code that
fixed it and confirmed that the test failed.

Change-Id: I6c2a39d8a76cb5371f394b5a97817d8231e473cc
Reviewed-on: http://gerrit.cloudera.org:8080/11470
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> membership entry for failed impalad gets stuck in statestore due to race 
> between failure detection and update processing
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IMPALA-7305
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7305
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 2.5.0, Impala 2.6.0, Impala 2.7.0, Impala 2.8.0, 
> Impala 2.9.0, Impala 2.10.0, Impala 2.11.0
>            Reporter: Tim Armstrong
>            Assignee: Tim Armstrong
>            Priority: Critical
>             Fix For: Impala 2.12.0, Impala 3.1.0
>
>         Attachments: 0001-Repro-CDH-70703.patch
>
>
> I was able to reproduce this bug on a version of Impala pre-IMPALA-4953 with 
> the attached patch that adds a sleep. The patch is a hack and only works on 
> my system (it has a name hardcoded). The trick is to kill the third impalad 
> manually while the cluster is starting up.
> The system then gets stuck in a state where all impalads think 22002 is alive 
> even though the process was actually killed. Queries fail because they keep 
> getting scheduled on the dead impalad.
> {noformat}
> Known backend(s): 3
> Address               Coordinator  Executor
> tarmstrong-box:22002  true         true
> tarmstrong-box:22001  true         true
> tarmstrong-box:22000  true         true
> {noformat}
> The race seems quite exotic but may be possible if there are intermittent 
> transport errors (causing heartbeats to fail) or if there are delays 
> processing topics, e.g. contending for locks.
> IMPALA-4953 fixes the problem by deleting newly-added transient entries if 
> the subscriber got unregistered while the statestore was processing an update.
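
The race and the fix described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Impala's actual statestore code: the class `ToyStatestore`, its methods, the `during_update` hook, and the key `host:22002` are all hypothetical names chosen for illustration. The hook stands in for failure detection unregistering the subscriber while its update is still being processed.

```python
class ToyStatestore:
    """Toy model of a membership topic (hypothetical, not Impala's real code)."""

    def __init__(self, apply_fix):
        self.apply_fix = apply_fix
        self.registered = set()  # subscriber ids currently considered alive
        self.membership = {}     # transient entries: key -> owning subscriber id

    def register(self, sub_id):
        self.registered.add(sub_id)

    def unregister(self, sub_id):
        # Failure-detection path: drop the subscriber and its transient entries.
        self.registered.discard(sub_id)
        for key in [k for k, owner in self.membership.items() if owner == sub_id]:
            del self.membership[key]

    def process_update(self, sub_id, keys, during_update=None):
        # Update path: the subscriber is registered when the update arrives.
        if sub_id not in self.registered:
            return
        if during_update is not None:
            during_update()  # simulates the racing failure detector
        added = []
        for key in keys:
            self.membership[key] = sub_id
            added.append(key)
        # Fix: if the subscriber was unregistered in the meantime, delete the
        # transient entries that this update just added.
        if self.apply_fix and sub_id not in self.registered:
            for key in added:
                self.membership.pop(key, None)


def run(apply_fix):
    store = ToyStatestore(apply_fix)
    store.register("impalad-3")
    store.process_update(
        "impalad-3",
        ["host:22002"],
        during_update=lambda: store.unregister("impalad-3"),
    )
    return sorted(store.membership)
```

In this model, `run(False)` leaves the entry for the dead subscriber in the membership map, because the unregistration ran before the entry was added and nothing cleans it up afterwards. `run(True)` returns an empty list, since the post-update check removes the newly added entry.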



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

