[ 
https://issues.apache.org/jira/browse/IMPALA-9425?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17051747#comment-17051747
 ] 

ASF subversion and git services commented on IMPALA-9425:
---------------------------------------------------------

Commit ae0bd674a86f2d7deb4f72a7544fe5f0950ded0b in impala's branch 
refs/heads/master from Thomas Tauber-Marshall
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=ae0bd67 ]

IMPALA-9425 (part 1): Introduce uuids for impalads

This patch introduces the concept of 'backend ids', which are unique
ids that can be used to identify individual impalads. The ids are
generated by each impalad on startup.
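
As a rough illustration of per-startup ids, here is a minimal Python
sketch. The `Impalad` class, its fields, and the host1/22000 address are
hypothetical stand-ins, not Impala's actual C++ code. The use of random
(version 4) uuids is also an assumption; the patch only states that each
impalad generates its own id on startup.

```python
import uuid


class Impalad:
    """Hypothetical stand-in for an impalad process (not Impala's real class)."""

    def __init__(self, host, port):
        # The address survives a restart on the same machine and port.
        self.address = f"{host}:{port}"
        # Assumption: a random uuid, regenerated on every startup.
        self.backend_id = uuid.uuid4()


# Two processes started one after the other at the same host:port.
first = Impalad("host1", 22000)
restarted = Impalad("host1", 22000)
```

The two objects share an address but carry different backend ids, which
is the property the rest of the patch relies on.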

The patch then uses the ids to fix a bug where the statestore may fail
to inform coordinators when an executor has failed and restarted. The
bug was caused by the fact that the statestore cluster membership
topic was keyed on statestore subscriber ids, which are host:port
pairs.

So, if an impalad fails and a new one is started at the same host:port
before a particular coordinator has a cluster membership update
generated for it by the statestore, the statestore has no way of
differentiating the prior impalad from the newly started impalad, and
the topic update will not show the deletion of the original impalad.

With this patch, the cluster membership topic is keyed by backend id.
Because the restarted impalad has a different backend id, the next
membership update after the prior impalad failed is guaranteed to
reflect that failure.
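
The difference between the two keying schemes can be sketched with a
small simulation. This is a simplified model under stated assumptions,
not the statestore's implementation: the topic is modelled as a plain
set of keys, an update is modelled as a set difference against what the
coordinator last saw, and all names (`Impalad`, `membership_delta`,
`simulate_restart`, the host1:22000 address) are hypothetical.

```python
import uuid


class Impalad:
    """Hypothetical stand-in for an impalad process."""

    def __init__(self, address):
        self.address = address
        self.backend_id = str(uuid.uuid4())  # assumed: new id per startup


def membership_delta(last_seen, current):
    """Return (added, deleted) keys relative to the coordinator's last view."""
    return current - last_seen, last_seen - current


def simulate_restart(key_fn):
    """Fail and restart an impalad between two updates to one coordinator."""
    topic = set()
    old = Impalad("host1:22000")
    topic.add(key_fn(old))
    last_seen = set(topic)        # coordinator receives an update here

    topic.discard(key_fn(old))    # the impalad fails...
    new = Impalad("host1:22000")  # ...and a new one starts at the same address
    topic.add(key_fn(new))        # before the next update is generated

    return membership_delta(last_seen, topic)


# Keyed by subscriber id ("impalad@host:port"): the restart is invisible.
by_address = simulate_restart(lambda b: "impalad@" + b.address)

# Keyed by backend id: the old key is deleted and a new key is added.
by_backend_id = simulate_restart(lambda b: b.backend_id)
```

In this model `by_address` contains no additions and no deletions, so
the coordinator never learns of the failure, while `by_backend_id`
contains one added key and one deleted key.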

The patch also logs the backend ids on startup and adds them to the
/backends webui page and to the query locations section of the
/queries page, for use in debugging.

Further patches will apply the backend ids in other places where we
currently key off host:port pairs to identify impalads.

Testing:
- Added an e2e test that uses a new debug action to add delay to
  statestore topic updates. Due to the use of JITTER the test is
  non-deterministic, but it repros the original issue locally for me
  about 50% of the time.
- Passed a full run of existing tests.

Change-Id: Icf8067349ed6b765f6fed830b7140f60738e9061
Reviewed-on: http://gerrit.cloudera.org:8080/15321
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Statestore may fail to report when an impalad has failed
> --------------------------------------------------------
>
>                 Key: IMPALA-9425
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9425
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.4.0
>            Reporter: Thomas Tauber-Marshall
>            Assignee: Thomas Tauber-Marshall
>            Priority: Critical
>
> If an impalad fails and another is quickly started at the same host:port 
> combination, the statestore may fail to report to the coordinators 
> that the impalad went down.
> The reason for this is that in the cluster membership topic, impalads are 
> keyed by their statestore subscriber id, which is "impalad@host:port". If the 
> new impalad registers itself before a topic update has been generated for a 
> particular coordinator, the statestore has no way of knowing that the 
> particular key was deleted and then re-added since the last update.
> The result is that queries that were running on the impalad that failed may 
> not be cancelled by the coordinator until they pass the unresponsive backend 
> timeout, which by default is ~12 minutes.
> I propose as a solution that we add a concept of uuids for impalads, where 
> each impalad will generate its own uuid on startup. This allows us to 
> differentiate between different impalads running at the same host:port 
> combination.
> It can also be used to simplify some logic in the scheduler and 
> ExecutorGroup/ExecutorBlacklist etc. where we currently have data structures 
> containing info about impalads that are keyed off host/port combinations.


