[ https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838117#comment-16838117 ]
ASF subversion and git services commented on IMPALA-7665:
---------------------------------------------------------
Commit e4352aa63f2bdfb0f9e82f8b04567fa6b729af95 in impala's branch
refs/heads/master from Bikramjeet Vig
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=e4352aa ]
IMPALA-7665: Fix unwarranted query cancellation on statestore restart
Currently, if the statestore restarts and disseminates an inconsistent
view of cluster membership to the coordinators, then they might believe
that the backends no longer in the membership update are down and would
start canceling queries that are running or scheduled to run on those
allegedly failed backends. This patch adds a grace period after
statestore recovery/successful registration that gives it enough time
to gather a consistent state of the cluster.
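As a rough illustration of the grace-period idea described above (not the actual Impala implementation, which lives in the C++ backend), the following Python sketch models a coordinator that refuses to treat missing backends as failed until a configurable period has elapsed since its last successful statestore registration. The class name FailedBackendTracker, its methods, and the injectable clock are all hypothetical names introduced for this example.

```python
import time


class FailedBackendTracker:
    """Toy model of a coordinator-side grace period (hypothetical names)."""

    def __init__(self, grace_period_ms, clock=time.monotonic):
        self.grace_period_s = grace_period_ms / 1000.0
        self.clock = clock
        # Time of the most recent successful (re-)registration, if any.
        self.last_registration = None

    def on_statestore_registered(self):
        """Record that the statestore recovered or registration succeeded."""
        self.last_registration = self.clock()

    def in_grace_period(self):
        """True while the membership view may still be inconsistent."""
        if self.last_registration is None:
            return False
        return self.clock() - self.last_registration < self.grace_period_s

    def backends_to_cancel(self, scheduled_backends, membership_update):
        """Return backends whose queries should be cancelled.

        Backends absent from the membership update are only treated as
        failed once the grace period has expired.
        """
        if self.in_grace_period():
            return set()
        return set(scheduled_backends) - set(membership_update)
```

With a 5000 ms grace period, an update that omits a backend 2 seconds after registration yields no cancellations, while the same update 5 seconds after registration reports the backend as failed.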
Testing:
- Added an e2e test.
- Did manual stress testing using concurrent_select.py with
statestore_subscriber_timeout_seconds set to 2 secs and
failed_backends_query_cancellation_grace_period_ms set to 5 seconds,
and the statestore being restarted every 15 seconds. To avoid other
effects of statestore restarts cropping up, I used a local catalog
(catalog v2) and ignored query errors caused by the scheduler having
an incomplete view of the cluster (no backends).
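The manual stress setup above could be driven by a small harness along these lines. This is only a sketch: the two flag names and their values come from the description above, but passing them as "--name=value" startup arguments is an assumption, and cycle_statestore with its stop/start callables is a hypothetical helper, not part of concurrent_select.py or the Impala test tooling.

```python
import time

# Flag names and values are taken from the testing notes; the
# "--name=value" command-line form is an assumption.
STRESS_FLAGS = [
    "--statestore_subscriber_timeout_seconds=2",
    "--failed_backends_query_cancellation_grace_period_ms=5000",
]


def cycle_statestore(stop, start, interval_s=15, rounds=3, sleep=time.sleep):
    """Restart the statestore `rounds` times, once every `interval_s` seconds.

    `stop` and `start` are caller-supplied callables (hypothetical) that
    kill and relaunch the statestore process.
    """
    for _ in range(rounds):
        sleep(interval_s)  # let queries run between restarts
        stop()
        start()
    return rounds
```

The caller supplies whatever commands stop and start the statestore in their environment; the sleep parameter is injectable so the loop can be exercised without real delays.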
Change-Id: I30b68bd8bde4bf589d58d42d6f683afb166de959
Reviewed-on: http://gerrit.cloudera.org:8080/13061
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
> Key: IMPALA-7665
> URL: https://issues.apache.org/jira/browse/IMPALA-7665
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Affects Versions: Impala 3.1.0
> Reporter: Tim Armstrong
> Assignee: Bikramjeet Vig
> Priority: Critical
> Labels: query-lifecycle, statestore
>
> I can reproduce this by running a long-running query then cycling the
> statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q
> "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build
> c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator:
> http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the
> statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $
> /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored
> -log_filename=statestored
> -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1
> -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001,
> tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g.
> "Failed due to unreachable impalad(s): tarmstrong-box:22001"