[
https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766402#comment-16766402
]
Tim Armstrong commented on IMPALA-7665:
---------------------------------------
* Should we add a mechanism to allow detecting that it's a new statestore? E.g.
the statestore generates a unique ID each time it starts up.
* Is there a reason that WAIT_TIME needs to match up with the time taken for
the statestore to detect failure? 30s seems like a safe amount of time in which
to expect a healthy impalad to re-register, but would it need to be increased
if we increased the heartbeat interval?
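To make the first idea concrete, here is a minimal sketch of how a per-startup ID could let a subscriber tell a restarted statestore from the one it already knows. This is only an illustration, not Impala code: the class names, the heartbeat() and on_heartbeat() methods, and the use of a random UUID are all assumptions made for the example.

```python
import uuid
from typing import Optional


class Statestore:
    """Hypothetical statestore that picks a fresh random ID on every startup."""

    def __init__(self) -> None:
        # Assumption: a random UUID is unique enough to tell instances apart.
        self.instance_id = uuid.uuid4().hex

    def heartbeat(self) -> str:
        # Every heartbeat carries the ID of the instance that sent it.
        return self.instance_id


class Subscriber:
    """Hypothetical impalad-side view of the statestore."""

    def __init__(self) -> None:
        self.known_statestore_id: Optional[str] = None
        self.registrations = 0

    def on_heartbeat(self, statestore_id: str) -> bool:
        """Return True if the heartbeat came from an instance not seen before."""
        is_new = statestore_id != self.known_statestore_id
        if is_new:
            # A different ID means the statestore restarted (or this is the
            # first contact), so its membership view starts out empty and
            # should not yet be treated as evidence that peers have failed.
            self.known_statestore_id = statestore_id
            self.registrations += 1
        return is_new


old = Statestore()
sub = Subscriber()
sub.on_heartbeat(old.heartbeat())   # first contact
sub.on_heartbeat(old.heartbeat())   # same instance, nothing to do
new = Statestore()                  # simulated statestore restart
restarted = sub.on_heartbeat(new.heartbeat())
```

With something like this, a coordinator could hold off on treating missing impalads as failed until the new statestore instance has had time to collect registrations.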
The scenario where a statestore and an impalad go down and come back up at the
same time should be rare, and is related to an existing problem with fast
crash/restarts (IMPALA-414). The exception is if we're doing a non-graceful
rolling restart of the statestore and impalads at the same time. Maybe that's
an argument against restarting those services at the same time.
> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
> Key: IMPALA-7665
> URL: https://issues.apache.org/jira/browse/IMPALA-7665
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Affects Versions: Impala 3.1.0
> Reporter: Tim Armstrong
> Priority: Critical
> Labels: query-lifecycle, statestore
>
> I can reproduce this by running a long-running query then cycling the
> statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator: http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the
> statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $ /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored \
>     -log_filename=statestored \
>     -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1 \
>     -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001, tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g.
> "Failed due to unreachable impalad(s): tarmstrong-box:22001"