[
https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766621#comment-16766621
]
Bikramjeet Vig commented on IMPALA-7665:
----------------------------------------
[~tarmstrong]
- I don't think we need to add a mechanism to detect a new statestore unless we
want different behavior when it restarts vs. when it simply loses its connection.
- I think it should be at least (statestore_max_missed_heartbeats
+ 2) * statestore_heartbeat_frequency_ms, which would emulate the statestore's
failure-detection behavior on the impalad. The 2 extra heartbeats are for
impalad_2 to connect to the statestore and for the membership update to
propagate to impalad_1. Another alternative would be to do this and also add a
lower bound on this WAIT_TIME, maybe 10s or 30s as you suggested. That would
also ensure the wait scales if any of the tunable parameters change on the
statestore side, although we'd have to make sure we supply those same startup
parameters to the impalad, which can get complicated if they change on a
statestore restart. Yet another alternative would be a constant wait time of
30s, as you suggested initially.
- I agree that the scenario where a statestore and an impalad go down and come
back up at the same time should be rare, but if it does happen, the running
queries might get stuck indefinitely waiting for fragment updates and we might
end up restarting the whole service. If we document that as a limitation, since
it's an unlikely case, and are OK with it, then we shouldn't let IMPALA-2990
hold us back from implementing this.
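For concreteness, the derived-wait-time-with-lower-bound option above could look
something like this C++ sketch. The function and parameter names are hypothetical
(not actual Impala code or flags), and the values used below are only
illustrative, not a claim about the shipped defaults:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: derive the coordinator's WAIT_TIME from the
// statestore's tunables, with a floor so the wait never drops below a
// sane minimum even if the tunables are configured very small.
int64_t ComputeWaitTimeMs(int64_t max_missed_heartbeats,
                          int64_t heartbeat_frequency_ms,
                          int64_t min_wait_ms) {
  // (statestore_max_missed_heartbeats + 2) * statestore_heartbeat_frequency_ms:
  // +1 heartbeat for impalad_2 to connect to the statestore, and
  // +1 for the membership update to propagate to impalad_1.
  int64_t derived = (max_missed_heartbeats + 2) * heartbeat_frequency_ms;
  return std::max(derived, min_wait_ms);
}
```

With e.g. 10 missed heartbeats at a 1000 ms frequency, the derived wait is
12000 ms; a 30s floor would raise that to 30000 ms, while a 10s floor would
leave the derived value in effect.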
> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
> Key: IMPALA-7665
> URL: https://issues.apache.org/jira/browse/IMPALA-7665
> Project: IMPALA
> Issue Type: Bug
> Components: Distributed Exec
> Affects Versions: Impala 3.1.0
> Reporter: Tim Armstrong
> Priority: Critical
> Labels: query-lifecycle, statestore
>
> I can reproduce this by running a long-running query then cycling the
> statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build
> c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator:
> http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the
> statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $
> /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored
> -log_filename=statestored
> -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1
> -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001,
> tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g.
> "Failed due to unreachable impalad(s): tarmstrong-box:22001"
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)