[ https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766621#comment-16766621 ]

Bikramjeet Vig commented on IMPALA-7665:
----------------------------------------

[~tarmstrong]
- I don't think we need to add a mechanism to detect a new statestore unless we 
want different behaviour when it restarts versus when it simply loses its 
connection.
- I think it should be at least (statestore_max_missed_heartbeats + 2) * 
statestore_heartbeat_frequency_ms, which would emulate the statestore's 
failure-detection behavior on the impalad. The two extra heartbeat periods are 
there for impalad_2 to reconnect to the statestore, and for the membership 
update to propagate to impalad_1. An alternative would be to do this and also 
add a lower bound on this WAIT_TIME, maybe 10s or 30s as you suggested. That 
would ensure the wait scales if any of the tunable parameters change on the 
statestore side, although we would have to make sure the same startup 
parameters are supplied to the impalad, which can get complicated if they 
change on a statestore restart. Yet another alternative would be a constant 
wait time of 30s, as you suggested initially.
- I agree that the scenario where a statestore and an impalad go down and come 
back up at the same time should be rare, but if it does happen, running queries 
might get stuck indefinitely waiting for fragment updates, and we might end up 
restarting the whole service. If we document that as a limitation, since it's 
an unlikely case, and are OK with it, then we shouldn't let IMPALA-2990 hold us 
back from implementing this.
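The wait-time proposal above, combined with the suggested lower bound, could be sketched as follows. This is a hypothetical illustration, not Impala code; the function name is made up, and the two flag values mirror the statestore startup flags discussed above (statestore_max_missed_heartbeats and statestore_heartbeat_frequency_ms):

```python
# Hypothetical sketch of the proposed WAIT_TIME calculation; names and the
# 30s floor are assumptions for illustration, not actual Impala code.
def statestore_recovery_wait_ms(max_missed_heartbeats: int,
                                heartbeat_frequency_ms: int,
                                lower_bound_ms: int = 30_000) -> int:
    """Wait long enough for the statestore to declare an impalad failed
    (max_missed_heartbeats periods), plus two extra heartbeat periods:
    one for a restarted impalad to reconnect to the statestore, and one
    for the membership update to propagate to the other impalads. A
    constant lower bound guards against very small heartbeat settings."""
    scaled = (max_missed_heartbeats + 2) * heartbeat_frequency_ms
    return max(scaled, lower_bound_ms)

# E.g. with 10 missed heartbeats at a 1000 ms frequency, the scaled value
# is 12000 ms, so the 30s floor would apply.
print(statestore_recovery_wait_ms(10, 1000))
```

The drawback noted above still applies: the coordinator only computes the right value if it is started with the same heartbeat flags as the statestore.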

> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
>                 Key: IMPALA-7665
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7665
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Priority: Critical
>              Labels: query-lifecycle, statestore
>
> I can reproduce this by running a long-running query then cycling the 
> statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q 
> "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build 
> c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator: 
> http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the 
> statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $ 
> /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored
>  -log_filename=statestored 
> -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1 
> -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001, 
> tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g. 
> "Failed due to unreachable impalad(s): tarmstrong-box:22001"



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
