[ https://issues.apache.org/jira/browse/IMPALA-7665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16766372#comment-16766372 ]

Bikramjeet Vig commented on IMPALA-7665:
----------------------------------------

A possible solution:

Use the same time interval in Impala that the StateStore (SS) uses for missed 
heartbeats before declaring an impalaD down. The series of events would be as 
follows:

- impalaD loses its connection to the SS and enters recovery mode
- it continues to function based on whatever stale data it has
- when it re-establishes the connection to the SS, it does not know whether the 
SS just lost connectivity or was restarted, so it follows the same recovery 
procedure in either case
- it notices that impalaD_2 is missing from the membership update
- it waits for the same interval the SS waits on failed heartbeats before 
declaring an impalaD dead; let's call that time frame WAIT_TIME
- After WAIT_TIME has elapsed:
  => if impalaD_2 re-appears in the membership update, everything continues 
unhindered
  => if impalaD_2 does not re-appear, it starts the cancellation process for all 
queries currently running or scheduled to run on impalaD_2
- What happens if impalaD_2 restarted while the SS was down and re-appeared in 
the membership update within WAIT_TIME of the SS coming back up?
  => impalaD would assume that impalaD_2 never went down and would keep working 
normally.
  => At this point we need IMPALA-2990 for the queries initially running on 
impalaD_2 to realize that it is unresponsive (since it restarted and the 
fragments are no longer running on it) and cancel themselves

> Bringing up stopped statestore causes queries to fail
> -----------------------------------------------------
>
>                 Key: IMPALA-7665
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7665
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Distributed Exec
>    Affects Versions: Impala 3.1.0
>            Reporter: Tim Armstrong
>            Priority: Critical
>              Labels: query-lifecycle, statestore
>
> I can reproduce this by running a long-running query then cycling the 
> statestore:
> {noformat}
> tarmstrong@tarmstrong-box:~/Impala/incubator-impala$ impala-shell.sh -q 
> "select distinct * from tpch10_parquet.lineitem"
> Starting Impala Shell without Kerberos authentication
> Connected to localhost:21000
> Server version: impalad version 3.1.0-SNAPSHOT DEBUG (build 
> c486fb9ea4330e1008fa9b7ceaa60492e43ee120)
> Query: select distinct * from tpch10_parquet.lineitem
> Query submitted at: 2018-10-04 17:06:48 (Coordinator: 
> http://tarmstrong-box:25000)
> {noformat}
> If I kill the statestore, the query runs fine, but if I start up the 
> statestore again, it fails.
> {noformat}
> # In one terminal, start up the statestore
> $ 
> /home/tarmstrong/Impala/incubator-impala/be/build/latest/statestore/statestored
>  -log_filename=statestored 
> -log_dir=/home/tarmstrong/Impala/incubator-impala/logs/cluster -v=1 
> -logbufsecs=5 -max_log_files=10
> # The running query then fails
> WARNINGS: Failed due to unreachable impalad(s): tarmstrong-box:22001, 
> tarmstrong-box:22002
> {noformat}
> Note that I've seen different subsets of impalads reported as failed, e.g. 
> "Failed due to unreachable impalad(s): tarmstrong-box:22001"



