[
https://issues.apache.org/jira/browse/IMPALA-12382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17758687#comment-17758687
]
Wenzhe Zhou edited comment on IMPALA-12382 at 8/25/23 5:48 PM:
---------------------------------------------------------------
If the executor is removed from the cluster membership by the statestore when
it receives an un-registering request, running queries could be affected:
coordinators cancel queries that are running on failed executors (as evidenced
by their absence from the membership list). See
[ImpalaServer::CancelQueriesOnFailedBackends()|https://github.com/apache/impala/blob/master/be/src/service/impala-server.cc#L2365-L2375].
It seems we already have a
[mechanism|https://github.com/apache/impala/blob/master/be/src/service/impala-server.h#L124-L126]
to avoid scheduling new tasks on executors that are shutting down, by marking
the executor as "quiescing". The BackendDescriptorPB of a "quiescing" executor
is propagated to coordinators in cluster membership updates, and quiescing
executors are [removed from executor
groups|https://github.com/apache/impala/blob/master/be/src/scheduling/cluster-membership-mgr.cc#L312-L329]
so that coordinators do not schedule new fragments on them.
[~arawat] Do you see log messages like "Removing backend xxx from group yyy
(quiescing)" on the coordinator?
> Coordinator could schedule fragments on gracefully shutdown executors
> ---------------------------------------------------------------------
>
> Key: IMPALA-12382
> URL: https://issues.apache.org/jira/browse/IMPALA-12382
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Abhishek Rawat
> Assignee: Wenzhe Zhou
> Priority: Critical
>
> The statestore does failure detection based on consecutive heartbeat failures.
> By default this is 10 missed heartbeats (statestore_max_missed_heartbeats) at
> 1 second intervals (statestore_heartbeat_frequency_ms). Overall detection can
> however take much longer than 10 seconds, especially if the statestore is busy
> or each failed attempt has to wait out the RPC timeout.
> In the following example it took 50 seconds for failure detection:
> {code:java}
> I0817 12:32:06.824721 86 statestore.cc:1157] Unable to send heartbeat
> message to subscriber
> impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
> received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected
> exception: No more data to read., type:
> N6apache6thrift9transport19TTransportExceptionE, rpc:
> N6impala18THeartbeatResponseE, send: done
> I0817 12:32:06.824741 86 failure-detector.cc:91] 1 consecutive heartbeats
> failed for
> 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
> State is OK
> .....
> .....
> .....
> I0817 12:32:56.800251 83 statestore.cc:1157] Unable to send heartbeat
> message to subscriber
> impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010,
> received error: RPC Error: Client for 10.80.199.159:23000 hit an unexpected
> exception: No more data to read., type:
> N6apache6thrift9transport19TTransportExceptionE, rpc:
> N6impala18THeartbeatResponseE, send: done
> I0817 12:32:56.800267 83 failure-detector.cc:91] 10 consecutive heartbeats
> failed for
> 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'.
> State is FAILED
> I0817 12:32:56.800276 83 statestore.cc:1168] Subscriber
> 'impa...@impala-executor-001-5.impala-executor.impala-1692115218-htqx.svc.cluster.local:27010'
> has failed, disconnected or re-registered (last known registration ID:
> c84bf70f03acda2b:b34a812c5e96e687){code}
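> The roughly 50 second gap above is consistent with each failed attempt costing
> about one heartbeat interval plus the RPC timeout before the next try. A
> minimal sketch of this consecutive-miss accounting (hypothetical code, not
> Impala's actual FailureDetector class):
> {code:cpp}
> #include <iostream>
>
> // Hypothetical model of statestore failure detection: a subscriber is only
> // declared FAILED after statestore_max_missed_heartbeats consecutive misses.
> struct FailureDetector {
>   int max_missed_heartbeats = 10;  // statestore_max_missed_heartbeats
>   int consecutive_failures = 0;
>
>   // Returns true once the subscriber should be treated as FAILED.
>   bool OnHeartbeatResult(bool ok) {
>     consecutive_failures = ok ? 0 : consecutive_failures + 1;
>     return consecutive_failures >= max_missed_heartbeats;
>   }
> };
>
> int main() {
>   FailureDetector fd;
>   // Worst case per attempt: heartbeat interval (1s) plus the RPC timeout,
>   // which is how 10 misses can add up to ~50s of wall-clock time.
>   int attempts = 0;
>   while (!fd.OnHeartbeatResult(/*ok=*/false)) ++attempts;
>   std::cout << "declared FAILED after " << attempts + 1 << " missed heartbeats\n";
>   return 0;
> }
> {code}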
> As a result, there is a window, while the statestore is still determining node
> failure, during which the coordinator might schedule fragments on the affected
> executor(s). The Exec RPC will fail, and if transparent query retry is enabled,
> the coordinator will immediately retry the query; since the failed executor
> still appears healthy in the membership, the retry will fail again.
> Ideally, in such situations the coordinator should be notified sooner about a
> failed executor. The statestore could send a priority topic update to
> coordinators when it enters its failure detection logic. This should reduce the
> chances of the coordinator scheduling query fragments on a failed executor.
> The other option would be to tune the heartbeat frequency and interval
> parameters. But it is hard to find a configuration that works for all cases;
> while the default values are reasonable, under certain conditions they can
> behave unreasonably, as seen in the example above.
> It might make sense to specially handle the case where executors are shut down
> gracefully; in that case the statestore shouldn't wait on heartbeat-based
> failure detection and should instead fail these executors immediately.
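> A minimal sketch of that idea (hypothetical, simplified types; MembershipTracker
> and OnGracefulShutdown are not actual statestore code): on an explicit
> graceful-shutdown notice, skip the heartbeat-based detector, mark the executor
> failed immediately, and push a priority membership update to coordinators:
> {code:cpp}
> #include <functional>
> #include <iostream>
> #include <string>
> #include <unordered_set>
>
> // Hypothetical statestore-side handling: bypass heartbeat-based failure
> // detection for executors that announce a graceful shutdown.
> class MembershipTracker {
>  public:
>   explicit MembershipTracker(std::function<void(const std::string&)> notify)
>       : notify_coordinators_(std::move(notify)) {}
>
>   // Called when an executor reports it is shutting down gracefully.
>   void OnGracefulShutdown(const std::string& backend) {
>     failed_.insert(backend);        // mark failed immediately, no 10-miss wait
>     notify_coordinators_(backend);  // priority topic update to coordinators
>   }
>
>   bool IsFailed(const std::string& backend) const {
>     return failed_.count(backend) > 0;
>   }
>
>  private:
>   std::unordered_set<std::string> failed_;
>   std::function<void(const std::string&)> notify_coordinators_;
> };
>
> int main() {
>   MembershipTracker tracker([](const std::string& be) {
>     std::cout << "priority update: remove " << be << " from membership\n";
>   });
>   tracker.OnGracefulShutdown("impala-executor-001-5:27010");
>   std::cout << std::boolalpha
>             << tracker.IsFailed("impala-executor-001-5:27010") << "\n";
>   return 0;
> }
> {code}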