[ https://issues.apache.org/jira/browse/IMPALA-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wenzhe Zhou updated IMPALA-10476:
---------------------------------
Description:
If an executor node frequently gets disk IO failures when reading from or
writing to local disk, it should report its unhealthy state to the statestore
so that the node can be marked as down and removed from the executor group,
avoiding repeated query failures in the cluster. This provides a mechanism for
an executor node to remove itself from scheduling.
The two major components of Impala that read from and write to local disk are
the spill-to-disk and data caching features. We need to add stats that count
such local disk failures over a period of time, such as the last x seconds,
then use these stats to determine whether a node is healthy enough to execute
query fragment instances.
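As a rough illustration of the windowed failure stats described above, here is
a minimal sketch in Python. It is not Impala's implementation: the class name
DiskFailureWindow, the parameters window_secs and max_failures, and the rule
"healthy while the count stays at or below the threshold" are all assumptions
made for the example.

```python
import time
from collections import deque


class DiskFailureWindow:
    """Counts local disk IO failures seen in the last `window_secs` seconds.

    Hypothetical helper for illustration only; names and the threshold
    policy are assumptions, not part of Impala.
    """

    def __init__(self, window_secs, max_failures, clock=time.monotonic):
        self.window_secs = window_secs
        self.max_failures = max_failures
        self._clock = clock
        self._failures = deque()  # failure timestamps, oldest first

    def _expire(self, now):
        # Drop failures that have aged out of the window.
        while self._failures and now - self._failures[0] >= self.window_secs:
            self._failures.popleft()

    def record_failure(self):
        # Called by a reader/writer (e.g. spilling or caching) on an IO error.
        now = self._clock()
        self._expire(now)
        self._failures.append(now)

    def failure_count(self):
        self._expire(self._clock())
        return len(self._failures)

    def is_healthy(self):
        # Assumed policy: healthy while the windowed count is within the limit.
        return self.failure_count() <= self.max_failures
```

The injectable clock is only there so the window can be exercised without
sleeping. A node could poll is_healthy() periodically and report the result
to the statestore.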
The health state of an executor node should be shown on the debug WebUI. We
should also allow users to override the node's health state. The node will
start registering itself with the statestore again once its health state is
overridden.
was:
If an executor repeatedly gets disk IO failures when reading/writing local
disk, it should report its unhealthy state to statestore so that we could mark
the node as down and remove it from executor group to avoid repeated query
failures in the cluster. This provides a mechanism for executor node to remove
itself from scheduling.
The two main components of Impala that read/write from local disk are the
spill-to-disk and data caching features. We need to add stats to count local
disk failures.
The node healthy state should be shown on the debug WebUI. We should also allow
users to overwrite the node healthy state.
> Remove executor node with faulty disks from executor group
> ----------------------------------------------------------
>
> Key: IMPALA-10476
> URL: https://issues.apache.org/jira/browse/IMPALA-10476
> Project: IMPALA
> Issue Type: Sub-task
> Components: Distributed Exec
> Reporter: Wenzhe Zhou
> Assignee: Wenzhe Zhou
> Priority: Major
>
> If an executor node frequently gets disk IO failures when reading from or
> writing to local disk, it should report its unhealthy state to the
> statestore so that the node can be marked as down and removed from the
> executor group, avoiding repeated query failures in the cluster. This
> provides a mechanism for an executor node to remove itself from scheduling.
> The two major components of Impala that read from and write to local disk
> are the spill-to-disk and data caching features. We need to add stats that
> count such local disk failures over a period of time, such as the last x
> seconds, then use these stats to determine whether a node is healthy enough
> to execute query fragment instances.
> The health state of an executor node should be shown on the debug WebUI. We
> should also allow users to override the node's health state. The node will
> start registering itself with the statestore again once its health state is
> overridden.