[ 
https://issues.apache.org/jira/browse/IMPALA-10476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wenzhe Zhou updated IMPALA-10476:
---------------------------------
    Description: 
If an executor node frequently hits disk I/O failures when reading from or 
writing to local disk, it should report its unhealthy state to the statestore 
so that the node can be marked as down and removed from its executor group, 
avoiding repeated query failures in the cluster. This gives an executor node a 
mechanism to remove itself from scheduling.

The two major components of Impala that read from and write to local disk are 
the spill-to-disk and data caching features. We need to add stats that count 
such local disk failures over a period of time, such as the last x seconds, and 
then use these stats to decide whether a node is healthy enough to execute 
query fragment instances.
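
As a rough illustration of the "failures over the last x seconds" idea, here 
is a minimal sketch of a sliding-window failure counter. It is not Impala 
code: the class name DiskFailureWindow, the parameters window_secs and 
max_failures, and the threshold rule are all assumptions made for this 
example, and the real backend is written in C++.

```python
import time
from collections import deque


class DiskFailureWindow:
    """Counts local disk I/O failures seen in the last `window_secs` seconds.

    Hypothetical helper for illustration only; names and thresholds are
    assumptions, not part of Impala.
    """

    def __init__(self, window_secs, max_failures, clock=time.monotonic):
        self.window_secs = window_secs
        self.max_failures = max_failures
        self._clock = clock          # injectable so the window can be tested
        self._failures = deque()     # failure timestamps, oldest first

    def _expire(self, now):
        # Drop failures that are older than the window.
        while self._failures and now - self._failures[0] > self.window_secs:
            self._failures.popleft()

    def record_failure(self):
        # Called by a reader/writer (e.g. spilling or caching) on an I/O error.
        now = self._clock()
        self._expire(now)
        self._failures.append(now)

    def failure_count(self):
        self._expire(self._clock())
        return len(self._failures)

    def is_healthy(self):
        # Assumed rule: healthy while recent failures stay at or below the cap.
        return self.failure_count() <= self.max_failures
```

In this sketch, a node would call record_failure() from its spill-to-disk and 
data cache I/O paths and consult is_healthy() before reporting its state. Old 
failures age out on their own, so a node that stops failing becomes healthy 
again once the window has passed.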

The health state of an executor node should be shown on the debug WebUI. We 
should also allow users to override the node's health state. The node will 
restart to register itself with the statestore once its health state is 
overridden.

  was:
If an executor repeatedly get disk IO failures when read/write local disk, it 
should report its unhealthy state to statestore so that we could mark the node 
as down and remove it from executor group to avoid repeated query failures in 
the cluster. This provide a mechanism for executor node to remove itself from 
scheduling.

The two main components of Impala that read / write from local disk are the 
spill-to-disk and data caching features. We need to to add stats to count local 
disk failures.

The node healthy state should be shown on the debug WebUI. We also should allow 
user to overwrite the node healthy state.

 

 


> Remove executor node with faulty disks from executor group
> ----------------------------------------------------------
>
>                 Key: IMPALA-10476
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10476
>             Project: IMPALA
>          Issue Type: Sub-task
>          Components: Distributed Exec
>            Reporter: Wenzhe Zhou
>            Assignee: Wenzhe Zhou
>            Priority: Major
>
> If an executor node frequently hits disk I/O failures when reading from or 
> writing to local disk, it should report its unhealthy state to the 
> statestore so that the node can be marked as down and removed from its 
> executor group, avoiding repeated query failures in the cluster. This gives 
> an executor node a mechanism to remove itself from scheduling.
>
> The two major components of Impala that read from and write to local disk 
> are the spill-to-disk and data caching features. We need to add stats that 
> count such local disk failures over a period of time, such as the last x 
> seconds, and then use these stats to decide whether a node is healthy enough 
> to execute query fragment instances.
>
> The health state of an executor node should be shown on the debug WebUI. We 
> should also allow users to override the node's health state. The node will 
> restart to register itself with the statestore once its health state is 
> overridden.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
