[
https://issues.apache.org/jira/browse/SPARK-56952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated SPARK-56952:
-----------------------------------
Labels: pull-request-available (was: )
> Preserve executor heartbeat timeout loss reason when executor removal is
> reported as ExecutorKilled
> ---------------------------------------------------------------------------------------------------
>
> Key: SPARK-56952
> URL: https://issues.apache.org/jira/browse/SPARK-56952
> Project: Spark
> Issue Type: Improvement
> Components: Kubernetes
> Affects Versions: 5.0.0
> Reporter: Chao Sun
> Priority: Major
> Labels: pull-request-available
>
> When Spark expires an executor due to heartbeat timeout, `HeartbeatReceiver`
> creates a specific loss reason:
> {code:java}
> ExecutorProcessLost("Executor heartbeat timed out ...")
> {code}
> However, for coarse-grained backends, the executor removal path can later
> report the executor as `ExecutorKilled`. In that case, the more specific
> heartbeat-timeout reason is lost and Spark surfaces only the generic backend
> reason.
> Dropping the reason discards useful failure context, and downstream handling
> or diagnostics may treat the removal as a generic kill rather than as the
> heartbeat timeout that the driver originally detected.
> The issue is especially visible in flows where Spark requests executor
> replacement after heartbeat expiry, while the backend later confirms the
> removal with a generic `ExecutorKilled` reason.
> We should preserve the original heartbeat-timeout loss reason across the
> kill-and-remove flow when the backend provides only `ExecutorKilled`, while
> still respecting any concrete backend-provided loss reason when one exists.
> Proposed behavior:
> - Carry the heartbeat-timeout `ExecutorProcessLost` reason through executor
> replacement.
> - Use it only when the backend reports generic `ExecutorKilled`.
> - Do not override more specific backend reasons such as `ExecutorExited`.
> - Clear any pending preserved loss reason if the kill request is rejected or
> fails.
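
The proposed behavior above can be sketched as a small bookkeeping class. This is only an illustration under stated assumptions, not Spark's implementation: the classes below are simplified stand-ins for Spark's loss-reason types, and `PreservedLossReasons`, `onKillRequested`, `onKillFailed`, and `resolve` are hypothetical names that do not come from the issue or from Spark.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified, hypothetical stand-ins for Spark's loss-reason types.
abstract class LossReason {
    final String message;
    LossReason(String message) { this.message = message; }
}

final class ExecutorKilled extends LossReason {
    ExecutorKilled() { super("Executor killed by driver."); }
}

final class ExecutorExited extends LossReason {
    final int exitCode;
    ExecutorExited(int exitCode, String message) {
        super(message);
        this.exitCode = exitCode;
    }
}

final class ExecutorProcessLost extends LossReason {
    ExecutorProcessLost(String message) { super(message); }
}

// Hypothetical helper that carries a driver-side loss reason across the
// kill-and-remove flow.
final class PreservedLossReasons {
    private final Map<String, LossReason> pending = new HashMap<>();

    // Record the driver-side reason when a kill/replace is requested
    // after heartbeat expiry.
    synchronized void onKillRequested(String executorId, LossReason reason) {
        pending.put(executorId, reason);
    }

    // Clear the pending reason if the kill request is rejected or fails.
    synchronized void onKillFailed(String executorId) {
        pending.remove(executorId);
    }

    // Pick the reason to surface when the backend confirms the removal.
    // The preserved reason is used only when the backend reports the
    // generic ExecutorKilled; any other backend reason is kept as-is.
    synchronized LossReason resolve(String executorId, LossReason backendReason) {
        LossReason preserved = pending.remove(executorId);
        if (preserved != null && backendReason instanceof ExecutorKilled) {
            return preserved;
        }
        return backendReason;
    }
}
```

In this sketch `resolve` always removes the pending entry, so a preserved reason is consumed at most once per executor and cannot leak into a later, unrelated removal of the same executor ID.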