amr-noureldin opened a new issue #12396:
URL: https://github.com/apache/airflow/issues/12396
**Apache Airflow version**: 1.10.12
**Kubernetes version (if you are using kubernetes)** (use `kubectl
version`): Server Version: version.Info{Major:"1", Minor:"11+",
GitVersion:"v1.11.0+d4cacc0", GitCommit:"d4cacc0", GitTreeState:"clean",
BuildDate:"2020-07-16T18:50:14Z", GoVersion:"go1.10.8", Compiler:"gc",
Platform:"linux/amd64"}
**Environment**: Airflow, running on top of Kubernetes - RedHat OpenShift
- **Cloud provider or hardware configuration**:
- **OS** (e.g. from /etc/os-release): From the Airflow containers: fedora:28
- **Kernel** (e.g. `uname -a`): Linux airflow-scheduler-1-xzx5j
3.10.0-1127.18.2.el7.x86_64 #1 SMP Mon Jul 20 22:32:16 UTC 2020 x86_64 x86_64
x86_64 GNU/Linux
- **Install tools**:
- **Others**:
**What happened**:
- A running Airflow task concludes successfully:
`[2020-11-17 08:08:30,301] {local_task_job.py:102} INFO - Task exited with
return code 0`
- The scheduler logs show the following a few seconds later:
```
[2020-11-17 08:08:41,932] {logging_mixin.py:112} INFO - [2020-11-17
08:08:41,932] {dagbag.py:357} INFO - Marked zombie job <TaskInstance:
raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00
[failed]> as failed
[2020-11-17 08:08:48,889] {logging_mixin.py:112} INFO - [2020-11-17
08:08:48,889] {dagbag.py:357} INFO - Marked zombie job <TaskInstance:
raas_mpad_acc_1.prepare_reprocessing_srr_mid 2020-11-17 08:07:46.496180+00:00
[failed]> as failed
```
- From the Airflow database, I can extract the latest_heartbeat:
```
airflow=# SELECT latest_heartbeat FROM job WHERE id = 12;
latest_heartbeat
-------------------------------
2020-11-17 08:08:30.292813+00
(1 row)
```
- Our airflow.cfg has the following configuration:
`scheduler_zombie_task_threshold = 300`
- We are experiencing this **randomly** ever since we upgraded to Airflow
1.10.12. Before, we were running 1.10.10 and did not notice such odd behavior.
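To make the timing concrete, here is a small sketch that compares the heartbeat age at the moment of the first "Marked zombie job" log line against our configured threshold. It assumes the scheduler log timestamps are in UTC, like the `latest_heartbeat` column; the variable names are ours and are for illustration only.

```python
from datetime import datetime, timedelta, timezone

# Values copied from the database query and the scheduler log above.
# Assumption: the scheduler log timestamp is in UTC.
latest_heartbeat = datetime(2020, 11, 17, 8, 8, 30, 292813, tzinfo=timezone.utc)
marked_zombie_at = datetime(2020, 11, 17, 8, 8, 41, 932000, tzinfo=timezone.utc)

# scheduler_zombie_task_threshold = 300 (seconds) from our airflow.cfg
threshold = timedelta(seconds=300)

heartbeat_age = marked_zombie_at - latest_heartbeat
heartbeat_stale = heartbeat_age > threshold

print(f"heartbeat age: {heartbeat_age.total_seconds():.3f}s")
print(f"older than threshold: {heartbeat_stale}")
```

Under that assumption the heartbeat was roughly 12 seconds old when the task was marked as a zombie, far below the 300-second threshold, so a stale heartbeat alone does not seem to explain the behavior.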
**What you expected to happen**:
We are surprised that, after a task has concluded successfully, zombie
detection identifies it as a zombie and sets it to a failed state. We do not
have evidence of what the task state was before it was identified as a zombie.
A plausible scenario (pure speculation):
1. The scheduler identifies the list of running tasks.
2. The task finishes successfully, so its process exits (as expected).
3. The scheduler performs the zombie check on the list of tasks from step 1. It
determines that the process is gone (because of step 2), marks the task as a
zombie, and resets its status from success to failed/up for retry?
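The speculated race can be sketched with a toy model. This is not Airflow code: `ToyTask`, `snapshot_running`, `mark_zombies_naive`, and `mark_zombies_rechecked` are hypothetical names that only illustrate the three steps of the scenario, plus one possible guard (re-reading the task state before marking it failed).

```python
from dataclasses import dataclass


@dataclass
class ToyTask:
    """Hypothetical stand-in for a task instance and its process."""
    state: str = "running"
    process_alive: bool = True


def snapshot_running(tasks):
    """Step 1: remember which tasks are running right now."""
    return [task_id for task_id, task in tasks.items() if task.state == "running"]


def finish_successfully(task):
    """Step 2: the task succeeds and its process exits."""
    task.state = "success"
    task.process_alive = False


def mark_zombies_naive(tasks, running_ids):
    """Step 3 as speculated: trust the stale snapshot, check only the process."""
    for task_id in running_ids:
        if not tasks[task_id].process_alive:
            tasks[task_id].state = "failed"


def mark_zombies_rechecked(tasks, running_ids):
    """Possible guard: re-read the current state before marking as failed."""
    for task_id in running_ids:
        task = tasks[task_id]
        if task.state == "running" and not task.process_alive:
            task.state = "failed"


def run_scenario(mark_zombies):
    """Replay steps 1 to 3 in order and return the final task state."""
    tasks = {"prepare_reprocessing_srr_mid": ToyTask()}
    running_ids = snapshot_running(tasks)
    finish_successfully(tasks["prepare_reprocessing_srr_mid"])
    mark_zombies(tasks, running_ids)
    return tasks["prepare_reprocessing_srr_mid"].state


print(run_scenario(mark_zombies_naive))
print(run_scenario(mark_zombies_rechecked))
```

In this toy model the naive check overwrites a successful task with `failed`, which matches what we observe, while the rechecked variant leaves it as `success`. Whether the scheduler actually behaves like the naive variant is exactly what we cannot confirm.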
**How to reproduce it**:
- The issue occurs sporadically, which makes it challenging to reproduce
deterministically.
- We have encountered it with at least SparkSubmitOperator and PythonOperator.
**Anything else we need to know**:
It occurs sporadically. We also saw the following scenario:
- Task concluded successfully based on the log message and exit code of the
task
- Task was retried afterwards, because scheduler identified it as a zombie
and marked the task as up_for_retry
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]