peay commented on issue #16573:
URL: https://github.com/apache/airflow/issues/16573#issuecomment-887467242
After more investigation, I was able to get to the bottom of it. I had
missed the actual SQLAlchemy queries that update the task instance and
task-fail records: they are issued not by the scheduler, web server, or
worker, but by the DAG processor (as part of file-level callbacks), whose
logs are only available locally in my setup.
In my case, the scheduler thinks the job is a zombie. I had also missed that,
as the log message is at `INFO` level and includes neither the task ID nor the
DAG ID. Maybe the string representation of the task instance should be used to
provide more context and make searching the logs easier? (A sketch of what that
could look like follows the excerpt below.)
```
[2021-07-27 09:54:44,249] {dag_processing.py:1156} INFO - Detected zombie job: {'full_filepath': '/opt/dags/dags.py', 'msg': 'Detected as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f7362fe5510>, 'is_failure_callback': True}
```
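For example, a hypothetical sketch of a more searchable message (`dag_id`, `task_id`, and `execution_date` are attributes `SimpleTaskInstance` exposes as far as I can tell; `request`, `log_zombie`, and the exact wording are made up for illustration):
```python
import logging

log = logging.getLogger(__name__)

def log_zombie(request) -> None:
    # `request` stands in for the callback request object carrying the
    # SimpleTaskInstance shown in the log excerpt above.
    ti = request.simple_task_instance
    log.info(
        "Detected zombie job: dag_id=%s task_id=%s execution_date=%s file=%s",
        ti.dag_id,
        ti.task_id,
        ti.execution_date,
        request.full_filepath,
    )
```
That would make it possible to grep the DAG processor logs by DAG or task ID directly.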
The scheduler is marking the job as `failed` in
[`adopt_or_reset_orphaned_tasks`](https://github.com/apache/airflow/blob/2.1.1/airflow/jobs/scheduler_job.py#L1831),
which marks as `failed` every running job whose latest heartbeat is older than
`scheduler_health_check_threshold` (30s).
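Boiled down, the check amounts to something like this (a minimal sketch of the cutoff logic, not a verbatim excerpt; I query `BaseJob` here because the captured SQL further down has no `job_type` filter, and the hard-coded 30s stands in for the config value):
```python
from datetime import timedelta

from airflow.jobs.base_job import BaseJob
from airflow.utils import timezone
from airflow.utils.session import create_session
from airflow.utils.state import State

# Any "running" row in the job table whose persisted heartbeat is older
# than the cutoff gets marked failed.
cutoff = timezone.utcnow() - timedelta(seconds=30)  # scheduler_health_check_threshold

with create_session() as session:
    session.query(BaseJob).filter(
        BaseJob.state == State.RUNNING,
        BaseJob.latest_heartbeat < cutoff,
    ).update({BaseJob.state: State.FAILED}, synchronize_session=False)
```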
This is apparently caused by a combination of two facts:
- I had set `job_heartbeat_sec` to 30s to reduce pressure on the database, as
my jobs are long-running.
- **Whenever the job heartbeats, it sets `latest_heartbeat` to a timestamp 30s
in the past, as shown in the following database logs.** I am not sure whether
this is on purpose or a bug, but it certainly looks suspicious.
```sql
2021-07-27 09:53:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:52:51.398617+00:00'::timestamptz WHERE job.id = 4451
2021-07-27 09:53:51 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:21.480933+00:00'::timestamptz WHERE job.id = 4451
2021-07-27 09:54:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:51.566774+00:00'::timestamptz WHERE job.id = 4451
```
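One SQLAlchemy pitfall that would produce exactly this one-interval lag (a hypothesis on my part, not a verified reading of `BaseJob.heartbeat`): mutating the object *after* it has been merged into the session. `session.merge()` returns the session-tracked copy, so later changes to the original detached object never reach the database.
```python
from airflow.jobs.base_job import BaseJob
from airflow.utils import timezone
from airflow.utils.session import create_session

def heartbeat_with_pitfall(job: BaseJob) -> None:
    # Illustration only, not actual Airflow code.
    with create_session() as session:
        session.merge(job)                        # tracked copy keeps the stale
                                                  # latest_heartbeat from `job`
        job.latest_heartbeat = timezone.utcnow()  # mutates the detached original,
                                                  # which the session ignores
        session.commit()                          # flushes the stale timestamp
```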
Returning to the example above, `adopt_or_reset_orphaned_tasks` then ran 15s
after the last heartbeat and did:
```sql
2021-07-27 09:54:36 UTC  UPDATE job SET state='failed' WHERE job.state = 'running' AND job.latest_heartbeat < '2021-07-27T09:54:06.555688+00:00'::timestamptz
```
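Putting the numbers together (all values copied from the SQL above):
```python
from datetime import datetime, timedelta, timezone

persisted_heartbeat = datetime(2021, 7, 27, 9, 53, 51, tzinfo=timezone.utc)  # written at 09:54:21
check_time = datetime(2021, 7, 27, 9, 54, 36, tzinfo=timezone.utc)
cutoff = check_time - timedelta(seconds=30)  # scheduler_health_check_threshold

# The job heartbeated only 15s before the check, but the value it persisted
# is a full interval older, so it falls past the cutoff and the job fails.
assert persisted_heartbeat < cutoff  # 09:53:51 < 09:54:06
```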
From my limited understanding, there seem to be two issues:
- `latest_heartbeat` is set to a timestamp 30s in the past instead of the
current time, as I would expect.
- Although setting `job_heartbeat_sec=30s` makes things worse by increasing
the odds of this occurring, it seems like this can happen as soon as
`job_heartbeat_sec > 15s`: with the one-interval lag, the persisted heartbeat
can be up to `2 * job_heartbeat_sec` old just before the next write, which
exceeds the 30s threshold whenever `job_heartbeat_sec > 15s`. More generally,
it seems odd that `adopt_or_reset_orphaned_tasks` uses
`scheduler_health_check_threshold` rather than `job_heartbeat_sec` (a sketch
of that alternative follows below). In particular, if `job_heartbeat_sec` is
much larger than `scheduler_health_check_threshold`, won't
`adopt_or_reset_orphaned_tasks` fail most jobs?
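For concreteness, a minimal sketch of the alternative I have in mind (purely hypothetical, not a patch; `zombie_cutoff` and `grace_multiplier` are made-up names):
```python
from datetime import datetime, timedelta, timezone

def zombie_cutoff(job_heartbeat_sec: int, grace_multiplier: float = 2.5) -> datetime:
    # Tolerate the one-interval persistence lag plus a missed beat before
    # declaring a job orphaned, instead of comparing against a fixed
    # scheduler_health_check_threshold.
    return datetime.now(timezone.utc) - timedelta(seconds=job_heartbeat_sec * grace_multiplier)
```
With `job_heartbeat_sec=30` the cutoff would sit 75s back, which the stale-by-30s heartbeats above would still clear comfortably.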