peay commented on issue #16573:
URL: https://github.com/apache/airflow/issues/16573#issuecomment-887467242


   After more investigation, I was able to get to the bottom of it. I had 
missed the actual SQLAlchemy queries that update the task instance / task fail 
records, as they are not performed by the scheduler, the web server, or the 
worker, but by the DAG processor (as part of file-level callbacks), whose logs 
are only available locally in my setup.
   
   In my case, the scheduler thinks the job is a zombie. I had also missed that, 
as the log message is at `INFO` level and includes neither the task ID nor the 
DAG ID. Maybe the representation of the task instance should be used to provide 
more context and make searching the logs easier? (A rough sketch follows the log 
excerpt below.)
   
   ```
   [2021-07-27 09:54:44,249] {dag_processing.py:1156} INFO
      Detected zombie job:
      {'full_filepath': '/opt/dags/dags.py', 'msg': 'Detected as zombie', 'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object at 0x7f7362fe5510>, 'is_failure_callback': True}
   ```
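
   As a rough illustration (not a concrete proposal; the attribute names assume `SimpleTaskInstance` exposes `dag_id`, `task_id` and `execution_date`, and the values below are made up), something along these lines would already make the message searchable:

   ```python
   import logging

   logging.basicConfig(level=logging.INFO)
   log = logging.getLogger("dag_processing")

   # Hypothetical values, standing in for what the SimpleTaskInstance carried by
   # the zombie callback request above would contain.
   dag_id, task_id, execution_date = "my_dag", "my_task", "2021-07-27T09:00:00+00:00"
   full_filepath = "/opt/dags/dags.py"

   log.info(
       "Detected zombie job: dag_id=%s task_id=%s execution_date=%s file=%s",
       dag_id, task_id, execution_date, full_filepath,
   )
   ```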
   
   The scheduler is marking the job as `failed` in 
[`adopt_or_reset_orphaned_tasks`](https://github.com/apache/airflow/blob/2.1.1/airflow/jobs/scheduler_job.py#L1831),
 which marks as failed every job that has not sent a heartbeat within the last 
`scheduler_health_check_threshold` (30s).
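
   To make that condition concrete, here is a minimal, self-contained sketch of what the check amounts to (this is not the actual Airflow code; names and values are illustrative):

   ```python
   from datetime import datetime, timedelta, timezone

   # scheduler_health_check_threshold, 30s in this setup
   HEALTH_CHECK_THRESHOLD = timedelta(seconds=30)

   def looks_orphaned(state: str, latest_heartbeat: datetime, now: datetime) -> bool:
       """A job still 'running' whose stored heartbeat is older than the threshold gets failed."""
       return state == "running" and latest_heartbeat < now - HEALTH_CHECK_THRESHOLD

   # Illustrative values: a heartbeat recorded 45s ago is already past the 30s threshold.
   now = datetime(2021, 7, 27, 9, 54, 36, tzinfo=timezone.utc)
   print(looks_orphaned("running", now - timedelta(seconds=45), now))  # True
   ```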
   
   This is apparently caused by a combination of two facts:
   
   - I had set `job_heartbeat_sec` to 30s to reduce pressure on the database, 
since my jobs are long-running.
   
   - **Whenever the job heartbeats, it sets `latest_heartbeat` to a point 30s in 
the past rather than to the current time, as shown in the database logs below.** 
I am not sure whether this is intentional or a bug, but it certainly looks 
suspicious. (The small simulation after the logs reproduces the pattern.)
   
       ```sql
       2021-07-27 09:53:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:52:51.398617+00:00'::timestamptz WHERE job.id = 4451
       2021-07-27 09:53:51 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:21.480933+00:00'::timestamptz WHERE job.id = 4451
       2021-07-27 09:54:21 UTC  UPDATE job SET latest_heartbeat='2021-07-27T09:53:51.566774+00:00'::timestamptz WHERE job.id = 4451
       ```
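
   For what it's worth, the following toy simulation (not Airflow code, just one guess at the mechanism) reproduces exactly this pattern if each UPDATE writes the previous heartbeat's time instead of the current one:

   ```python
   from datetime import datetime, timedelta, timezone

   JOB_HEARTBEAT_SEC = timedelta(seconds=30)

   # Start from the first heartbeat in the logs above.
   now = datetime(2021, 7, 27, 9, 53, 21, tzinfo=timezone.utc)
   previous_heartbeat = now - JOB_HEARTBEAT_SEC

   for _ in range(3):
       # Value that ends up in job.latest_heartbeat: the *previous* heartbeat time.
       print(f"{now:%H:%M:%S} UTC  UPDATE job SET latest_heartbeat='{previous_heartbeat:%H:%M:%S}'")
       previous_heartbeat = now      # the real heartbeat time is only written next round
       now += JOB_HEARTBEAT_SEC      # next heartbeat fires one interval later
   ```

   This prints the same three pairs of timestamps as the logs above (up to fractional seconds), so the stored value always lags by one full `job_heartbeat_sec` interval.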
   
   In this example, `adopt_or_reset_orphaned_tasks` then ran 15s after the last 
heartbeat and did:
   
   ```sql
   2021-07-27 09:54:36 UTC  UPDATE job SET state='failed' WHERE job.state = 'running' AND job.latest_heartbeat < '2021-07-27T09:54:06.555688+00:00'::timestamptz
   ```
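
   Plugging the literal timestamps from the logs into that condition confirms it (again just a sketch):

   ```python
   from datetime import datetime, timezone

   # Last value written to job.latest_heartbeat (itself already one interval old).
   last_stored = datetime(2021, 7, 27, 9, 53, 51, 566774, tzinfo=timezone.utc)
   # Cut-off used in the WHERE clause, i.e. roughly "now minus scheduler_health_check_threshold".
   cutoff = datetime(2021, 7, 27, 9, 54, 6, 555688, tzinfo=timezone.utc)

   print(last_stored < cutoff)  # True -> this job matches and is marked 'failed'
   ```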
   
   From my limited understanding, there seem to be two issues:
   
   - `latest_heartbeat` is set to a point 30s in the past instead of the current 
time, as I would expect.
   
   - Although setting `job_heartbeat_sec=30s` makes things worse by increasing 
the odds of this occurring, it seems like it can happen as soon as 
`job_heartbeat_sec > 15s` (see the back-of-the-envelope check after this list). 
More generally, it seems odd that `adopt_or_reset_orphaned_tasks` uses 
`scheduler_health_check_threshold` rather than `job_heartbeat_sec`. In 
particular, if `job_heartbeat_sec` is much larger than 
`scheduler_health_check_threshold`, won't `adopt_or_reset_orphaned_tasks` fail 
most jobs?
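
   A quick back-of-the-envelope check of that `> 15s` claim, assuming the stored `latest_heartbeat` lags by one full heartbeat interval as observed above (worst case: the orphan check runs just before the next heartbeat, so the stored value is roughly `2 * job_heartbeat_sec` old):

   ```python
   from datetime import timedelta

   SCHEDULER_HEALTH_CHECK_THRESHOLD = timedelta(seconds=30)

   for heartbeat_sec in (5, 15, 16, 30, 60):
       # Worst-case staleness of the stored heartbeat just before the next one fires.
       staleness = timedelta(seconds=2 * heartbeat_sec)
       at_risk = staleness > SCHEDULER_HEALTH_CHECK_THRESHOLD
       print(f"job_heartbeat_sec={heartbeat_sec:>2}s -> stored heartbeat up to "
             f"{staleness.total_seconds():.0f}s old, can be failed: {at_risk}")
   ```

   With `job_heartbeat_sec=30s` the stored heartbeat can be up to a minute old, which comfortably exceeds the 30s threshold.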

