LeandroDoldan edited a comment on issue #18041:
URL: https://github.com/apache/airflow/issues/18041#issuecomment-949672512


   Okay, we solved it.
   
   1. The CPU usage on the database was at 100%.
   2. We changed the variable `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to 20 seconds.
   3. The CPU usage decreased to 50%.
   4. We updated Airflow to `2.1.4`.
   5. The bug seems to be fixed, since the CPU usage decreased to 3%.
   6. All our tasks started dying because they were being marked as zombies, and we didn't understand why.
   7. We set `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` back to 5 seconds.
   8. Everything is working perfectly, and we are happily keeping our jobs 🙃 
   
   We would really like to understand why setting `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to 20 seconds broke everything.
   
   We have `AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD` set to 1800 (30 minutes).
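
   For reference, these are the two settings involved, expressed as environment variables (the values are the ones described above, with the heartbeat back at the value that works for us):

   ```properties
   # Interval (seconds) at which task jobs heartbeat; 20 s broke things for us, 5 s works.
   AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC=5
   # How long (seconds) a running task may go without a heartbeat before it is marked as a zombie.
   AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD=1800
   ```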
   
   Our reasoning is: with a heartbeat every 20 seconds, there should be roughly 90 heartbeats (1800 / 20) within the zombie threshold window, and `limit_dttm` _is_ computed as half an hour before the check. So the only explanation that makes sense to us is that none of those heartbeats actually update `LJ.latest_heartbeat`. (We're referring to the `_find_zombies` method in the `manager.py` file.)
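
   To make that reasoning concrete, here is a simplified sketch of the comparison we think `_find_zombies` is making; this is an illustration with made-up helper names, not the actual Airflow source:

   ```python
   from datetime import datetime, timedelta, timezone

   # Values taken from our configuration above.
   JOB_HEARTBEAT_SEC = 20        # AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC (the value that broke things)
   ZOMBIE_THRESHOLD_SEC = 1800   # AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD

   def is_zombie(latest_heartbeat: datetime, now: datetime) -> bool:
       """A running task's job is treated as a zombie when its latest heartbeat
       is older than limit_dttm, i.e. older than (now - zombie threshold)."""
       limit_dttm = now - timedelta(seconds=ZOMBIE_THRESHOLD_SEC)
       return latest_heartbeat < limit_dttm

   now = datetime.now(timezone.utc)
   # A heartbeat from 25 minutes ago is still inside the 30-minute window:
   print(is_zombie(now - timedelta(minutes=25), now))  # False
   # With a 20 s heartbeat interval there should be about 1800 / 20 = 90
   # heartbeats inside that window, so a job should only cross limit_dttm
   # if those heartbeats never actually update LJ.latest_heartbeat.
   ```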
   

