LeandroDoldan edited a comment on issue #18041: URL: https://github.com/apache/airflow/issues/18041#issuecomment-949672512
Okay, we solved it:

1. The CPU usage on the database was at 100%.
2. We changed the variable `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` to 20 seconds.
3. The CPU usage decreased to 50%.
4. We upgraded Airflow to `2.1.4`.
5. It seems the bug got fixed, because CPU usage decreased to 3%.
6. All our tasks started dying, marked as zombies. We don't understand why.
7. We set `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` back to 5 seconds.
8. Everything is working perfectly, and we are happily keeping our jobs 🙃

We would really like to understand why `AIRFLOW__SCHEDULER__JOB_HEARTBEAT_SEC` at 20 seconds broke everything. We have `AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD` set to 1800. Our train of thought is: with a heartbeat every 20 seconds, about 90 heartbeats (1800 / 20) should fire within the zombie threshold, and `limit_dttm` _is_ half an hour earlier. So the only reasonable explanation (to us) is that none of the heartbeats updated `LJ.latest_heartbeat`. (I'm referring to the `_find_zombies` method in `manager.py`.)
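To make the reasoning above concrete, here is a minimal sketch of the zombie-detection comparison being described: a task is flagged when its job's `latest_heartbeat` is older than `now` minus the zombie threshold. This is an illustrative reconstruction, not Airflow's actual code; the function and variable names (other than `limit_dttm` and `latest_heartbeat`, which come from the discussion) are assumptions.

```python
from datetime import datetime, timedelta

# Value of AIRFLOW__SCHEDULER__SCHEDULER_ZOMBIE_TASK_THRESHOLD in this report.
ZOMBIE_TASK_THRESHOLD = timedelta(seconds=1800)

def is_zombie(latest_heartbeat: datetime, now: datetime) -> bool:
    """Sketch of the limit_dttm comparison described for _find_zombies."""
    limit_dttm = now - ZOMBIE_TASK_THRESHOLD
    return latest_heartbeat < limit_dttm

now = datetime(2021, 10, 22, 12, 0, 0)

# A job that heartbeated 20 seconds ago is well within the threshold...
assert not is_zombie(now - timedelta(seconds=20), now)

# ...so with a 20-second heartbeat interval, roughly 1800 / 20 = 90
# heartbeats should land inside the window. Only if *none* of them
# updated latest_heartbeat would the task cross limit_dttm:
assert is_zombie(now - timedelta(seconds=1801), now)
```

This matches the commenter's arithmetic: a task should only be declared a zombie if its heartbeat record went stale for the full 1800-second window, which is why the failure at a 20-second interval is surprising.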
