dirrao opened a new issue, #31957:
URL: https://github.com/apache/airflow/issues/31957

   ### Description
   
   The scheduler runs several housekeeping tasks (adopt_or_reset_orphaned_tasks, 
check_trigger_timeouts, _emit_pool_metrics, _find_zombies, 
clear_not_launched_queued_tasks and _check_worker_pods_pending_timeout) at a 
certain frequency. Right now, we don't have any latency metrics for these 
housekeeping tasks, even though they affect the scheduler heartbeat. It would 
be a good idea to capture these latencies so that we can identify slow tasks 
and tune the Airflow configuration.
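   
   As a rough illustration of the kind of measurement meant here, the sketch 
below times one housekeeping call with a context manager and stores the 
duration in an in-memory dict. It uses only the standard library; the 
`housekeeping_timer` helper, the `timings_ms` sink, and the 
`scheduler.housekeeping.<name>.duration` metric name are all hypothetical and 
not existing Airflow names, and the function body is a stand-in for the real 
scheduler method.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical in-memory sink; a real setup would send these to a metrics backend.
timings_ms = defaultdict(list)


@contextmanager
def housekeeping_timer(name):
    """Record the wall-clock duration of one housekeeping run, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Recorded even if the wrapped call raises, so failures are still timed.
        elapsed_ms = (time.perf_counter() - start) * 1000.0
        # Hypothetical metric name, chosen only for this example.
        timings_ms[f"scheduler.housekeeping.{name}.duration"].append(elapsed_ms)


def adopt_or_reset_orphaned_tasks():
    # Stand-in for the real scheduler method; sleeps to simulate work.
    time.sleep(0.01)


with housekeeping_timer("adopt_or_reset_orphaned_tasks"):
    adopt_or_reset_orphaned_tasks()
```

   Wrapping each housekeeping call this way would give one duration series per 
function, which could then be compared against the heartbeat interval.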
   
   ### Use case/motivation
   
   Running Airflow at a large scale, we have found that the 
adopt_or_reset_orphaned_tasks and clear_not_launched_queued_tasks functions 
can take several minutes (> 5 minutes). This delays the scheduler heartbeat 
and leads to the scheduler instance being restarted or killed. To detect 
these latency issues, we need metrics that capture how long each of these 
functions takes.
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
