dirrao opened a new pull request, #34985:
URL: https://github.com/apache/airflow/pull/34985

   Problem: Airflow runs the clear_not_launched_queued_tasks function at a configurable frequency (default: every 30 seconds). When Airflow runs on a large Kubernetes cluster (more than 5K pods), this function becomes a bottleneck. Internally, clear_not_launched_queued_tasks loops through each queued task and checks whether the corresponding worker pod exists in the cluster. Today this existence check uses the Kubernetes list-pods API, and each call takes more than 1 second. With 120 queued tasks, the function takes roughly 120 seconds (1s * 120). As a result, the scheduler spends most of its time in this function rather than scheduling tasks, leading to no jobs being scheduled or badly degraded scheduler performance.
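
   For illustration, the current per-task check follows roughly this pattern (a hedged sketch using the kubernetes Python client, not the actual Airflow source; the label names and namespace are assumptions):

```python
# Sketch of the per-task existence check described above.
# Label names and namespace are illustrative, not Airflow's exact code.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def worker_pod_exists(dag_id: str, task_id: str, run_id: str, namespace: str = "airflow") -> bool:
    # One list-pods API call per queued task instance; at ~1s per call,
    # N queued tasks cost ~N seconds inside the scheduler loop.
    selector = f"dag_id={dag_id},task_id={task_id},run_id={run_id}"
    pods = v1.list_namespaced_pod(namespace, label_selector=selector)
    return len(pods.items) > 0
```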
   
   Solution: Use a single Kubernetes list-pods batch API call to fetch all worker pods owned by the scheduler. Build a set of searchable strings from the pod labels, then use this set to determine whether the pod associated with each queued task exists (see the sketch below).
   
   Set element string format:
   
(dag_id=<dag_id>,task_id=<task_id>,airflow-worker=[,map_index=<map_index>],[run_id=<run_id>]|[execution_date=<execution_date>])
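
   A minimal sketch of the batched approach, assuming the kubernetes Python client and Airflow-style pod labels (dag_id, task_id, run_id, airflow-worker); the label selector, namespace, and key format here are simplified illustrations rather than the exact PR code:

```python
# Batched existence check: one list-pods call, then O(1) set lookups.
# Label names, selector, and namespace are assumptions for illustration.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

# Single batch call: fetch every worker pod owned by the scheduler.
pods = v1.list_namespaced_pod("airflow", label_selector="airflow-worker")

# Build the set of searchable strings from each pod's labels.
existing_worker_keys = set()
for pod in pods.items:
    labels = pod.metadata.labels or {}
    key = f"dag_id={labels.get('dag_id')},task_id={labels.get('task_id')},run_id={labels.get('run_id')}"
    existing_worker_keys.add(key)

def worker_pod_exists(dag_id: str, task_id: str, run_id: str) -> bool:
    # A set membership test replaces a per-task API round trip.
    return f"dag_id={dag_id},task_id={task_id},run_id={run_id}" in existing_worker_keys
```

   With this structure, each clear_not_launched_queued_tasks run costs one batch API call plus constant-time lookups, instead of one API call per queued task.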
   
   The details of the issue are described in the ticket below:
   https://github.com/apache/airflow/issues/34877

