droppoint commented on issue #32928: URL: https://github.com/apache/airflow/issues/32928#issuecomment-1820413530
Hi, everyone! I think I found the root cause of the problem.

**Short answer:** The [`KubernetesExecutor._adopt_completed_pods`](https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L645-L676) function is not compatible with concurrently running schedulers.

**Long answer:** I encountered the issue described here after updating my Airflow instance from 2.4.3 to 2.7.3. After the update, the `executor.open_slots` metric started decreasing, as shown in the picture below. After restarting the schedulers, the `open_slots` metric resets, but then it starts to decline again.

After some investigation, I discovered two things:
1. My `KubernetesExecutor.running` set contains TaskInstances that were completed a while ago.
2. These TaskInstances should not be in this scheduler, because they were completed by my other scheduler.

After digging through the logs, I found that the pods belonging to those task instances had been adopted. The log entry "Attempting to adopt pod" (with a capital "A") corresponds to this [line](https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L664) in the code.

The first error in the function occurs [here](https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L673-L676): even if the scheduler fails to adopt a pod, the pod is still added to the `KubernetesExecutor.running` set. This piece of code was changed in #28871 and was merged into [2.5.2](https://airflow.apache.org/docs/apache-airflow/stable/release_notes.html#airflow-2-5-2-2023-03-15). In my case, the pod failed to be adopted because it had already been deleted (error 404). I decided to fix it by simply adding a `continue` in the `except` block (a rough sketch of the change is at the end of this comment).

After the fix, the situation improved a lot, but I still saw some TaskInstances in the `KubernetesExecutor.running` set that didn't belong to that scheduler. Then I found the second problem with this function. [`KubernetesExecutor._adopt_completed_pods`](https://github.com/apache/airflow/blame/main/airflow/providers/cncf/kubernetes/executors/kubernetes_executor.py#L562) is called unconditionally in `KubernetesExecutor.try_adopt_task_instances`, which is called by the `SchedulerJobRunner`. The `SchedulerJobRunner` passes a list of TaskInstances that need to be adopted because their job's last heartbeat missed the health-check timeout. `KubernetesExecutor.try_adopt_task_instances` iterates through that list and tries to adopt (by patching one of the labels) the pods that belong to those TaskInstances; if adoption succeeds, it adds the TaskInstance to the `KubernetesExecutor.running` set. However, `KubernetesExecutor._adopt_completed_pods`, which is called during `try_adopt_task_instances`, does no such filtering: it simply lists all completed pods and tries to adopt every completed pod that is not bound to the current scheduler. This results in the constant adoption of completed pods between schedulers: scheduler 1 adopts the completed pods of scheduler 2 and vice versa.

So, how can we fix this? I think the `_adopt_completed_pods` function should be removed, because the presence of completed pods after a scheduler failure is pretty harmless, and the [`airflow kubernetes cleanup-pods`](https://github.com/apache/airflow/blob/main/airflow/cli/commands/kubernetes_command.py#L77-L150) CLI command can take care of them. Plus, this function has already caused problems before (#26778). But I might be wrong, so if a maintainer could give advice on this situation, it would be great.
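For reference, here is a rough sketch of the `continue` fix I applied to the adoption loop. This is paraphrased from memory and trimmed for brevity, so the helper names (`self._list_pods`, `self._make_safe_label_value`, `annotations_to_key`) and the exact query arguments may differ between provider versions; the only point is that a failed patch should skip the `self.running.add(...)` call:

```python
from kubernetes.client.rest import ApiException


def _adopt_completed_pods(self, kube_client) -> None:
    """Sketch of the adoption loop with the 'continue' fix, not the exact provider code."""
    new_worker_id_label = self._make_safe_label_value(str(self.job_id))
    # List completed pods that still carry another scheduler's airflow-worker label.
    pod_list = self._list_pods(
        {
            "field_selector": "status.phase=Succeeded",
            "label_selector": f"kubernetes_executor=True,airflow-worker!={new_worker_id_label}",
        }
    )
    for pod in pod_list:
        self.log.info("Attempting to adopt pod %s", pod.metadata.name)
        pod.metadata.labels["airflow-worker"] = new_worker_id_label
        try:
            kube_client.patch_namespaced_pod(
                name=pod.metadata.name,
                namespace=pod.metadata.namespace,
                body=pod,
            )
        except ApiException as e:
            self.log.info("Failed to adopt pod %s. Reason: %s", pod.metadata.name, e)
            # The fix: if the patch failed (e.g. the pod was already deleted, 404),
            # do NOT track its task instance as running in this scheduler.
            continue
        self.running.add(annotations_to_key(pod.metadata.annotations))
```

This only addresses the 404 case, though; it does not stop two healthy schedulers from repeatedly adopting each other's completed pods, which is why I still think removing `_adopt_completed_pods` is the cleaner option.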
