droppoint commented on PR #35800:
URL: https://github.com/apache/airflow/pull/35800#issuecomment-1833844894

   > Hi @droppoint let us know what you find
   
   My team and I ran an experiment showing that even if the scheduler shuts 
down abnormally, the TaskInstance still completes normally. The same holds for 
the DagRun and LocalTaskJob of that TaskInstance. The TaskInstance completes 
normally because its state is updated inside the 
[_run_raw_task](https://github.com/apache/airflow/blob/main/airflow/models/taskinstance.py#L2292)
 function, which runs in the worker pod rather than in the scheduler.
   
   Here's a step-by-step breakdown of our experiment:
   0. Set the number of schedulers in the namespace to 2.
   1. Create a DAG that sleeps for 5 minutes.
   2. Set orphaned_tasks_check_interval to 20 minutes.
   3. Run the DAG on scheduler №1.
   4. Wait until the DagRun/Job/TaskInstance/Pod is in the "Running" state.
   5. Kill scheduler №1 and prevent its restart.
   6. Wait until the pod is in the Completed state.
   7. Wait until adoption starts on scheduler №2.
   8. Wait until the cleanup-pods cronjob starts.
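
   The timing in steps 1 and 2 is what makes the experiment work: the task 
finishes well before the orphan check can trigger adoption. Below is a minimal 
sketch of those two values, assuming the setting lives in the `[scheduler]` 
section, is expressed in seconds, and is supplied through Airflow's 
`AIRFLOW__{SECTION}__{KEY}` environment-variable convention. The helper names 
are hypothetical and are not part of our actual setup.

```python
import time

# Values from the experiment: a 5-minute task and a 20-minute orphan check.
SLEEP_SECONDS = 5 * 60
ORPHANED_TASKS_CHECK_INTERVAL_SECONDS = 20 * 60


def sleep_task(seconds=SLEEP_SECONDS, sleep=time.sleep):
    """Hypothetical task body: block for `seconds`, then return normally."""
    sleep(seconds)
    return seconds


def airflow_env_var(section, key):
    """Build the environment variable name that overrides a config option."""
    return f"AIRFLOW__{section.upper()}__{key.upper()}"


# Environment passed to the schedulers (assumption: set via env vars).
scheduler_env = {
    airflow_env_var("scheduler", "orphaned_tasks_check_interval"): str(
        ORPHANED_TASKS_CHECK_INTERVAL_SECONDS
    )
}
```

Because the check interval is four times the task duration, the pod reaches 
the Completed state (step 6) before adoption can start (step 7).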
   
   Results:
   - The TaskInstance/DagRun/Job status changed to "success" after step 6 but 
before step 7.
   - The pod was deleted only after step 8.
   
   So, pods that complete after a scheduler's abnormal shutdown do not cause 
the TaskInstance/DagRun/Job to fail, even if they were never "adopted." Although 
the pod was deleted after step 8 by the cleanup-pods cronjob, I understand the 
concern raised by @JCoder01 that we need to clean up pods properly even in this 
case. As a next step, we'll try to implement a new version of the 
_adopt_completed_pods function that retrieves the IDs of running SchedulerJobs 
and deletes all pods in the Completed state that don't belong to any of them, 
as @dstandish suggested. We'll test this solution on our Airflow setup and 
report back around next week.
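
   To make the proposed selection rule concrete, here is a rough sketch of the 
filtering step only. It is not the actual _adopt_completed_pods implementation: 
`PodInfo`, its fields, and `select_orphaned_completed_pods` are hypothetical 
names, and the sketch assumes each pod records the ID of the SchedulerJob that 
launched it. Note that what kubectl displays as "Completed" corresponds to the 
Kubernetes pod phase "Succeeded".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PodInfo:
    """Hypothetical, simplified view of a worker pod."""
    name: str
    phase: str             # Kubernetes pod phase, e.g. "Running", "Succeeded"
    scheduler_job_id: str  # assumption: ID of the SchedulerJob that launched it


def select_orphaned_completed_pods(pods, running_scheduler_job_ids):
    """Return completed pods whose SchedulerJob is no longer running."""
    running = set(running_scheduler_job_ids)
    return [
        pod
        for pod in pods
        if pod.phase == "Succeeded" and pod.scheduler_job_id not in running
    ]


pods = [
    PodInfo("task-a", "Succeeded", "1"),  # scheduler 1 was killed: delete
    PodInfo("task-b", "Succeeded", "2"),  # scheduler 2 is alive: leave alone
    PodInfo("task-c", "Running", "1"),    # still running: candidate for adoption
]
to_delete = select_orphaned_completed_pods(pods, running_scheduler_job_ids=["2"])
```

In this sketch only `task-a` is selected. Pods that are still running are left 
to the regular adoption path.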


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
