arjunanan6 commented on issue #33402:
URL: https://github.com/apache/airflow/issues/33402#issuecomment-1680258882

   > I have two questions:
   > 
   > 1. I see in the log:
   >    ```
   >    Deleted pod: TaskInstanceKey(dag_id='SAP_WMD_AGWI_RSRVD_TIMEOUT_H', task_id='SAP_WMD_AGWI_RSRVD_TIMEOUT_H', run_id='scheduled__2023-08-15T10:05:00+00:00', try_number=1, map_index=-1) in namespace airflow-test-ns
   >    ```
   >    Was this pod actually deleted by the executor? (You can check with a new task if you no longer have that information.)
   > 2. If the answer to my first question is yes, then we can conclude that the problem is partial and the executor is removing most of the pods. So my second question: did the undeleted pods complete within about a minute (before or after) of the `KubernetesJobWatcher` exception?
   > 
   > My guess is that we lose some events when the job watcher fails/restarts.
   
   Yes, the pod was actually deleted in that instance, and the same goes for all the other "Deleted pod" log messages. The executor just stops removing pods at a certain point after that. For example, in the log from earlier, the last pod it deleted was "IT_ERPBUT_GRAC_ACTION_USAGE_SYNC_H"; from "IT_ERPBUT_GRAC_BATCH_RISK_ANALYSIS_H" onward it fails to remove pods.
   
   If I were to kill the scheduler now, it would remove all the Completed pods that it previously couldn't, until `KubernetesJobWatcher` dies again at some point (typically 20-30 minutes after launch, from my observation).
   
   
   I prefer not to change `AIRFLOW__KUBERNETES_EXECUTOR__DELETE_WORKER_PODS_ON_FAILURE` because I do want to see errored-out pods. More to the point, the tasks execute successfully; it's just that the `Completed` pods stop getting removed. For now I have a cronjob that removes Completed pods every hour, which acts as a band-aid fix.
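
   An hourly cleanup of this kind could look roughly like the sketch below. It is not the actual cronjob, only an illustration under some assumptions: it uses the official `kubernetes` Python client, runs inside the cluster, and treats a pod shown as `Completed` by `kubectl` as one whose phase is `Succeeded`. The helper names, the dict keys (`name`, `phase`, `finished_at`), and the 10-minute grace period are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def select_completed_pods(pods, now, min_age=timedelta(minutes=10)):
    """Return names of pods in phase Succeeded that finished at least min_age ago.

    `pods` is a list of dicts with the (hypothetical) keys
    "name", "phase" and "finished_at" (a timezone-aware datetime or None).
    """
    names = []
    for pod in pods:
        # Leave Failed/Running/Pending pods alone so errored pods stay visible.
        if pod["phase"] != "Succeeded":
            continue
        finished_at = pod.get("finished_at")
        if finished_at is not None and now - finished_at >= min_age:
            names.append(pod["name"])
    return names


def cleanup_namespace(namespace, min_age=timedelta(minutes=10)):
    """Delete old Succeeded pods in `namespace` (assumes in-cluster credentials)."""
    # Imported lazily so the selection logic above has no third-party dependency.
    from kubernetes import client, config

    config.load_incluster_config()
    api = client.CoreV1Api()
    listed = api.list_namespaced_pod(
        namespace, field_selector="status.phase=Succeeded"
    )

    pods = []
    for item in listed.items:
        # Collect the finish time of every terminated container in the pod.
        finished = [
            cs.state.terminated.finished_at
            for cs in (item.status.container_statuses or [])
            if cs.state and cs.state.terminated and cs.state.terminated.finished_at
        ]
        pods.append(
            {
                "name": item.metadata.name,
                "phase": item.status.phase,
                "finished_at": max(finished) if finished else None,
            }
        )

    for name in select_completed_pods(pods, datetime.now(timezone.utc), min_age):
        api.delete_namespaced_pod(name, namespace)
```

   Calling `cleanup_namespace("airflow-test-ns")` from a scheduled job would then remove only the old `Succeeded` pods. The grace period is there so that a pod the executor is still about to delete on its own is not removed from under it.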


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
