hussein-awala commented on PR #32249:
URL: https://github.com/apache/airflow/pull/32249#issuecomment-1636610244

   Good investigation. I just created a simple pod for testing:
   ```yaml
   apiVersion: v1
   kind: Pod
   metadata:
     name: test-pod
     labels:
       airflow-worker: scheduler1
   spec:
     restartPolicy: Always
     containers:
     - name: base
       image: ubuntu
       command: ["tail"]
       args: ["-f", "/dev/null"]
   ```
   and a watcher:
   ```python
   from kubernetes import client, config, watch
   
   if __name__ == '__main__':
       config.load_kube_config(context="<context>")
       v1 = client.CoreV1Api()
   
       kwargs = {"label_selector": "airflow-worker=scheduler1"}
   
       w = watch.Watch()
       for event in w.stream(v1.list_namespaced_pod, "<namespace>", **kwargs):
           print("Event: %s %s %s" % (event['type'], event['object'].kind, event['object'].metadata.name))
   ```
   and I got:
   ```
   Event: DELETED Pod test-pod
   ```
   when I patched the label with:
   ```bash
   kubectl --context <my context> --namespace <my namespace> label pod/test-pod airflow-worker=scheduler2 --overwrite
   ```
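The `DELETED` event above is consistent with how label-filtered watches behave: a watch only tracks objects that match its selector, so a pod whose labels stop matching is reported to that watcher as deleted even though it still exists. Below is a minimal pure-Python model of that rule, not the apiserver's code. `matches` and `filtered_watch_event` are hypothetical names, only equality-based selectors are modeled, and the claim that the synthesized `DELETED` event carries the pod's *previous* labels is an assumption based on the apiserver watch cache's behavior.

```python
from typing import Dict, Optional, Tuple

Labels = Dict[str, str]


def matches(selector: Labels, labels: Labels) -> bool:
    """Equality-based label selector: every key=value pair must be present."""
    return all(labels.get(key) == value for key, value in selector.items())


def filtered_watch_event(
    selector: Labels, old_labels: Labels, new_labels: Labels
) -> Optional[Tuple[str, Labels]]:
    """Hypothetical model of the event a label-filtered watch delivers on update.

    Returns (event_type, labels_in_event_object), or None if the watcher
    is not notified at all.
    """
    old_passes = matches(selector, old_labels)
    new_passes = matches(selector, new_labels)
    if old_passes and new_passes:
        return ("MODIFIED", new_labels)
    if new_passes:
        # The pod entered the selector.
        return ("ADDED", new_labels)
    if old_passes:
        # The pod left the selector: reported as DELETED.
        # Assumption: the event object still has the old labels.
        return ("DELETED", old_labels)
    return None


selector = {"airflow-worker": "scheduler1"}
print(filtered_watch_event(
    selector,
    old_labels={"airflow-worker": "scheduler1"},
    new_labels={"airflow-worker": "scheduler2"},
))
# ('DELETED', {'airflow-worker': 'scheduler1'})
```

Under this model, the event scheduler 1 receives still says `airflow-worker=scheduler1`, so a label check on the event object alone cannot tell an adoption apart from a real deletion.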
   However, I don't think your change can fix this issue. When we call:
   ```python
   for event in self._pod_events(kube_client=kube_client, query_kwargs=kwargs):
   ```
   scheduler 1 only fetches events for pods that have `airflow-worker=scheduler1` at the moment of the call. This issue happens after the pod list has been fetched, so in this case `labels.get("airflow-worker", None) != scheduler_job_id` will be false.

   Instead, you can add a check in the `process_status` method before it adds the pod to the `watcher_queue` with a failed state. When we receive a `DELETED` event, we could fetch the latest state of the pod (with its new labels) and compare its `airflow-worker` label with the scheduler ID. If they are the same (the normal case), we fail the TI; if not (the event was produced by an adoption operation), we skip failing the TI. WDYT?
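A minimal sketch of the decision that check would make, assuming the latest pod state is fetched through an injected hook. `should_fail_task_on_deleted` and `fetch_current_labels` are hypothetical names, not Airflow APIs. In a real implementation the hook might wrap `CoreV1Api.read_namespaced_pod` and treat an `ApiException` with status 404 as "pod gone"; treating a missing pod as a genuine deletion is itself an assumption.

```python
from typing import Callable, Dict, Optional

Labels = Dict[str, str]


def should_fail_task_on_deleted(
    pod_name: str,
    scheduler_job_id: str,
    fetch_current_labels: Callable[[str], Optional[Labels]],
) -> bool:
    """Decide whether a DELETED pod event should fail the task instance (TI).

    fetch_current_labels is a hypothetical hook: it returns the pod's current
    labels, or None when the pod no longer exists.
    """
    current = fetch_current_labels(pod_name)
    if current is None:
        # The pod is really gone (assumption), so keep the current behavior.
        return True
    # The pod still exists: fail only if it still belongs to this scheduler.
    # A different airflow-worker value means another scheduler adopted it.
    return current.get("airflow-worker") == scheduler_job_id


# Usage with an in-memory dict standing in for the Kubernetes API.
cluster = {"test-pod": {"airflow-worker": "scheduler2"}}
print(should_fail_task_on_deleted("test-pod", "scheduler1", cluster.get))
# False: scheduler2 adopted the pod, so scheduler1 skips failing the TI
print(should_fail_task_on_deleted("gone-pod", "scheduler1", cluster.get))
# True: the pod no longer exists
```

Injecting the fetch function keeps the decision testable without a cluster; the cost is one extra API call per `DELETED` event.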
   
   (You need to merge/rebase master because #30727 moved these methods to a new module.)
