mgorsk1 opened a new issue #17238:
URL: https://github.com/apache/airflow/issues/17238


   **Apache Airflow version**:
   
   **2.1.2**
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   
   **v1.19.4**
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: self-hosted k8s
   - **OS** (e.g. from /etc/os-release):
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**: **KubernetesExecutor**
   
   **What happened**:
   
   We are running a DAG on k8s using DummyOperator, PythonOperator and 
KubernetesPodOperator. A minimal sketch of the kind of DAG we run is shown below.
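   For context, here is a minimal sketch of that DAG structure (task ids, the callable, image and namespace are illustrative placeholders, not our real DAG):
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.dummy import DummyOperator
   from airflow.operators.python import PythonOperator
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import (
       KubernetesPodOperator,
   )
   
   with DAG(
       dag_id="example_mixed_operators",  # illustrative name
       start_date=datetime(2021, 1, 1),
       schedule_interval="@daily",
   ) as dag:
       start = DummyOperator(task_id="start")
   
       transform = PythonOperator(
           task_id="transform",
           python_callable=lambda: print("doing work"),  # placeholder callable
       )
   
       pod_task = KubernetesPodOperator(
           task_id="pod_task",
           name="pod-task",
           namespace="airflow",            # illustrative namespace
           image="python:3.8-slim",        # illustrative image
           cmds=["python", "-c", "print('hello')"],
       )
   
       end = DummyOperator(task_id="end")
   
       start >> transform >> pod_task >> end
   ```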
   
   From time to time, random tasks in the DAG are marked as `skipped`, although 
we are not raising AirflowSkipException in our code and all upstream tasks 
finished successfully:
   
   
![image](https://user-images.githubusercontent.com/5680655/127046878-4c3b7d46-d0cd-40e9-8403-d418f50e55bd.png)
   
   From the UI it looks like the task was not run at all (and there are no task 
execution logs in the UI), but the task pod (worker) actually seems to have been 
executed and finished. We are able to find in the Airflow scheduler logs the 
unique pod name for the `end` task, along with logs proving that it completed OK. 
Additionally, our k8s monitoring confirms that a pod of the same name indeed 
existed in the namespace at the exact time the Airflow logs say it was spawned.
   
   The situation occurs for every type of operator (so it's not only for the 
DummyOperator used for the `end` task).
   
   It also makes the whole DAG run land in the `failed` state.
   
   **What you expected to happen**:
   
   Both the task instance and the DAG run end up in `success`.
   
   **How to reproduce it**:
   
   It occurs randomly and we are not sure how to reproduce it. It seems to 
occur more often when there are a lot of DAGs/tasks running in Airflow at once 
(although we limit task parallelism to 16 per DAG and 32 for the whole Airflow 
instance); see the sketch of those limits below.
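   Roughly, those limits correspond to settings like the following (a sketch of one way to express them with Airflow 2.1.x config keys; this may not be exactly how our deployment sets them):
   
   ```
   # airflow.cfg (values match the limits mentioned above)
   [core]
   parallelism = 32        # max concurrently running task instances across the instance
   dag_concurrency = 16    # max concurrently running task instances per DAG
   ```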
   
   **Anything else we need to know**:
   
   What's also interesting is that it almost always happens to a whole group 
of tasks.
   
   The diagram below shows the rendered DAG and three situations that occur with 
roughly equal probability (a green box means task=success, a pink box means 
task=skipped). TG1, TG2 and TG3 are identifiers for the different operators used 
(either PythonOperator or KubernetesPodOperator):
   
   
![image](https://user-images.githubusercontent.com/5680655/127047947-02c14562-3f02-494c-9a04-fb8925cd9d5d.png)
   
   How often does this problem occur? Once? Every time, etc.?
   
   It occurs randomly - one day the same DAG is OK, the next day it suffers 
from this issue.
   
   Any relevant logs to include? Put them here inside a detail tag:
   
![image](https://user-images.githubusercontent.com/5680655/127048124-af5d17fe-9ea2-4889-b5ce-d3a44a5158f9.png)
   
   The log above shows that Airflow actually executed the worker pod with the 
task and it succeeded, but the status of the task in the UI claims otherwise:
   
![image](https://user-images.githubusercontent.com/5680655/127048222-a8a41a8a-f3f8-44b4-b0da-0d05b6d8faf2.png)
   
