bhavaniravi opened a new issue #11190:
URL: https://github.com/apache/airflow/issues/11190


   
   **Apache Airflow version**:
   1.10.
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`): 1.16
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AWS
   - **OS** (e.g. from /etc/os-release): 
   - **Kernel** (e.g. `uname -a`):
   - **Install tools**:
   - **Others**:
   
   **What happened**:
   
   We are running an Airflow setup in production on a Kubernetes cluster. We 
found that some of the tasks using `KubernetesPodOperator` are failing abruptly. 
   
   The Airflow task logs showed `Pod took too long to start` along with the 
pod ID.
   
   On checking the pod ID in Rancher, we found that the pod had run and 
successfully completed the task. 
   
   I spent some time digging through the Airflow code. When the startup 
timeout is exceeded, the `run_pod` function:
   1. Raises an exception
   2. Sets the task state to failed
   
   But it doesn't do anything about the pod that is still in the pending state. 
   
   ```
   def run_pod(self, pod, startup_timeout=120, get_logs=True):
       ...
       if delta.total_seconds() >= startup_timeout:
           raise AirflowException("Pod took too long to start")
   ```
   
   **What you expected to happen**:
   
   With the `KubernetesPodOperator`, if the task is marked as failed for any 
reason, the corresponding task pod should be shut down.
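
A minimal sketch of the behaviour I have in mind, not the actual Airflow implementation. The names `run_pod_with_cleanup`, `is_pending`, `delete_pod` and `PodStartupTimeout` are hypothetical stand-ins (for the launcher logic, the Kubernetes API calls and `AirflowException`) so that the example runs on its own. The only point is that the pod is deleted before the timeout exception is raised:

```python
import time


class PodStartupTimeout(Exception):
    """Hypothetical stand-in for AirflowException in this sketch."""


def run_pod_with_cleanup(is_pending, delete_pod, startup_timeout=120,
                         poll_interval=1.0, clock=time.monotonic,
                         sleep=time.sleep):
    """Wait for a pod to leave the pending state, cleaning up on timeout.

    is_pending: callable returning True while the pod is still pending
                (assumed to wrap a Kubernetes API status check).
    delete_pod: callable that deletes the pod
                (assumed to wrap a Kubernetes API delete call).
    """
    start = clock()
    while is_pending():
        if clock() - start >= startup_timeout:
            # Remove the pod first, so it cannot start and run later
            # while the task is already marked as failed.
            delete_pod()
            raise PodStartupTimeout("Pod took too long to start")
        sleep(poll_interval)
```

With this ordering, a pod that is still pending when the timeout fires is removed rather than left to be scheduled later, which is what we saw in Rancher.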
   
   **How to reproduce it**:
   
   1. Have the pod hang in the pending state for more than 120 seconds
   2. Reduce `startup_timeout` to a lower number 
   
   
   **Anything else we need to know**:
   The failure occurs randomly.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

