pceric opened a new issue #12111:
URL: https://github.com/apache/airflow/issues/12111


   **Apache Airflow version**: 1.10.12
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl 
version`): 1.15.9
   
   **Environment**: 
   - **Cloud provider or hardware configuration**: AWS
   - **OS** (e.g. from /etc/os-release): Debian 9 (Stretch)
   
   **What happened**: As of Airflow 1.10.12, and going back to sometime around 
1.10.10 or 1.10.11, the behavior of the retry mechanism in the kubernetes pod 
operator regressed. Previously, when a pod failed due to an error, Airflow would 
spin up a new pod in kubernetes on retry. As of 1.10.12, Airflow tries to 
re-use the same broken pod over and over:
   
   `INFO - found a running pod with labels {'dag_id': 'my_dag', 'task_id': 
'my_task', 'execution_date': '2020-11-04T1300000000-e807cde8a', 'try_number': 
'6'} but a different try_number. Will attach to this pod and monitor instead of 
starting new one`
   
   This is bad because most failures we encounter are due to the underlying 
"physical" hardware failing; retrying on the same pod is pointless, as it will 
never succeed.
   
   **What you expected to happen**: I would expect the k8s Airflow operator to 
start a new pod, allowing it to be scheduled on a new k8s node that does 
not have an underlying "physical" hardware problem, as it did on earlier 
versions of Airflow.
   
   **How to reproduce it**: Run a kubernetes pod operator task with a retry 
count set, and break the node so that the task can never succeed on it.
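   A minimal DAG along these lines should trigger the retry path; the dag/task 
names, image, and command below are placeholders I chose for illustration, not 
taken from the original report:

   ```python
   # Hypothetical repro sketch: a KubernetesPodOperator task that always fails,
   # with retries enabled so the scheduler re-runs it.
   from datetime import datetime, timedelta

   from airflow import DAG
   from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

   with DAG(
       dag_id="k8s_retry_repro",
       start_date=datetime(2020, 11, 1),
       schedule_interval=None,
   ) as dag:
       task = KubernetesPodOperator(
           task_id="failing_task",
           name="failing-task",
           namespace="default",
           image="busybox",
           cmds=["sh", "-c", "exit 1"],  # always fails, forcing a retry
           retries=5,
           retry_delay=timedelta(minutes=1),
           # Leaving the failed pod around appears to let the retry
           # find and re-attach to it instead of creating a new one.
           is_delete_operator_pod=False,
       )
   ```

   On 1.10.12, each retry should log the "found a running pod ... but a 
different try_number" message above instead of launching a fresh pod.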


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
