pceric opened a new issue #12111:
URL: https://github.com/apache/airflow/issues/12111
**Apache Airflow version**: 1.10.12
**Kubernetes version (if you are using kubernetes)** (use `kubectl
version`): 1.15.9
**Environment**:
- **Cloud provider or hardware configuration**: AWS
- **OS** (e.g. from /etc/os-release): Debian 9 (Stretch)
**What happened**: As of Airflow 1.10.12, and going back to sometime around
1.10.10 or 1.10.11, the retry behavior of the KubernetesPodOperator regressed.
Previously, when a pod failed due to an error, Airflow would spin up a new pod
in Kubernetes on retry. As of 1.10.12, Airflow instead tries to re-use the same
broken pod over and over:
`INFO - found a running pod with labels {'dag_id': 'my_dag', 'task_id':
'my_task', 'execution_date': '2020-11-04T1300000000-e807cde8a', 'try_number':
'6'} but a different try_number. Will attach to this pod and monitor instead of
starting new one`
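The log above suggests the operator matches an existing pod on its task-identity labels and attaches even when the `try_number` label differs. A minimal sketch of that decision, as a model of the observed behavior rather than the actual Airflow source (the function name and the `reuse_on_try_mismatch` flag are hypothetical):

```python
def pick_pod_action(found_pod_labels, current_try_number, reuse_on_try_mismatch):
    """Model the 'attach vs. start a new pod' decision on retry.

    `found_pod_labels` is the label dict of a running pod already matched
    on dag_id / task_id / execution_date, so only 'try_number' can differ.
    `reuse_on_try_mismatch=True` models the 1.10.12 behavior; False models
    the pre-1.10.12 behavior.
    """
    if found_pod_labels.get("try_number") == str(current_try_number):
        # Same attempt: re-attaching to the running pod is always correct.
        return "attach"
    # Different try_number: 1.10.12 attaches anyway (the regression reported
    # here); earlier versions started a fresh pod, letting the scheduler
    # place it on a healthy node.
    return "attach" if reuse_on_try_mismatch else "start_new"
```

With `reuse_on_try_mismatch=True` this reproduces the logged behavior: a retry attaches to the broken pod labeled with the old try_number instead of starting a new one.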
This is bad because most failures we encounter are due to the underlying
"physical" hardware failing; retrying on the same pod is pointless and will
never succeed.
**What you expected to happen**: I would expect the KubernetesPodOperator to
start a new pod, allowing it to be scheduled on a Kubernetes node that does not
have an underlying "physical" hardware problem, just as it did on earlier
versions of Airflow.
**How to reproduce it**: Run a KubernetesPodOperator task with a retry
count set, then fail the underlying node in a way that the pod can never
succeed.
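A minimal reproduction sketch, assuming an Airflow 1.10.x deployment with Kubernetes support configured. The DAG id, task id, namespace, image, and retry settings are placeholders; the pod just runs long enough for you to break its node:

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG(
    dag_id="my_dag",
    start_date=datetime(2020, 11, 4),
    schedule_interval=None,
) as dag:
    my_task = KubernetesPodOperator(
        task_id="my_task",
        name="my-task",
        namespace="default",
        image="busybox",
        # Long-running command; while it runs, fail the underlying node
        # (e.g. stop the kubelet) so the pod can never succeed.
        cmds=["sh", "-c", "sleep 300"],
        retries=5,
        retry_delay=timedelta(minutes=1),
    )
```

On 1.10.12 each retry should then find the stuck pod by its labels and attach to it instead of scheduling a new pod on a healthy node.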