sweco edited a comment on issue #19329:
URL: https://github.com/apache/airflow/issues/19329#issuecomment-963226239
Hey there, we're observing the same issue on Airflow 2.1.4 in a large DAG
with hundreds of tasks. Looking at the scheduler logs from the moment the
downstream task was marked `upstream_failed`, it appears to have happened
when the pod for an upstream task could not be started.
The question, however, is: why is the task put in a failed state instead of
being retried? And if this is expected behavior, why doesn't the downstream
task wait for all the retries to finish?
```
Pod "pod_name" has been pending for longer than 300 seconds. It will be
deleted and set to failed.
Event: pod_name had an event of type MODIFIED
Event: pod_name Pending
Event: pod_name had an event of type DELETED
Event: Failed to start pod pod_name
Attempting to finish pod; pod_id: pod_name; state: failed; annotations:
{..., 'try_number': 1}
Changing state of task_instance, <TaskInstanceState.FAILED: 'failed'>,
'pod_name', 'airflow', '...') to failed
Executor reports execution of task_id execution_date=execution_date exited
with status failed for try_number 1
Executor reports task instance <TaskInstance: task_id execution_date
[queued]> finished (failed) although the task says its queued. (Info: None) Was
the task killed externally?
Setting task instance <TaskInstance: task_name execution_date [queued]>
state to failed as reported by executor
# Later on
1 tasks up for execution:
<task_id execution_date [scheduled]>
```
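Given the `'try_number': 1` annotation in the log above, one would expect the scheduler to mark the task `up_for_retry` rather than `failed`. Here is a simplified sketch of the decision we would expect — this is not Airflow's actual implementation, just our reading of the documented retry semantics, with hypothetical example values:

```python
def expected_state_after_failed_try(try_number: int, max_tries: int) -> str:
    """Sketch of retry eligibility (NOT Airflow's actual code).

    ``max_tries`` corresponds to the task's ``retries`` setting, and
    ``try_number`` is the attempt that just failed (starting at 1).
    """
    # Retry while configured attempts remain; only fail terminally
    # once the retries are exhausted.
    return "up_for_retry" if try_number <= max_tries else "failed"


# e.g. with retries=2, a failed first try should yield a retry,
# and only the attempt after the last retry should fail terminally:
print(expected_state_after_failed_try(1, 2))  # up_for_retry
print(expected_state_after_failed_try(3, 2))  # failed
```

In the log, however, the executor reports the failure of try 1 and the task instance is set straight to `failed`, which then cascades to the downstream `upstream_failed` state.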
Also, looking at the task that eventually succeeded, it indeed has two tries,
the first of which ended exactly at the timestamp in question.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]