luqic opened a new issue #16625:
URL: https://github.com/apache/airflow/issues/16625
**Apache Airflow version**: 2.0.2
**Kubernetes version**:
```
Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8",
GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean",
BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc",
Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+",
GitVersion:"v1.17.17-gke.4900",
GitCommit:"2812f9fb0003709fc44fc34166701b377020f1c9", GitTreeState:"clean",
BuildDate:"2021-03-19T09:19:27Z", GoVersion:"go1.13.15b4", Compiler:"gc",
Platform:"linux/amd64"}
```
- **Cloud provider or hardware configuration**: GKE
**What happened**:
After the worker pod for the task failed to start, the task was marked as
failed with the error message `Executor reports task instance <TaskInstance:
datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries
2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task says
its queued. (Info: None) Was the task killed externally?`. The task should have
been retried, as it still had retries left.
```
{kubernetes_executor.py:147} INFO - Event:
datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d
had an event of type MODIFIED
{kubernetes_executor.py:202} ERROR - Event:
datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d
Failed
{kubernetes_executor.py:352} INFO - Attempting to finish pod; pod_id:
datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d;
state: failed; annotations: {'dag_id': 'datalake_db_cdc_data_integrity',
'task_id': 'check_integrity_core_prod_my_industries', 'execution_date':
'2021-06-14T00:00:00+00:00', 'try_number': '1'}
{kubernetes_executor.py:532} INFO - Changing state of
(TaskInstanceKey(dag_id='datalake_db_cdc_data_integrity',
task_id='check_integrity_core_prod_my_industries',
execution_date=datetime.datetime(2021, 6, 14, 0, 0, tzinfo=tzlocal()),
try_number=1), 'failed',
'datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d',
'prod', '1510796520') to failed
{scheduler_job.py:1210} INFO - Executor reports execution of
datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries
execution_date=2021-06-14 00:00:00+00:00 exited with status failed for
try_number 1
{scheduler_job.py:1239} ERROR - Executor reports task instance
<TaskInstance:
datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries
2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task says
its queued. (Info: None) Was the task killed externally?
```
**What you expected to happen**:
The task status should have been set to `up_for_retry` instead of the task
being failed immediately.
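For context, all of the affected tasks have retries configured; below is a minimal, illustrative sketch of how they are defined (the DAG id, task id, schedule, and retry values are placeholders, not our real configuration):
```
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

# Illustrative retry settings; the real DAGs use their own values.
default_args = {
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_integrity_check",  # placeholder DAG id
    start_date=datetime(2021, 6, 1),
    schedule_interval="*/30 * * * *",
    default_args=default_args,
    catchup=False,
) as dag:
    check = PythonOperator(
        task_id="check_integrity",  # placeholder task id
        python_callable=lambda: None,
    )
```
With `retries` left like this, we expect a failure to put the task into `up_for_retry` rather than `failed`.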
**Anything else we need to know**:
This error has occurred 6 times over the past 2 months, affecting seemingly
random tasks in different DAGs. We run 60 DAGs with 50-100 tasks each, every 30
minutes. The affected tasks are a mix of PythonOperator and
SparkSubmitOperator. We first saw the error in mid-April on Airflow 2.0.1; we
upgraded to Airflow 2.0.2 in early May, and it has occurred 3 more times since
then.
Worker pods failing to start is something we run into regularly, but in most
cases the affected tasks are correctly marked as `up_for_retry` and retried.
This is currently not a big problem for us since it is so rare, but because the
affected tasks do not retry on their own, we have to clear them manually to get
them to rerun. So far they have all succeeded on the first try after being
cleared.
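For completeness, this is roughly what the manual workaround amounts to; a sketch of the programmatic equivalent of clearing (we actually clear from the UI), using the DAG id and execution date from the log above:
```
from airflow.models import DagBag
from airflow.utils import timezone

# Load the affected DAG from the dags folder (DAG id taken from the log above).
dag = DagBag().get_dag("datalake_db_cdc_data_integrity")

# Clear only the failed task instances for the affected execution date so the
# scheduler re-runs them; this is roughly what clearing from the UI does.
dag.clear(
    start_date=timezone.datetime(2021, 6, 14),
    end_date=timezone.datetime(2021, 6, 14),
    only_failed=True,
)
```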
Also, I'm not sure whether this issue is related to #10790 or #16285, so I
created a new one. It doesn't seem to be the same as #10790, since the affected
tasks are not ExternalTaskSensors, nor as #16285, since the offending lines
pointed out there are not present in 2.0.2.
Thanks!