luqic opened a new issue #16625:
URL: https://github.com/apache/airflow/issues/16625


   **Apache Airflow version**: 2.0.2
   
   
   **Kubernetes version**:
   ```
   Client Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.8", GitCommit:"9f2892aab98fe339f3bd70e3c470144299398ace", GitTreeState:"clean", BuildDate:"2020-08-13T16:12:48Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-gke.4900", GitCommit:"2812f9fb0003709fc44fc34166701b377020f1c9", GitTreeState:"clean", BuildDate:"2021-03-19T09:19:27Z", GoVersion:"go1.13.15b4", Compiler:"gc", Platform:"linux/amd64"}
   ```
   
   - **Cloud provider or hardware configuration**: GKE
   
   **What happened**:
   
   After the worker pod for the task failed to start, the task was marked as 
failed with the error message `Executor reports task instance <TaskInstance: datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries 2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?`. 
The task should have been retried, since it still had retries left.
   
   ```
   {kubernetes_executor.py:147} INFO - Event: datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d had an event of type MODIFIED
   {kubernetes_executor.py:202} ERROR - Event: datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d Failed
   {kubernetes_executor.py:352} INFO - Attempting to finish pod; pod_id: datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d; state: failed; annotations: {'dag_id': 'datalake_db_cdc_data_integrity', 'task_id': 'check_integrity_core_prod_my_industries', 'execution_date': '2021-06-14T00:00:00+00:00', 'try_number': '1'}
   {kubernetes_executor.py:532} INFO - Changing state of (TaskInstanceKey(dag_id='datalake_db_cdc_data_integrity', task_id='check_integrity_core_prod_my_industries', execution_date=datetime.datetime(2021, 6, 14, 0, 0, tzinfo=tzlocal()), try_number=1), 'failed', 'datalakedbcdcdataintegritycheckintegritycoreprodmyindustries.17f690ef0328488fadeba2dd00f8175d', 'prod', '1510796520') to failed
   {scheduler_job.py:1210} INFO - Executor reports execution of datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries execution_date=2021-06-14 00:00:00+00:00 exited with status failed for try_number 1
   {scheduler_job.py:1239} ERROR - Executor reports task instance <TaskInstance: datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries 2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally?
   ```
   
   **What you expected to happen**:
   
   The task status should have been set to `up_for_retry` instead of `failed`, 
so that the task would be retried.
   
   
   **Anything else we need to know**:
   
   This error has occurred 6 times over the past 2 months, on seemingly 
random tasks in different DAGs. We run 60 DAGs with 50-100 tasks each every 30 
minutes. The affected tasks are a mix of PythonOperator and 
SparkSubmitOperator. We first saw it in mid-April, when we were on 
Airflow version 2.0.1. We upgraded to Airflow version 2.0.2 in early May, and 
the error has occurred 3 more times since then. 
   
   Also, worker pods failing to start is an error that we encounter 
frequently, but in most cases the affected tasks are correctly marked as 
`up_for_retry` and retried. 
   
   This is currently not a big issue for us since it's so rare, but because 
the affected tasks do not retry, we have to manually clear them to get them to 
rerun. They have all succeeded on the first try after clearing.
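
To make occurrences like these easier to count, here is a minimal sketch that scans scheduler log text for the error above and extracts the affected task. The regular expression is an assumption derived only from the log lines quoted in this report, not from Airflow's source, and the function names (`find_state_mismatches`, `count_by_dag`) are hypothetical. It also assumes the `dag_id` contains no dots; for DAG ids with dots the split between `dag_id` and `task_id` would be wrong.

```python
import re
from collections import Counter

# Pattern for the scheduler error quoted in this report. It is inferred
# from the log output above (an assumption), and tolerates line wrapping
# by matching any whitespace between tokens.
MISMATCH_RE = re.compile(
    r"Executor reports task instance\s+<TaskInstance:\s+"
    r"(?P<dag_id>[^.\s]+)\.(?P<task_id>\S+)\s+"
    r"(?P<execution_date>\S+\s+\S+)\s+\[(?P<state>\w+)\]>\s+"
    r"finished\s+\((?P<reported>\w+)\)"
)


def find_state_mismatches(log_text):
    """Return one dict per 'finished (...) although the task says ...' error."""
    return [m.groupdict() for m in MISMATCH_RE.finditer(log_text)]


def count_by_dag(mismatches):
    """Count how many mismatches were found for each dag_id."""
    return Counter(m["dag_id"] for m in mismatches)


# The error line from this report, used as sample input.
SAMPLE = (
    "{scheduler_job.py:1239} ERROR - Executor reports task instance "
    "<TaskInstance: "
    "datalake_db_cdc_data_integrity.check_integrity_core_prod_my_industries "
    "2021-06-14 00:00:00+00:00 [queued]> finished (failed) although the task "
    "says its queued. (Info: None) Was the task killed externally?"
)

if __name__ == "__main__":
    for hit in find_state_mismatches(SAMPLE):
        print(hit["dag_id"], hit["task_id"], hit["execution_date"])
```

Running this over the scheduler logs for a given period would list the task instances that need to be cleared by hand; it does not change any task state itself.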
   
   I'm not sure whether this issue is related to #10790 or #16285, so I 
created a new one. It's not quite the same as #10790 because the affected tasks 
are not ExternalTaskSensors, and it differs from #16285 because the offending 
lines pointed out there are not present in 2.0.2. 
   
   Thanks!
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

