stijndehaes edited a comment on issue #16625:
URL: https://github.com/apache/airflow/issues/16625#issuecomment-895776139


   We hit this issue on Airflow 2.1.2: a job went from queued to failed 
without a retry. Looking at the code, I am not sure how to fix it. The 
relevant log messages are emitted in `scheduler_job.py` around line 1238.
   ~Maybe there should be logic here to check if the task needs to be retried 
and change the state to retried if needed? That logic is currently completely 
circumvented by just setting the state from the scheduler.~
   Looking at the code again, a `TaskCallbackRequest` event is sent to the 
processor_agent. It should eventually be processed by the function 
`execute_callbacks`, which calls the task instance method 
`handle_failure_with_callback`. That method should set the state of the task 
instance to `up_for_retry`, but for some reason this does not happen.
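
   To make the expected behaviour concrete, here is a minimal, self-contained 
sketch of the retry decision I would expect the failure handler to make. It is 
not Airflow's code: `FakeTaskInstance`, `is_eligible_to_retry` and 
`handle_failure` are hypothetical stand-ins, and the `max_tries` and `retries` 
values are assumptions, since the logs only show `try_number` 5.

   ```python
   from dataclasses import dataclass


   @dataclass
   class FakeTaskInstance:
       """Hypothetical stand-in for a task instance (not Airflow's TaskInstance)."""
       state: str
       try_number: int
       max_tries: int
       retries: int


   def is_eligible_to_retry(ti: FakeTaskInstance) -> bool:
       # Assumed rule: retries are configured and the attempt budget is not used up.
       return ti.retries > 0 and ti.try_number <= ti.max_tries


   def handle_failure(ti: FakeTaskInstance) -> str:
       # Expected outcome: a failed attempt with retries left becomes "up_for_retry".
       ti.state = "up_for_retry" if is_eligible_to_retry(ti) else "failed"
       return ti.state


   # try_number=5 comes from the logs; max_tries=6 and retries=6 are assumed values.
   ti = FakeTaskInstance(state="queued", try_number=5, max_tries=6, retries=6)
   print(handle_failure(ti))
   ```

   In our case the task instance ended up in `failed` directly, as if this 
decision was never made.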
   
   The relevant logs (the DAG and task names are redacted because they might 
contain sensitive information):
   ```
   | @timestamp              | log |
   |-------------------------|-----|
   | 2021-08-06 17:27:30.210 | [2021-08-06 17:27:30,209] {scheduler_job.py:1254} ERROR - Executor reports task instance <TaskInstance: xxxx 1990-06-10 00:00:00+00:00 [queued]> finished (failed) although the task says its queued. (Info: None) Was the task killed externally? |
   | 2021-08-06 17:27:30.205 | [2021-08-06 17:27:30,204] {scheduler_job.py:1218} INFO - Executor reports execution of xxxx execution_date=1990-06-10 00:00:00+00:00 exited with status failed for try_number 5 |
   | 2021-08-06 17:27:30.204 | [2021-08-06 17:27:30,204] {kubernetes_executor.py:546} INFO - Changing state of (TaskInstanceKey(dag_id='xxxx', task_id='xxxx', execution_date=datetime.datetime(1990, 6, 10, 0, 0, tzinfo=tzlocal()), try_number=5), 'failed', 'xxxx, 'dev', '97113730') to failed |
   | 2021-08-06 17:27:30.203 | [2021-08-06 17:27:30,202] {kubernetes_executor.py:368} INFO - Attempting to finish pod; pod_id: xxxx; state: failed; annotations: {'dag_id': 'xxxx', 'task_id': 'xxxx', 'execution_date': '1990-06-10T00:00:00+00:00', 'try_number': '5'} |
   | 2021-08-06 17:22:23.371 | [2021-08-06 17:22:23,371] {scheduler_job.py:1245} INFO - Setting external_id for <TaskInstance: xxxx 1990-06-10 00:00:00+00:00 [queued]> to 1606 |
   | 2021-08-06 17:22:23.367 | [2021-08-06 17:22:23,367] {scheduler_job.py:1218} INFO - Executor reports execution of xxxx execution_date=1990-06-10 00:00:00+00:00 exited with status queued for try_number 5 |
   2021-08-06 17:27:30.205 | [2021-08-06 17:27:30,204] {scheduler_job.py:1218} INFO - Executor reports execution of xxxxr execution_date=1990-06-10 00:00:00+00:00 exited with status failed for try_number 5
   ```
   
   Kubernetes version (EKS):
   Server Version: version.Info{Major:"1", Minor:"21+", 
GitVersion:"v1.21.2-eks-0389ca3", 
GitCommit:"8a4e27b9d88142bbdd21b997b532eb6d493df6d2", GitTreeState:"clean", 
BuildDate:"2021-07-31T01:34:46Z", GoVersion:"go1.16.5", Compiler:"gc", 
Platform:"linux/amd64"}



