amoghrajesh commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4257842253

   Thanks for the detailed logs. The trigger completing successfully rules out 
the trigger error situation I mentioned earlier.
   
   Looking at this again and one the thing really stands out is that 
`glue_job_run_details` gets a 409 within 4 seconds of try 1 starting on a fresh 
run. That key is written at the very beginning of `execute()` (deferrable or 
non deferrable), so for it to already exist in the DB at that point, something 
must have written it just before try 1. 
   
   The most plausible explanation is a prior execution of the same task that 
the scheduler might have killed for missing a heartbeat or such, the worker was 
still alive and committing its xcoms while the newer try cleanup query had 
already run, so the cleanup missed those keys.
   
   To confirm, could you grep the scheduler logs for "Detected a task instance 
without a heartbeat" in that window? If there is such a find, that is the prior 
execution whose in the flight xcom write collided with try 1.
   
   Also useful, what is the value of `task_instance_heartbeat_timeout` and how 
many celery workers are you running?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to