Re: [I] GlueJobOperator can hit duplicate XCom keys across retry / deferral / resume in 3.1.8 [airflow]

via GitHub Wed, 15 Apr 2026 05:44:45 -0700


shaleena commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4252088077


   Thanks @amoghrajesh. I checked the deferrable and non-deferrable try logs,
   and they seem to rule out the trigger-timeout / failed-trigger theory for
   this case.
   
   ### Deferrable try 1
   
   The trigger path looks normal:
   
   ```text
   [2026-04-10 06:01:05] INFO - Status of AWS Glue job is: RUNNING
   [2026-04-10 06:02:05] INFO - Status of AWS Glue job is: RUNNING
   [2026-04-10 06:03:06] INFO - Status of AWS Glue job is: RUNNING
   [2026-04-10 06:04:06] INFO - Trigger fired event ... result=TriggerEvent<{'status': 'success', 'run_id': 'jr_7bf3d105...'}>
   [2026-04-10 06:04:06] INFO - trigger completed ...
   ```
   
   The task then resumes on:
   
   ```text
   [2026-04-10 06:04:08] INFO - TaskInstance Details ... try_number=1
   ```
   
   and immediately fails during the standard XCom auto-push path:
   
   ```text
   [2026-04-10 06:04:08] INFO - Pushing xcom ti=RuntimeTaskInstance(...)
   [2026-04-10 06:04:08] ERROR - Task failed with exception
   
   duplicate key value violates unique constraint "xcom_pkey"
   DETAIL:  Key (dag_run_id, task_id, map_index, key)=(10645, run_job_task, -1, return_value) already exists.
   ```
   
   
   ```text
   task_runner.py ... _push_xcom_if_needed
   task_runner.py ... _xcom_push
   xcom.py ... set
   comms.py ... send
   ```
   
   And after failure:
   
   ```text
   [2026-04-10 06:04:09] WARNING - No XCom value found; defaulting to None. key=glue_job_run_details ...
   ```
   
   ### Deferrable try 2
   
   The same DAG run then shows the same pattern again on try 2:
   
   ```text
   [2026-04-10 06:12:16] INFO - Trigger fired event ... result=TriggerEvent<{'status': 'success', 'run_id': 'jr_e92acd...'}>
   [2026-04-10 06:12:16] INFO - trigger completed ...
   [2026-04-10 06:12:40] INFO - TaskInstance Details ... try_number=2
   [2026-04-10 06:12:41] INFO - Pushing xcom ti=RuntimeTaskInstance(...)
   [2026-04-10 06:12:41] ERROR - Task failed with exception
   DETAIL:  Key (dag_run_id, task_id, map_index, key)=(10645, run_job_task, -1, return_value) already exists.
   ```
   
   This suggests:
   
   * the trigger is completing successfully
   * the failure already occurs on the resumed leg of **try 1**
   * so this does not appear to depend on:
     * trigger timeout
     * trigger failure
     * stale XCom left behind only by an earlier retry
   
   For this run, the conflicting `return_value` seems to already exist by the
   time `_push_xcom_if_needed` runs on the first resumed attempt.
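
To make the constraint error concrete, here is a minimal sketch of why a second insert on the same key tuple fails. It is a simplified model only: it uses an in-memory SQLite table instead of the real Airflow metadata schema, the `value` column and the `jr_first` / `jr_second` values are placeholders, and `glue_run_id` is a hypothetical custom key. It does not explain where the first `return_value` row comes from.

```python
import sqlite3

# Simplified stand-in for the XCom table: only the four primary-key columns
# named in the error message, plus a placeholder value column.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE xcom (
        dag_run_id INTEGER NOT NULL,
        task_id    TEXT    NOT NULL,
        map_index  INTEGER NOT NULL,
        key        TEXT    NOT NULL,
        value      TEXT,
        PRIMARY KEY (dag_run_id, task_id, map_index, key)
    )
    """
)


def insert_xcom(conn, dag_run_id, task_id, map_index, key, value):
    """Plain INSERT (no upsert); return False if the primary key already exists."""
    try:
        conn.execute(
            "INSERT INTO xcom VALUES (?, ?, ?, ?, ?)",
            (dag_run_id, task_id, map_index, key, value),
        )
        return True
    except sqlite3.IntegrityError:
        return False


# Same key tuple as in the log: the second plain insert collides.
first = insert_xcom(conn, 10645, "run_job_task", -1, "return_value", "jr_first")
second = insert_xcom(conn, 10645, "run_job_task", -1, "return_value", "jr_second")

# A different key (hypothetical name) for the same task does not collide.
other_key = insert_xcom(conn, 10645, "run_job_task", -1, "glue_run_id", "jr_second")
```

In this model, any earlier write of `return_value` for the same `(dag_run_id, task_id, map_index)` is enough to make the later plain insert fail, regardless of which try wrote it.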
   
   Also, we see a similar duplicate-`return_value` failure in a
   **non-deferrable** run, on tries 1 and 2, which suggests this may be
   broader than the deferrable resume / `next_method` path alone.
   
   Our workaround remains the same:
   
   * `do_xcom_push = False`
   * avoid `return_value`
   * extract `run_id` from `event["run_id"]`
   * push a custom key instead
   
   That avoids the failure consistently in both deferrable and non-deferrable 
modes.
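
The workaround steps above can be sketched roughly as follows. This is an illustration under assumptions, not our exact code: `push_run_id_from_event`, `CUSTOM_XCOM_KEY`, and `FakeTaskInstance` are hypothetical names, and the fake task instance only stands in for the real one so the snippet runs without Airflow. In the real task, the operator is created with `do_xcom_push=False` and the helper logic runs in the resume path with the trigger event.

```python
from typing import Any, Mapping

# Hypothetical key name; the only requirement is that it is not "return_value".
CUSTOM_XCOM_KEY = "glue_run_id"


class FakeTaskInstance:
    """Stand-in for the real task instance so the sketch runs without Airflow."""

    def __init__(self) -> None:
        self.pushed: dict[str, Any] = {}

    def xcom_push(self, key: str, value: Any) -> None:
        self.pushed[key] = value


def push_run_id_from_event(ti: Any, event: Mapping[str, Any]) -> None:
    """Push the Glue run id under a custom key and return nothing.

    Returning None (together with do_xcom_push=False on the operator) means
    no "return_value" XCom is written by the auto-push path.
    """
    if event.get("status") != "success":
        raise RuntimeError(f"Glue trigger reported a non-success event: {event!r}")
    ti.xcom_push(key=CUSTOM_XCOM_KEY, value=event["run_id"])


# Example with an event shaped like the one in the trigger log above.
ti = FakeTaskInstance()
push_run_id_from_event(ti, {"status": "success", "run_id": "jr_example"})
```

Downstream tasks then pull the custom key explicitly instead of relying on `return_value`.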
   
   Both jobs failed on a scheduled run; no retry or clear was performed.
   
   We can share the full logs if needed.
   



Re: [I] GlueJobOperator can hit duplicate XCom keys across retry / deferral / resume in 3.1.8 [airflow]