amoghrajesh commented on issue #65011:
URL: https://github.com/apache/airflow/issues/65011#issuecomment-4251037966

   Thanks for that input @shaleena.
   
   Following the code path, this is what your workaround changes:
   
   - `do_xcom_push = False` means that `GlueJobRunDetailsLink.persist()` returns early, i.e. `glue_job_run_details` isn't written
   - That in turn means `_push_xcom_if_needed` exits early as well, so `return_value` isn't written as an XCom either
   - `_link_GlueJobRunDetailsLink` is still written by `finalize()` every time
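To make the three bullets concrete, here is a minimal, hypothetical model of those write paths. A plain dict stands in for the XCom table, the helper names only mirror the real ones, and the payloads are placeholders. It shows that with `do_xcom_push=False` only the link key gets written.

```python
# Hypothetical model of the write paths described above. The helper names
# only mirror the ones in the comment; this is NOT Airflow's real code.

LINK_KEY = "_link_GlueJobRunDetailsLink"


def persist_run_details_link(xcoms: dict, do_xcom_push: bool) -> None:
    """Stand-in for GlueJobRunDetailsLink.persist(): returns early when pushing is off."""
    if not do_xcom_push:
        return
    xcoms["glue_job_run_details"] = {"job_run_id": "jr-example"}  # placeholder payload


def push_xcom_if_needed(xcoms: dict, do_xcom_push: bool, result: str) -> None:
    """Stand-in for _push_xcom_if_needed(): skips return_value when pushing is off."""
    if not do_xcom_push:
        return
    xcoms["return_value"] = result


def finalize(xcoms: dict) -> None:
    """Stand-in for finalize(): writes the link key unconditionally."""
    xcoms[LINK_KEY] = "https://example.invalid/glue-job-run"  # placeholder URL


def keys_written(do_xcom_push: bool) -> set:
    """Return the XCom keys one run of the modelled operator leaves behind."""
    xcoms: dict = {}
    persist_run_details_link(xcoms, do_xcom_push)
    push_xcom_if_needed(xcoms, do_xcom_push, "jr-example")
    finalize(xcoms)
    return set(xcoms)


print(sorted(keys_written(True)))   # ['_link_GlueJobRunDetailsLink', 'glue_job_run_details', 'return_value']
print(sorted(keys_written(False)))  # ['_link_GlueJobRunDetailsLink']
```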
   
   That last point is important: `_link_GlueJobRunDetailsLink` survives retries and clears without issue because it is always included in `xcom_keys_to_clear` when a task is about to start running. The two XCom keys causing problems are exactly the ones that can be written during a phase that skips cleanup.
   
   So the only explanation I can see is that something causes the task to resume again without `next_method` being cleared. In that case the old XComs are still there because of https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py#L257-L267.
   
   Could you check whether the trigger timed out or failed? I now suspect this line, which leaves `next_method` non-None: https://github.com/apache/airflow/blob/main/airflow-core/src/airflow/models/trigger.py#L314
   
   When the task re-runs, the API endpoint sees a non-None `next_method` because of the above and performs zero cleanup, but `glue_job_run_details` is still there from the first leg of the run.
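Under this hypothesis, the failure mode can be sketched as a tiny, hypothetical model: the `TaskInstanceRow` dataclass stands in for the task instance, `on_task_start` for the cleanup gate in the execution API route, and `"execute_complete"` is only a placeholder for whatever non-None `next_method` a failed or timed-out trigger might leave behind. None of these names are Airflow APIs, and the set of keys cleared is simplified.

```python
# Hypothetical model of the suspected failure mode. It mirrors the control
# flow described above, NOT the actual code in task_instances.py/trigger.py.
from dataclasses import dataclass, field
from typing import Optional

LINK_KEY = "_link_GlueJobRunDetailsLink"


@dataclass
class TaskInstanceRow:
    next_method: Optional[str] = None
    xcoms: dict = field(default_factory=dict)


def on_task_start(ti: TaskInstanceRow, xcom_keys_to_clear: set) -> None:
    """Clear old XComs on a fresh start; do nothing when it looks like a resume."""
    if ti.next_method is not None:
        return  # treated as resuming from deferral: zero cleanup
    for key in xcom_keys_to_clear:
        ti.xcoms.pop(key, None)


def xcoms_seen_on_rerun(next_method_on_rerun: Optional[str]) -> dict:
    """Return the XComs still present when the task is started again."""
    ti = TaskInstanceRow()
    # First leg: run details are written before the task defers.
    ti.xcoms["glue_job_run_details"] = {"job_run_id": "jr-first-leg"}
    # Placeholder for whatever next_method the trigger outcome left on the row.
    ti.next_method = next_method_on_rerun
    on_task_start(ti, {"glue_job_run_details", "return_value", LINK_KEY})
    return ti.xcoms


print(xcoms_seen_on_rerun(None))                # {}
print(xcoms_seen_on_rerun("execute_complete"))  # {'glue_job_run_details': {'job_run_id': 'jr-first-leg'}}
```

If the trigger failure path really leaves `next_method` set, the second call is the situation described above: the re-run starts with the first leg's `glue_job_run_details` still in place.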