adrianrego opened a new issue, #67813:
URL: https://github.com/apache/airflow/issues/67813

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.2
   
   ### What happened and how to reproduce it?
   
   I started seeing this issue after upgrading from Airflow 3.2.1 with
   apache-airflow-providers-cncf-kubernetes v10.14.0 to Airflow 3.2.2 with
   apache-airflow-providers-cncf-kubernetes v10.17.1.
   
   AI Summary and diagnostic:
   
   **Issue Description**
   
   On scheduler startup, `SchedulerJobRunner.adopt_or_reset_orphaned_tasks` 
crashes with `DetachedInstanceError` while building a log message via 
`repr(ti)`. The exception escapes the scheduler loop and the process exits. 
Because the triggering rows persist in the metadata DB, every restart re-runs 
the same path and re-crashes — a deterministic CrashLoopBackOff. With HA 
replicas, all schedulers die, producing a full scheduling outage. Reproduced 
with both CeleryExecutor and KubernetesExecutor, so it is executor-agnostic.
   
   Traceback (abridged):
   
   ```
   File ".../airflow/jobs/scheduler_job_runner.py", line 2864, in adopt_or_reset_orphaned_tasks
       reset_tis_message.append(repr(ti))
   File ".../airflow/models/taskinstance.py", line 1134, in __repr__
       return prefix + f"[{self.state}] ti_id={self.id}>"
   File ".../sqlalchemy/orm/strategies.py", line 536, in _load_for_state
       raise orm_exc.DetachedInstanceError(...)
   sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <TaskInstance at 0x...>
   is not bound to a Session; deferred load operation of attribute 'state' cannot proceed
   ```
   
   **Root cause**
   
   In `adopt_or_reset_orphaned_tasks`:
   
   1. The query selecting orphaned TIs uses `load_only(TI.dag_id, TI.task_id, 
TI.run_id, TI.external_executor_id)` — `state` is **deferred** (not loaded).
   2. The executor's `try_adopt_task_instances` returns the not-adopted TIs as 
`to_reset`; these instances are no longer bound to a live session.
   3. The reset loop runs `reset_tis_message.append(repr(ti))` **before** 
resetting state.
   4. `TaskInstance.__repr__` evaluates `f"[{self.state}] ti_id={self.id}>"`. 
Reading `self.state` on a detached instance with a deferred `state` column 
fires a lazy load with no session → `DetachedInstanceError`.
   
   The crash happens purely while constructing a log string; no reset work has 
occurred yet.
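
The failure mode described above can be shown outside Airflow with plain SQLAlchemy. This is a minimal sketch, not Airflow code: `FakeTI`, the `fake_ti` table, and `reproduce()` are hypothetical stand-ins, and an in-memory SQLite database is assumed. It only demonstrates that reading a column left out of `load_only(...)` on a detached instance raises `DetachedInstanceError`.

```python
from sqlalchemy import Column, Integer, String, create_engine, select
from sqlalchemy.orm import Session, declarative_base, load_only
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class FakeTI(Base):
    """Hypothetical stand-in for TaskInstance; not Airflow's real model."""

    __tablename__ = "fake_ti"
    id = Column(Integer, primary_key=True)
    task_id = Column(String)
    state = Column(String)

    def __repr__(self):
        # Same shape as the failing f-string: it reads `state`.
        return f"<FakeTI {self.task_id} [{self.state}] ti_id={self.id}>"


def reproduce():
    """Return the name of the exception raised by repr(), or 'no error'."""
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    # Write one row in its own session.
    with Session(engine) as session:
        session.add(FakeTI(id=1, task_id="t1", state="queued"))
        session.commit()

    # Load it in a fresh session with `state` left deferred.
    with Session(engine) as session:
        ti = session.execute(
            select(FakeTI).options(load_only(FakeTI.task_id))
        ).scalar_one()
    # Leaving the `with` block closes the session, so `ti` is now detached.

    try:
        repr(ti)
    except DetachedInstanceError as exc:
        return type(exc).__name__
    return "no error"


if __name__ == "__main__":
    print(reproduce())
```

In this sketch the primary key and `task_id` are loaded, so only the `state` read triggers the lazy load that fails.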
   
   Note: 3.2.2's PR #65711 added `external_executor_id` to the `load_only(...)` 
set but did **not** add `state`, and `__repr__` still reads `self.state`, so 
3.2.2 does not fix this.
   
   **Steps to reproduce**
   
   1. Get one or more TIs into an adoptable state 
(`queued`/`running`/`restarting`) whose `queued_by_job` points at a scheduler 
`job` row that is not `running`, in a still-`running` dag_run — i.e. the normal 
aftermath of an unclean scheduler shutdown (SIGKILL / OOM / pod eviction 
mid-run).
   2. Start a scheduler. During startup adopt, the executor declines to adopt 
those TIs and returns them as `to_reset`.
   3. `repr(ti)` on a detached, deferred-`state` instance raises 
`DetachedInstanceError`; the scheduler exits and crashloops.
   
   Diagnostic query mirroring the scheduler's own selection:
   
   ```sql
   SELECT ti.dag_id, ti.task_id, ti.run_id, ti.state,
          ti.queued_by_job_id, j.state AS job_state, j.latest_heartbeat
   FROM task_instance ti
   JOIN job j      ON j.id = ti.queued_by_job_id
   JOIN dag_run dr ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id
   WHERE ti.state IN ('queued','running','restarting')
     AND j.state IS DISTINCT FROM 'running'
     AND dr.state = 'running';
   ```
   
   
   ### What you think should happen instead?
   
   `adopt_or_reset_orphaned_tasks` should reset (or log) orphaned TIs without 
crashing. `__repr__` in particular should never raise. Any one of the following 
fixes it:
   
   - Make `TaskInstance.__repr__` defensive — guard the deferred `state` read 
(e.g. check `inspect(self).unloaded`, or catch and render `state=<unloaded>`).
   - Add `TI.state` to the `load_only(...)` set in the orphan query.
   - Build the log message from already-loaded columns 
(`dag_id`/`task_id`/`run_id`) instead of `repr()`, or 
`session.refresh()`/re-bind the `to_reset` instances first.
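
As an illustration of the first option, here is a sketch of a defensive `__repr__` built on `inspect(self).unloaded`. It is not a patch against Airflow: `SafeTI`, the `safe_ti` table, and `demo()` are hypothetical names, and the `<unloaded>` placeholder is just one possible rendering.

```python
from sqlalchemy import Column, Integer, String, create_engine, inspect, select
from sqlalchemy.orm import Session, declarative_base, load_only

Base = declarative_base()


class SafeTI(Base):
    """Hypothetical stand-in for TaskInstance with a repr that cannot lazy-load."""

    __tablename__ = "safe_ti"
    id = Column(Integer, primary_key=True)
    task_id = Column(String)
    state = Column(String)

    def _loaded_or_placeholder(self, name):
        # `unloaded` holds the attribute names that have no loaded value,
        # which includes deferred and expired columns.
        if name in inspect(self).unloaded:
            return "<unloaded>"
        return getattr(self, name)

    def __repr__(self):
        task_id = self._loaded_or_placeholder("task_id")
        state = self._loaded_or_placeholder("state")
        return f"<SafeTI {task_id} [{state}] ti_id={self.id}>"


def demo():
    """Load a row with `state` deferred, detach it, and return its repr."""
    engine = create_engine("sqlite://")
    Base.metadata.create_all(engine)

    with Session(engine) as session:
        session.add(SafeTI(id=1, task_id="t1", state="queued"))
        session.commit()

    with Session(engine) as session:
        ti = session.execute(
            select(SafeTI).options(load_only(SafeTI.task_id))
        ).scalar_one()
    # The session is closed here, so `ti` is detached.

    return repr(ti)


if __name__ == "__main__":
    print(demo())
```

Checking `unloaded` before the attribute access avoids the lazy load entirely, so the repr works whether or not the instance is bound to a session. A `try`/`except` around the read would be an alternative with the same visible result.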
   
   Prior reports cover only the 2.2-era variant (#19671, #23682, #58570); none 
track this 3.2.x deferred-`state` repr path.
   
   
   ### Operating System
   
   _No response_
   
   ### Deployment
   
   Other Docker-based deployment
   
   ### Apache Airflow Provider(s)
   
   cncf-kubernetes
   
   ### Versions of Apache Airflow Providers
   
   - apache-airflow-providers-celery==3.20.0
   - apache-airflow-providers-cncf-kubernetes==10.17.1
   - apache-airflow-providers-common-compat==1.15.0
   - apache-airflow-providers-pagerduty==5.2.5
   - apache-airflow-providers-http==6.0.2
   - apache-airflow-providers-amazon==9.29.0
   - apache-airflow-providers-fab==3.6.4
   - apache-airflow-providers-google==22.0.0
   - apache-airflow-providers-standard==1.13.1
   
   ### Official Helm Chart version
   
   1.21.0 (latest released)
   
   ### Kubernetes Version
   
   v1.33.11-eks-40737a8
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


