adrianrego opened a new issue, #67813:
URL: https://github.com/apache/airflow/issues/67813
### Under which category would you file this issue?
Airflow Core
### Apache Airflow version
3.2.2
### What happened and how to reproduce it?
I started seeing this issue after upgrading from Airflow 3.2.1 with
apache-airflow-providers-cncf-kubernetes v10.14.0 to Airflow 3.2.2 with
apache-airflow-providers-cncf-kubernetes v10.17.1.
AI Summary and diagnostic:
**Issue Description**
On scheduler startup, `SchedulerJobRunner.adopt_or_reset_orphaned_tasks`
crashes with `DetachedInstanceError` while building a log message via
`repr(ti)`. The exception escapes the scheduler loop and the process exits.
Because the triggering rows persist in the metadata DB, every restart re-runs
the same path and re-crashes — a deterministic CrashLoopBackOff. With HA
replicas, all schedulers die, producing a full scheduling outage. Reproduced
with both CeleryExecutor and KubernetesExecutor, so it is executor-agnostic.
Traceback (abridged):
```
  File ".../airflow/jobs/scheduler_job_runner.py", line 2864, in adopt_or_reset_orphaned_tasks
    reset_tis_message.append(repr(ti))
  File ".../airflow/models/taskinstance.py", line 1134, in __repr__
    return prefix + f"[{self.state}] ti_id={self.id}>"
  File ".../sqlalchemy/orm/strategies.py", line 536, in _load_for_state
    raise orm_exc.DetachedInstanceError(...)
sqlalchemy.orm.exc.DetachedInstanceError: Parent instance <TaskInstance at 0x...> is not bound to a Session; deferred load operation of attribute 'state' cannot proceed
```
**Root cause**
In `adopt_or_reset_orphaned_tasks`:
1. The query selecting orphaned TIs uses `load_only(TI.dag_id, TI.task_id,
TI.run_id, TI.external_executor_id)` — `state` is **deferred** (not loaded).
2. The executor's `try_adopt_task_instances` returns the not-adopted TIs as
`to_reset`; these instances are no longer bound to a live session.
3. The reset loop runs `reset_tis_message.append(repr(ti))` **before**
resetting state.
4. `TaskInstance.__repr__` evaluates `f"[{self.state}] ti_id={self.id}>"`.
Reading `self.state` on a detached instance with a deferred `state` column
fires a lazy load with no session → `DetachedInstanceError`.
The crash happens purely while constructing a log string; no reset work has
occurred yet.
Note: 3.2.2's PR #65711 added `external_executor_id` to the `load_only(...)`
set but did **not** add `state`, and `__repr__` still reads `self.state`, so
3.2.2 does not fix this.
**Steps to reproduce**
1. Get one or more TIs into an adoptable state
(`queued`/`running`/`restarting`) whose `queued_by_job` points at a scheduler
`job` row that is not `running`, in a still-`running` dag_run — i.e. the normal
aftermath of an unclean scheduler shutdown (SIGKILL / OOM / pod eviction
mid-run).
2. Start a scheduler. During startup adopt, the executor declines to adopt
those TIs and returns them as `to_reset`.
3. `repr(ti)` on a detached, deferred-`state` instance raises
`DetachedInstanceError`; the scheduler exits and crashloops.
Diagnostic query mirroring the scheduler's own selection:
```sql
SELECT ti.dag_id, ti.task_id, ti.run_id, ti.state,
ti.queued_by_job_id, j.state AS job_state, j.latest_heartbeat
FROM task_instance ti
JOIN job j ON j.id = ti.queued_by_job_id
JOIN dag_run dr ON dr.dag_id = ti.dag_id AND dr.run_id = ti.run_id
WHERE ti.state IN ('queued','running','restarting')
AND j.state IS DISTINCT FROM 'running'
AND dr.state = 'running';
```
### What you think should happen instead?
`adopt_or_reset_orphaned_tasks` should reset (or log) orphaned TIs without
crashing. `__repr__` in particular should never raise. Any one of the following
fixes it:
- Make `TaskInstance.__repr__` defensive — guard the deferred `state` read
(e.g. check `inspect(self).unloaded`, or catch and render `state=<unloaded>`).
- Add `TI.state` to the `load_only(...)` set in the orphan query.
- Build the log message from already-loaded columns
(`dag_id`/`task_id`/`run_id`) instead of `repr()`.
- `session.refresh()`/re-bind the `to_reset` instances before calling `repr()`.
Prior reports cover only the 2.2-era variant (#19671, #23682, #58570); none
track this 3.2.x deferred-`state` repr path.
### Operating System
_No response_
### Deployment
Other Docker-based deployment
### Apache Airflow Provider(s)
cncf-kubernetes
### Versions of Apache Airflow Providers
- apache-airflow-providers-celery==3.20.0
- apache-airflow-providers-cncf-kubernetes==10.17.1
- apache-airflow-providers-common-compat==1.15.0
- apache-airflow-providers-pagerduty==5.2.5
- apache-airflow-providers-http==6.0.2
- apache-airflow-providers-amazon==9.29.0
- apache-airflow-providers-fab==3.6.4
- apache-airflow-providers-google==22.0.0
- apache-airflow-providers-standard==1.13.1
### Official Helm Chart version
1.21.0 (latest released)
### Kubernetes Version
v1.33.11-eks-40737a8
### Helm Chart configuration
_No response_
### Docker Image customizations
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)