syun64 opened a new issue, #30170:
URL: https://github.com/apache/airflow/issues/30170
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
Airflow version 2.5.0
On reboot, the Airflow scheduler went into a bad state: the scheduler loop
crashed when it tried to queue task_instances. Interestingly, the REST
healthcheck endpoint still returned a 200 response about 15 seconds after the
crash, signaling that the scheduler was healthy.
`{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-03-16T21:37:30.551002+00:00", "status": "healthy"}}`
Timestamp on traceback = 2023-03-16 17:37:15.865
```
Traceback (most recent call last):
  File "/airflow/__main__.py", line 43, in <module>
    main()
  File "/airflow/__main__.py", line 39, in main
    args.func(args)
  File "/airflow/cli/cli_parser.py", line 52, in command
    return func(*args, **kwargs)
  File "/airflow/utils/cli.py", line 108, in wrapper
    return f(*args, **kwargs)
  File "/airflow/cli/commands/scheduler_command.py", line 73, in scheduler
    _run_scheduler_job(args=args)
  File "/airflow/cli/commands/scheduler_command.py", line 43, in _run_scheduler_job
    job.run()
  File "/airflow/jobs/base_job.py", line 247, in run
    self._execute()
  File "/airflow/jobs/scheduler_job.py", line 759, in _execute
    self._run_scheduler_loop()
  File "/airflow/jobs/scheduler_job.py", line 885, in _run_scheduler_loop
    num_queued_tis = self._do_scheduling(session)
  File "/airflow/jobs/scheduler_job.py", line 991, in _do_scheduling
    num_queued_tis = self._critical_section_enqueue_task_instances(session=session)
  File "/airflow/jobs/scheduler_job.py", line 582, in _critical_section_enqueue_task_instances
    queued_tis = self._executable_task_instances_to_queued(max_tis, session=session)
  File "/airflow/jobs/scheduler_job.py", line 340, in _executable_task_instances_to_queued
    task_instances_to_examine: list[TI] = with_row_locks(
  File "/sqlalchemy/orm/query.py", line 2772, in all
    return self._iter().all()
  File "/sqlalchemy/orm/query.py", line 2915, in _iter
    result = self.session.execute(
  File "/sqlalchemy/orm/session.py", line 1717, in execute
    result = compile_state_cls.orm_setup_cursor_result(
  File "/sqlalchemy/orm/context.py", line 349, in orm_setup_cursor_result
    return loading.instances(result, querycontext)
  File "/sqlalchemy/orm/loading.py", line 89, in instances
    cursor.close()
  File "/sqlalchemy/util/langhelpers.py", line 70, in __exit__
    compat.raise_(
  File "/sqlalchemy/util/compat.py", line 210, in raise_
    raise exception
  File "/sqlalchemy/orm/loading.py", line 69, in instances
    *[
  File "/sqlalchemy/orm/loading.py", line 70, in <listcomp>
    query_entity.row_processor(context, cursor)
  File "/sqlalchemy/orm/context.py", line 2631, in row_processor
    _instance = loading._instance_processor(
  File "/sqlalchemy/orm/loading.py", line 715, in _instance_processor
    primary_key_getter = result._tuple_getter(pk_cols)
  File "/sqlalchemy/engine/result.py", line 961, in _tuple_getter
    return self._metadata._row_as_tuple_getter(keys)
  File "/sqlalchemy/engine/result.py", line 106, in _row_as_tuple_getter
    indexes = self._indexes_for_keys(keys)
  File "/sqlalchemy/engine/cursor.py", line 669, in _indexes_for_keys
    CursorResultMetaData._key_fallback(self, ke.args[0], ke)
  File "/sqlalchemy/engine/cursor.py", line 628, in _key_fallback
    util.raise_(
  File "/sqlalchemy/util/compat.py", line 210, in raise_
    raise exception
sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'task_instance.dag_id'
```
Simply bouncing the scheduler brought it back up to normal health.
### What you think should happen instead
Unexpected errors can happen in an application as complicated as Airflow, but
the fact that the healthcheck returned a successful response after the crash is
troublesome, as it makes readiness checks difficult to rely on.
Regrettably, I was not able to verify whether the health check kept returning
a successful response after that initial readiness check, but I'm opening this
issue to see if anyone else has faced similar issues before, or has any ideas
on what could be causing the issue.
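
To illustrate the kind of readiness check I have in mind, here is a minimal sketch that parses the health response quoted above and accepts it only if the reported heartbeat is recent. The function name `scheduler_heartbeat_is_fresh` and the `max_age` parameter are hypothetical names of my own, not part of Airflow. Note that a check like this would only flag the scheduler once the heartbeat goes stale, and I have not confirmed that the heartbeat actually stopped in this incident.

```python
import json
from datetime import datetime, timedelta, timezone


def scheduler_heartbeat_is_fresh(health_body: str, now: datetime, max_age: timedelta) -> bool:
    """Return True only if the scheduler reports healthy AND its heartbeat is recent.

    `health_body` is the JSON text returned by the health endpoint.
    `now` must be timezone-aware so it can be compared with the heartbeat.
    """
    scheduler = json.loads(health_body).get("scheduler", {})
    if scheduler.get("status") != "healthy":
        return False
    heartbeat_raw = scheduler.get("latest_scheduler_heartbeat")
    if not heartbeat_raw:
        return False
    # The heartbeat is an ISO 8601 timestamp with a UTC offset.
    heartbeat = datetime.fromisoformat(heartbeat_raw)
    age = now - heartbeat
    # Reject heartbeats from the future as well as stale ones.
    return timedelta(0) <= age <= max_age


# The response body from this report, used here as sample input.
SAMPLE_BODY = (
    '{"metadatabase": {"status": "healthy"}, "scheduler": '
    '{"latest_scheduler_heartbeat": "2023-03-16T21:37:30.551002+00:00", '
    '"status": "healthy"}}'
)

if __name__ == "__main__":
    # In a real probe, `now` would be the current time and the body would
    # come from an HTTP request to the health endpoint.
    current = datetime.now(timezone.utc)
    print(scheduler_heartbeat_is_fresh(SAMPLE_BODY, current, timedelta(seconds=30)))
```

Polling a check like this repeatedly, rather than trusting a single 200 response, would at least catch a scheduler whose loop has died and stopped heartbeating.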
### How to reproduce
The issue is difficult to reproduce - it is the first time I have seen this
issue in 3 months of using Airflow.
### Operating System
Red Hat Enterprise Linux Server 7.6 (Maipo)
### Versions of Apache Airflow Providers
_No response_
### Deployment
Virtualenv installation
### Deployment details
_No response_
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)