syun64 opened a new issue, #30170:
URL: https://github.com/apache/airflow/issues/30170

   ### Apache Airflow version
   
   Other Airflow 2 version (please specify below)
   
   ### What happened
   
   Airflow version 2.5.0
   
   On reboot, the Airflow scheduler went into a bad state where the scheduler 
loop crashed when it tried to queue task_instances. Interestingly, it returned 
a 200 response on the REST healthcheck endpoint just 15 seconds later, 
signaling that the scheduler was healthy.
   
   `{"metadatabase": {"status": "healthy"}, "scheduler": {"latest_scheduler_heartbeat": "2023-03-16T21:37:30.551002+00:00", "status": "healthy"}}`
   
   Timestamp on traceback = 2023-03-16 17:37:15.865
   ```
   Traceback (most recent call last):
     File "/airflow/__main__.py", line 43, in <module>
       main()
     File "/airflow/__main__.py", line 39, in main
       args.func(args)
     File "/airflow/cli/cli_parser.py", line 52, in command
       return func(*args, **kwargs)
     File "/airflow/utils/cli.py", line 108, in wrapper
       return f(*args, **kwargs)
     File "/airflow/cli/commands/scheduler_command.py", line 73, in scheduler
       _run_scheduler_job(args=args)
     File "/airflow/cli/commands/scheduler_command.py", line 43, in _run_scheduler_job
       job.run()
     File "/airflow/jobs/base_job.py", line 247, in run
       self._execute()
     File "/airflow/jobs/scheduler_job.py", line 759, in _execute
       self._run_scheduler_loop()
     File "/airflow/jobs/scheduler_job.py", line 885, in _run_scheduler_loop
       num_queued_tis = self._do_scheduling(session)
     File "/airflow/jobs/scheduler_job.py", line 991, in _do_scheduling
       num_queued_tis = self._critical_section_enqueue_task_instances(session=session)
     File "/airflow/jobs/scheduler_job.py", line 582, in _critical_section_enqueue_task_instances
       queued_tis = self._executable_task_instances_to_queued(max_tis, session=session)
     File "/airflow/jobs/scheduler_job.py", line 340, in _executable_task_instances_to_queued
       task_instances_to_examine: list[TI] = with_row_locks(
     File "/sqlalchemy/orm/query.py", line 2772, in all
       return self._iter().all()
     File "/sqlalchemy/orm/query.py", line 2915, in _iter
       result = self.session.execute(
     File "/sqlalchemy/orm/session.py", line 1717, in execute
       result = compile_state_cls.orm_setup_cursor_result(
     File "/sqlalchemy/orm/context.py", line 349, in orm_setup_cursor_result
       return loading.instances(result, querycontext)
     File "/sqlalchemy/orm/loading.py", line 89, in instances
       cursor.close()
     File "/sqlalchemy/util/langhelpers.py", line 70, in __exit__
       compat.raise_(
     File "/sqlalchemy/util/compat.py", line 210, in raise_
       raise exception
     File "/sqlalchemy/orm/loading.py", line 69, in instances
       *[
     File "/sqlalchemy/orm/loading.py", line 70, in <listcomp>
       query_entity.row_processor(context, cursor)
     File "/sqlalchemy/orm/context.py", line 2631, in row_processor
       _instance = loading._instance_processor(
     File "/sqlalchemy/orm/loading.py", line 715, in _instance_processor
       primary_key_getter = result._tuple_getter(pk_cols)
     File "/sqlalchemy/engine/result.py", line 961, in _tuple_getter
       return self._metadata._row_as_tuple_getter(keys)
     File "/sqlalchemy/engine/result.py", line 106, in _row_as_tuple_getter
       indexes = self._indexes_for_keys(keys)
     File "/sqlalchemy/engine/cursor.py", line 669, in _indexes_for_keys
       CursorResultMetaData._key_fallback(self, ke.args[0], ke)
     File "/sqlalchemy/engine/cursor.py", line 628, in _key_fallback
       util.raise_(
     File "/sqlalchemy/util/compat.py", line 210, in raise_
       raise exception
   sqlalchemy.exc.NoSuchColumnError: Could not locate column in row for column 'task_instance.dag_id'
   ```
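
The traceback timestamp has no time zone, while the heartbeat in the health response is in UTC. As a rough check of the 15-second gap mentioned above, the sketch below assumes the traceback was logged in local time at UTC-4; that offset is an assumption inferred from the two values, not something stated in the logs.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the traceback was logged in local time at UTC-4.
ASSUMED_LOCAL_TZ = timezone(timedelta(hours=-4))

# Timestamp printed alongside the traceback (no zone information).
traceback_time = datetime.strptime(
    "2023-03-16 17:37:15.865", "%Y-%m-%d %H:%M:%S.%f"
).replace(tzinfo=ASSUMED_LOCAL_TZ)

# Heartbeat reported by the health endpoint (explicitly UTC).
heartbeat_time = datetime.fromisoformat("2023-03-16T21:37:30.551002+00:00")

# Aware datetimes subtract correctly across different offsets.
gap = heartbeat_time - traceback_time
print(f"heartbeat is {gap.total_seconds():.3f}s after the traceback")
```

Under that assumption the reported heartbeat is newer than the crash by roughly 15 seconds.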
   
   Simply bouncing the scheduler brought it back up to normal health.
   
   ### What you think should happen instead
   
   Unexpected errors can happen in an application as complicated as Airflow, but 
the fact that the healthcheck returned a successful response is troubling, 
because it makes readiness checks difficult to rely on.
   
   Regrettably, I was not able to verify whether the health check kept 
returning a successful response after that readiness check, but I'm opening this 
issue to see if anyone else has faced a similar problem before, or has any ideas 
about what could be causing it.
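
For illustration, here is a minimal sketch of a readiness probe that does not trust the `status` field alone and also checks how old `latest_scheduler_heartbeat` is. The function names (`heartbeat_age`, `is_ready`) and the 30-second threshold are hypothetical choices for this example, not part of Airflow. A probe like this would not have flagged the response above, since that heartbeat was only about 15 seconds old; it only bounds how long a stale heartbeat could go unnoticed when the probe is polled repeatedly.

```python
import json
from datetime import datetime, timedelta, timezone


def heartbeat_age(health_body: str, now: datetime):
    """Return the age of the latest scheduler heartbeat, or None if absent."""
    payload = json.loads(health_body)
    raw = payload.get("scheduler", {}).get("latest_scheduler_heartbeat")
    if raw is None:
        return None
    return now - datetime.fromisoformat(raw)


def is_ready(health_body: str, now: datetime,
             max_age: timedelta = timedelta(seconds=30)) -> bool:
    """Ready only if both components report healthy AND the heartbeat is fresh."""
    payload = json.loads(health_body)
    statuses_ok = all(
        payload.get(component, {}).get("status") == "healthy"
        for component in ("metadatabase", "scheduler")
    )
    age = heartbeat_age(health_body, now)
    # Reject missing, future-dated, or stale heartbeats.
    return statuses_ok and age is not None and timedelta(0) <= age <= max_age


# Health response body copied from this report.
BODY = (
    '{"metadatabase": {"status": "healthy"}, "scheduler": '
    '{"latest_scheduler_heartbeat": "2023-03-16T21:37:30.551002+00:00", '
    '"status": "healthy"}}'
)

shortly_after = datetime(2023, 3, 16, 21, 37, 45, tzinfo=timezone.utc)
much_later = shortly_after + timedelta(minutes=2)

print(is_ready(BODY, shortly_after))  # heartbeat still fresh
print(is_ready(BODY, much_later))     # heartbeat now stale
```

Polling such a check a few times before declaring the scheduler ready would catch a scheduler that stops heartbeating after startup, at the cost of a slower readiness signal.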
   
   ### How to reproduce
   
   The issue is difficult to reproduce - it is the first time I have seen this 
issue in 3 months of using Airflow.
   
   ### Operating System
   
   Red Hat Enterprise Linux Server 7.6 (Maipo)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Virtualenv installation
   
   ### Deployment details
   
   _No response_
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

