nicolamarangoni opened a new issue, #25343:
URL: https://github.com/apache/airflow/issues/25343

   ### Apache Airflow version
   
   2.3.3 (latest released)
   
   ### What happened
   
   After running for a certain period (few minutes until several hours 
depending on the number of active DAGs in the environment) The scheduler 
crashes with the following error message:
   ```
   [2022-07-26 15:07:24,362] {executor_loader.py:105} INFO - Loaded executor: 
CeleryExecutor
   [2022-07-26 15:07:24,363] {scheduler_job.py:1252} INFO - Resetting orphaned 
tasks for active dag runs
   [2022-07-26 15:07:25,585] {celery_executor.py:532} INFO - Adopted the 
following 1 tasks from a dead executor
       <TaskInstance: freewheel_uafl_data_scala.freewheel.delivery_data 
scheduled__2022-07-25T04:15:00+00:00 [running]> in state STARTED
   [2022-07-26 15:07:35,881] {scheduler_job.py:1381} WARNING - Failing (1) jobs 
without heartbeat after 2022-07-26 12:37:35.868798+00:00
   [2022-07-26 15:07:35,881] {scheduler_job.py:1389} ERROR - Detected zombie 
job: {'full_filepath': '/data/dags/09_scala_apps/freewheel_uafl_data_scala.py', 
'msg': 'Detected <TaskInstance: 
freewheel_uafl_data_scala.freewheel.delivery_data 
scheduled__2022-07-25T04:15:00+00:00 [running]> as zombie', 
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object 
at 0x7fb4a1105690>, 'is_failure_callback': True}
   [2022-07-26 15:07:35,883] {scheduler_job.py:769} ERROR - Exception when 
executing SchedulerJob._run_scheduler_loop
   Traceback (most recent call last):
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 752, in _execute
       self._run_scheduler_loop()
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 873, in _run_scheduler_loop
       next_event = timers.run(blocking=False)
     File "/usr/lib/python3.10/sched.py", line 151, in run
       action(*argument, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/utils/event_scheduler.py", 
line 36, in repeat
       action(*args, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/utils/session.py", line 
71, in wrapper
       return func(*args, session=session, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 1390, in _find_zombies
       self.executor.send_callback(request)
     File 
"/usr/lib/python3.10/site-packages/airflow/executors/base_executor.py", line 
363, in send_callback
       self.callback_sink.send(request)
     File "/usr/lib/python3.10/site-packages/airflow/utils/session.py", line 
71, in wrapper
       return func(*args, session=session, **kwargs)
     File 
"/usr/lib/python3.10/site-packages/airflow/callbacks/database_callback_sink.py",
 line 34, in send
       db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
     File "<string>", line 4, in __init__
     File "/usr/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 
481, in _initialize_instance
       with util.safe_reraise():
     File "/usr/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", 
line 70, in __exit__
       compat.raise_(
     File "/usr/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 
208, in raise_
       raise exception
     File "/usr/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 
479, in _initialize_instance
       return manager.original_init(*mixed[1:], **kwargs)
     File 
"/usr/lib/python3.10/site-packages/airflow/models/db_callback_request.py", line 
44, in __init__
       self.callback_data = callback.to_json()
     File 
"/usr/lib/python3.10/site-packages/airflow/callbacks/callback_requests.py", 
line 79, in to_json
       return json.dumps(dict_obj)
     File "/usr/lib/python3.10/json/__init__.py", line 231, in dumps
       return _default_encoder.encode(obj)
     File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
       chunks = self.iterencode(o, _one_shot=True)
     File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
       return _iterencode(o, 0)
     File "/usr/lib/python3.10/json/encoder.py", line 179, in default
       raise TypeError(f'Object of type {o.__class__.__name__} '
   TypeError: Object of type datetime is not JSON serializable
   [2022-07-26 15:07:36,100] {scheduler_job.py:781} INFO - Exited execute loop
   Traceback (most recent call last):
     File "/usr/bin/airflow", line 8, in <module>
       sys.exit(main())
     File "/usr/lib/python3.10/site-packages/airflow/__main__.py", line 38, in 
main
       args.func(args)
     File "/usr/lib/python3.10/site-packages/airflow/cli/cli_parser.py", line 
51, in command
       return func(*args, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/utils/cli.py", line 99, in 
wrapper
       return f(*args, **kwargs)
     File 
"/usr/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", 
line 75, in scheduler
       _run_scheduler_job(args=args)
     File 
"/usr/lib/python3.10/site-packages/airflow/cli/commands/scheduler_command.py", 
line 46, in _run_scheduler_job
       job.run()
     File "/usr/lib/python3.10/site-packages/airflow/jobs/base_job.py", line 
244, in run
       self._execute()
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 752, in _execute
       self._run_scheduler_loop()
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 873, in _run_scheduler_loop
       next_event = timers.run(blocking=False)
     File "/usr/lib/python3.10/sched.py", line 151, in run
       action(*argument, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/utils/event_scheduler.py", 
line 36, in repeat
       action(*args, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/utils/session.py", line 
71, in wrapper
       return func(*args, session=session, **kwargs)
     File "/usr/lib/python3.10/site-packages/airflow/jobs/scheduler_job.py", 
line 1390, in _find_zombies
       self.executor.send_callback(request)
     File 
"/usr/lib/python3.10/site-packages/airflow/executors/base_executor.py", line 
363, in send_callback
       self.callback_sink.send(request)
     File "/usr/lib/python3.10/site-packages/airflow/utils/session.py", line 
71, in wrapper
       return func(*args, session=session, **kwargs)
     File 
"/usr/lib/python3.10/site-packages/airflow/callbacks/database_callback_sink.py",
 line 34, in send
       db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
     File "<string>", line 4, in __init__
     File "/usr/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 
481, in _initialize_instance
       with util.safe_reraise():
     File "/usr/lib/python3.10/site-packages/sqlalchemy/util/langhelpers.py", 
line 70, in __exit__
       compat.raise_(
     File "/usr/lib/python3.10/site-packages/sqlalchemy/util/compat.py", line 
208, in raise_
       raise exception
     File "/usr/lib/python3.10/site-packages/sqlalchemy/orm/state.py", line 
479, in _initialize_instance
       return manager.original_init(*mixed[1:], **kwargs)
     File 
"/usr/lib/python3.10/site-packages/airflow/models/db_callback_request.py", line 
44, in __init__
       self.callback_data = callback.to_json()
     File 
"/usr/lib/python3.10/site-packages/airflow/callbacks/callback_requests.py", 
line 79, in to_json
       return json.dumps(dict_obj)
     File "/usr/lib/python3.10/json/__init__.py", line 231, in dumps
       return _default_encoder.encode(obj)
     File "/usr/lib/python3.10/json/encoder.py", line 199, in encode
       chunks = self.iterencode(o, _one_shot=True)
     File "/usr/lib/python3.10/json/encoder.py", line 257, in iterencode
       return _iterencode(o, 0)
     File "/usr/lib/python3.10/json/encoder.py", line 179, in default
       raise TypeError(f'Object of type {o.__class__.__name__} '
   TypeError: Object of type datetime is not JSON serializable
   ```
   
   ### What you think should happen instead
   
   The scheduler should handle zombie jobs without crashing.
   
   ### How to reproduce
   
   The following conditions are necessary:
   
   - dag-processor and scheduler run in separated containers
   - AirFlow uses the CeleryExecutor
   - There are zombie jobs 
   
   ### Operating System
   
   Alpine Linux 3.16.1
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-apache-hdfs==3.0.1
   apache-airflow-providers-celery==3.0.0
   apache-airflow-providers-cncf-kubernetes==4.2.0
   apache-airflow-providers-common-sql==1.0.0
   apache-airflow-providers-datadog==3.0.0
   apache-airflow-providers-exasol==2.1.3
   apache-airflow-providers-ftp==3.1.0
   apache-airflow-providers-http==4.0.0
   apache-airflow-providers-imap==3.0.0
   apache-airflow-providers-jenkins==3.0.0
   apache-airflow-providers-microsoft-mssql==3.1.0
   apache-airflow-providers-odbc==3.1.0
   apache-airflow-providers-oracle==3.1.0
   apache-airflow-providers-postgres==5.1.0
   apache-airflow-providers-redis==3.0.0
   apache-airflow-providers-slack==5.1.0
   apache-airflow-providers-sqlite==3.1.0
   apache-airflow-providers-ssh==3.1.0
   ```
   
   ### Deployment
   
   Other 3rd-party Helm chart
   
   ### Deployment details
   
   One Pod on Kubernetes containing the following containers
   
   - 1 Container for the webserver service
   - 1 Container for the scheduler service
   - 1 Container for the dag-processor service
   - 1 Container for the flower service
   - 1 Container for the redis service
   - 2 or 3 containers for the celery workers services
   
   Due to a previous issue crashing the scheduler with the message `UNEXPECTED 
COMMIT - THIS WILL BREAK HA LOCKS`, we substitute `scheduler_job.py` with the 
file 
`https://raw.githubusercontent.com/tanelk/airflow/a4b22932e5ac9c2b6f37c8c58345eee0f63cae09/airflow/jobs/scheduler_job.py`.
   Sadly I don't remember which issue or MR exactly but it was related to 
scheduler and dag-processor running in separate containers.
   
   ### Anything else
   
   It looks like that only **the combination of CeleryExecutor and separated 
scheduler and dag-processor** services crashes the scheduler when handling 
zombie jobs.
   The KubernetesExecutor with separated scheduler and dag-processor doesn't 
crash the scheduler.
   The CeleryExecutor with scheduler and dag-processor in the same container 
has still to be tested. I will post an update if I manage to test this scenario.
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to