meetri opened a new issue, #24538:
URL: https://github.com/apache/airflow/issues/24538
### Apache Airflow version
2.3.2 (latest released)
### What happened
The scheduler crashes with the exception below. Once it crashes, restarting it
causes it to crash again immediately. To get the scheduler working again, all
DAGs must be paused and every running task must have its state changed to
up_for_retry. We only started noticing this after switching to the
CeleryKubernetesExecutor.
```
[2022-06-16 20:12:04,535] {scheduler_job.py:1350} WARNING - Failing (3) jobs
without heartbeat after 2022-06-16 20:07:04.512590+00:00
[2022-06-16 20:12:04,535] {scheduler_job.py:1358} ERROR - Detected zombie
job: {'full_filepath': '/airflow-efs/dags/Scanner.py', 'msg': 'Detected
<TaskInstance: lmnop-domain-scanner.Macadocious
manual__2022-06-16T02:27:36.281445+00:00 [running]> as zombie',
'simple_task_instance': <airflow.models.taskinstance.SimpleTaskInstance object
at 0x7f96de2fc890>, 'is_failure_callback': True}
[2022-06-16 20:12:04,537] {scheduler_job.py:756} ERROR - Exception when
executing SchedulerJob._run_scheduler_loop
Traceback (most recent call last):
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 739, in _execute
self._run_scheduler_loop()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 839, in _run_scheduler_loop
next_event = timers.run(blocking=False)
File "/usr/local/lib/python3.7/sched.py", line 151, in run
action(*argument, **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line
36, in repeat
action(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line
71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 1359, in _find_zombies
self.executor.send_callback(request)
File
"/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py",
line 218, in send_callback
self.callback_sink.send(request)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line
71, in wrapper
return func(*args, session=session, **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py",
line 34, in send
db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
File "<string>", line 4, in __init__
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line
437, in _initialize_instance
manager.dispatch.init_failure(self, args, kwargs)
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py",
line 72, in __exit__
with_traceback=exc_tb,
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line
211, in raise_
raise exception
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line
434, in _initialize_instance
return manager.original_init(*mixed[1:], **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py",
line 44, in __init__
self.callback_data = callback.to_json()
File
"/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py",
line 79, in to_json
return json.dumps(dict_obj)
File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
[2022-06-16 20:12:04,573] {kubernetes_executor.py:813} INFO - Shutting down
Kubernetes executor
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor
shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor',
task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.707461+00:00',
try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor',
'launch-xyz-pod', 'manual__2022-06-16T19:53:04.707461+00:00', '--local',
'--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,574] {kubernetes_executor.py:773} WARNING - Executor
shutting down, will NOT run task=(TaskInstanceKey(dag_id='lmnop-processor',
task_id='launch-xyz-pod', run_id='manual__2022-06-16T19:53:04.831929+00:00',
try_number=1, map_index=-1), ['airflow', 'tasks', 'run', 'lmnop-processor',
'launch-xyz-pod', 'manual__2022-06-16T19:53:04.831929+00:00', '--local',
'--subdir', 'DAGS_FOLDER/lmnop.py'], None, None)
[2022-06-16 20:12:04,601] {scheduler_job.py:768} INFO - Exited execute loop
Traceback (most recent call last):
File "/pyroot/bin/airflow", line 8, in <module>
sys.exit(main())
File "/pyroot/lib/python3.7/site-packages/airflow/__main__.py", line 38,
in main
args.func(args)
File "/pyroot/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line
51, in command
return func(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/cli.py", line 99,
in wrapper
return f(*args, **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py",
line 75, in scheduler
_run_scheduler_job(args=args)
File
"/pyroot/lib/python3.7/site-packages/airflow/cli/commands/scheduler_command.py",
line 46, in _run_scheduler_job
job.run()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/base_job.py", line
244, in run
self._execute()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 739, in _execute
self._run_scheduler_loop()
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 839, in _run_scheduler_loop
next_event = timers.run(blocking=False)
File "/usr/local/lib/python3.7/sched.py", line 151, in run
action(*argument, **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/utils/event_scheduler.py", line
36, in repeat
action(*args, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line
71, in wrapper
return func(*args, session=session, **kwargs)
File "/pyroot/lib/python3.7/site-packages/airflow/jobs/scheduler_job.py",
line 1359, in _find_zombies
self.executor.send_callback(request)
File
"/pyroot/lib/python3.7/site-packages/airflow/executors/celery_kubernetes_executor.py",
line 218, in send_callback
self.callback_sink.send(request)
File "/pyroot/lib/python3.7/site-packages/airflow/utils/session.py", line
71, in wrapper
return func(*args, session=session, **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/callbacks/database_callback_sink.py",
line 34, in send
db_callback = DbCallbackRequest(callback=callback, priority_weight=10)
File "<string>", line 4, in __init__
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line
437, in _initialize_instance
manager.dispatch.init_failure(self, args, kwargs)
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/langhelpers.py",
line 72, in __exit__
with_traceback=exc_tb,
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line
211, in raise_
raise exception
File "/pyroot/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line
434, in _initialize_instance
return manager.original_init(*mixed[1:], **kwargs)
File
"/pyroot/lib/python3.7/site-packages/airflow/models/db_callback_request.py",
line 44, in __init__
self.callback_data = callback.to_json()
File
"/pyroot/lib/python3.7/site-packages/airflow/callbacks/callback_requests.py",
line 79, in to_json
return json.dumps(dict_obj)
File "/usr/local/lib/python3.7/json/__init__.py", line 231, in dumps
return _default_encoder.encode(obj)
File "/usr/local/lib/python3.7/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/usr/local/lib/python3.7/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/usr/local/lib/python3.7/json/encoder.py", line 179, in default
raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type datetime is not JSON serializable
```
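The recovery steps described above (pause all DAGs, then flip every running task back to up_for_retry) amount to a direct update of the `task_instance` table in the metadata database. As a minimal sketch of what that update looks like, here is a hedged illustration using an in-memory SQLite database as a stand-in for the real metadata DB (the table layout is simplified; in production this would run against Postgres with the scheduler stopped):

```python
import sqlite3

# In-memory SQLite stands in for the Airflow metadata database;
# only the columns relevant to the workaround are modeled here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE task_instance (task_id TEXT, state TEXT)")
conn.executemany(
    "INSERT INTO task_instance VALUES (?, ?)",
    [
        ("Macadocious", "running"),
        ("launch-xyz-pod", "running"),
        ("already_done", "success"),
    ],
)

# The workaround: push every 'running' task back to 'up_for_retry'
# so the scheduler stops re-detecting them as zombies on startup.
conn.execute(
    "UPDATE task_instance SET state = 'up_for_retry' WHERE state = 'running'"
)
states = [row[0] for row in conn.execute("SELECT state FROM task_instance")]
print(states)  # ['up_for_retry', 'up_for_retry', 'success']
```

Only tasks in the `running` state are touched; completed tasks keep their state, which is why pausing the DAGs first matters (it stops new runs from entering `running` while the update happens).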
### What you think should happen instead
The error itself seems minor and should be easy to fix: a datetime should not
make its way unserialized into a JSON payload. The bigger issue is that the
scheduler could not recover on its own and was stuck in an endless
crash-and-restart loop.
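To illustrate why the crash happens and why it looks easy to fix: the callback request dict carries a `datetime` value, which the stock `json` encoder rejects. The snippet below reproduces the `TypeError` and shows one possible fix via a `default=` hook (an illustration only, not necessarily the patch Airflow would ship; the payload keys are made up for the demo):

```python
import json
from datetime import datetime, timezone

# Hypothetical callback payload containing a datetime, mirroring the
# dict that callback_requests.py's to_json() passes to json.dumps().
payload = {
    "full_filepath": "/airflow-efs/dags/Scanner.py",
    "execution_date": datetime(2022, 6, 16, 2, 27, 36, tzinfo=timezone.utc),
}

try:
    json.dumps(payload)
except TypeError as exc:
    print(exc)  # Object of type datetime is not JSON serializable

# One possible fix: a default= hook that renders datetimes as ISO-8601
# strings instead of letting the encoder raise.
encoded = json.dumps(
    payload,
    default=lambda o: o.isoformat() if isinstance(o, datetime) else str(o),
)
print(encoded)
```

Even with serialization fixed, the failure mode remains: because the zombie callback is re-attempted on every scheduler loop, any unhandled exception there takes the whole scheduler down again on restart.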
### How to reproduce
I'm not sure of the simplest step-by-step way to reproduce this, but the
conditions in my Airflow environment were roughly four active DAGs, each with
about 50 max active runs and 50 concurrent tasks, plus one DAG with 150 max
active runs and 50 concurrent tasks (not really that much).
The DAG with 150 max active runs uses the KubernetesExecutor to create a pod
in the local Kubernetes cluster; I think this is why we are suddenly seeing
the issue.
Hopefully this helps in reproducing it.
### Operating System
Debian GNU/Linux 10 (buster)
### Versions of Apache Airflow Providers
apache-airflow-providers-amazon==3.4.0
apache-airflow-providers-celery==2.1.4
apache-airflow-providers-cncf-kubernetes==4.0.2
apache-airflow-providers-ftp==2.1.2
apache-airflow-providers-http==2.1.2
apache-airflow-providers-imap==2.2.3
apache-airflow-providers-postgres==4.1.0
apache-airflow-providers-redis==2.0.4
apache-airflow-providers-sqlite==2.1.3
### Deployment
Other Docker-based deployment
### Deployment details
We build our own Airflow base images using the instructions provided on
your site; here is a snippet of the code we use to install it:
```
RUN pip3 install \
    "apache-airflow[statsd,aws,kubernetes,celery,redis,postgres,sentry]==${AIRFLOW_VERSION}" \
    --constraint \
    "https://raw.githubusercontent.com/apache/airflow/constraints-$AIRFLOW_VERSION/constraints-$PYTHON_VERSION.txt"
```
We then use this Docker image for all of our Airflow components: workers,
scheduler, DAG processor, and the webserver.
This is managed through a custom Helm chart. We have also incorporated
PgBouncer to manage database connections, similar to the publicly available
Helm charts.
### Anything else
The problem occurs quite frequently and makes the system completely unusable.
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)