Jeoffreybauvin commented on issue #61521: URL: https://github.com/apache/airflow/issues/61521#issuecomment-4213131037
> ### Apache Airflow version
> Other Airflow 3 version (please specify below)
>
> ### If "Other Airflow 3 version" selected, which one?
> 3.1.6
>
> ### What happened?
> We have an issue where a task suddenly gets killed by out-of-memory (`exit_code=-9`) without this showing up in the UI's task logs. In some cases we do see a CRITICAL log entry for the given task saying that it indeed exited with code -9. In other cases an empty log is shown (here for attempt 2):
>
> <img alt="Image" width="1231" height="489" src="https://private-user-images.githubusercontent.com/13081142/546116262-45933716-9de4-4e63-9c35-ac6965216f2c.png?jwt=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NzU3MjczNjYsIm5iZiI6MTc3NTcyNzA2NiwicGF0aCI6Ii8xMzA4MTE0Mi81NDYxMTYyNjItNDU5MzM3MTYtOWRlNC00ZTYzLTljMzUtYWM2OTY1MjE2ZjJjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNjA0MDklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjYwNDA5VDA5MzEwNlomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTkxOTllMWFiNzEwYTMyYzQ1MzFiNGIwMDMyZTM1ZDdmNjRjMzUzMzE5NWFmYTMyNTc2MDg4MGQyNTM2MjRhNjEmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.OYxkNsU-oP8iyxl2rqCDDDHfZQQe6m1dTAPynAWuOzc">
>
> If I check the log for attempt 1, I see the following, which also does not say anything about why this worker was killed:
>
> ```
> airflow@airflow-worker-1:/opt/airflow/logs$ cat "/opt/airflow/logs/dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=1.log"
> {"timestamp":"2026-02-06T09:37:41.890482Z","level":"warning","event":"pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.","category":"UserWarning","filename":"/home/airflow/.local/lib/python3.11/site-packages/inspire_schemas/utils.py","lineno":50,"logger":"py.warnings"}
> {"timestamp":"2026-02-06T09:37:42.406828Z","level":"info","event":"DAG bundles loaded: dags-folder","logger":"airflow.dag_processing.bundles.manager.DagBundlesManager","filename":"manager.py","lineno":179}
> {"timestamp":"2026-02-06T09:37:42.407277Z","level":"info","event":"Filling up the DagBag from /opt/airflow/dags/literature/hep_create_dag.py","logger":"airflow.models.dagbag.DagBag","filename":"dagbag.py","lineno":593}
> {"timestamp":"2026-02-06T09:37:46.352446Z","level":"info","event":"AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.","logger":"airflow.providers.amazon.aws.utils.connection_wrapper.AwsConnectionWrapper","filename":"connection_wrapper.py","lineno":331}
> {"timestamp":"2026-02-06T09:37:48.926710Z","level":"info","event":"Done. Returned value was: None","logger":"airflow.task.operators.airflow.providers.standard.decorators.python._PythonDecoratedOperator","filename":"python.py","lineno":217}
> {"timestamp":"2026-02-06T09:37:48.982790Z","level":"error","event":"Top level error","logger":"task","filename":"task_runner.py","lineno":1482,"error_detail":[{"exc_type":"AirflowRuntimeError","exc_value":"API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'failed'}}}","exc_notes":[],"syntax_error":null,"is_cause":false,"frames":[{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1475,"name":"main"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1013,"name":"run"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":207,"name":"send"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":271,"name":"_get_response"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":258,"name":"_from_frame"}],"is_group":false,"exceptions":[]}]}
> {"timestamp":"2026-02-06T09:37:49.121613Z","level":"warning","event":"Process exited abnormally","exit_code":1,"logger":"task"}
> ```
>
> Only by checking the full worker logs can I see that the process was indeed killed due to OOM:
>
> ```
> 2026-02-06T09:37:41.897416Z [info ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] received [celery.worker.strategy] loc=strategy.py:161
> 2026-02-06T09:37:41.919012Z [info ] [c20747ad-5381-4fea-8395-c3aa6a9e92cf] Executing workload in Celery: token='eyJ***' ti=TaskInstance(id=UUID('019c30bc-6c04-7dba-8588-3c592e04a708'), dag_version_id=UUID('019c299c-4324-7618-8b47-ed60ca990ba7'), task_id='halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories', dag_id='hep_create_dag', run_id='579c45ef-2c6d-471f-a893-1fd4cc26fbb2', try_number=2, map_index=-1, pool_slots=1, queue='default', priority_weight=32, executor_config=None, parent_context_carrier={}, context_carrier={}) dag_rel_path=PurePosixPath('literature/hep_create_dag.py') bundle_info=BundleInfo(name='dags-folder', version=None) log_path='dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=2.log' type='ExecuteTask' [airflow.providers.celery.executors.celery_executor_utils] loc=celery_executor_utils.py:156
> 2026-02-06T09:37:41.952184Z [info ] Secrets backends loaded for worker [supervisor] backend_classes=['EnvironmentVariablesBackend'] count=1 loc=supervisor.py:1975
> 2026-02-06T09:37:42.004915Z [info ] Process exited [supervisor] exit_code=-9 loc=supervisor.py:710 pid=836631 signal_sent=SIGKILL
> 2026-02-06T09:37:42.017909Z [error ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] raised unexpected: ServerResponseError('Server returned error') [celery.app.trace] loc=trace.py:285
> Traceback (most recent call last):
>   File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 479, in trace_task
>     R = retval = fun(*args, **kwargs)
>                  ^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 779, in __protected_call__
>     return self.run(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 164, in execute_workload
>     supervise(
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 1984, in supervise
>     process = ActivitySubprocess.start(
>               ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 955, in start
>     proc._on_child_started(ti=what, dag_rel_path=dag_rel_path, bundle_info=bundle_info)
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 966, in _on_child_started
>     ti_context = self.client.task_instances.start(ti.id, self.pid, start_date)
>                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 215, in start
>     resp = self.client.patch(f"task-instances/{id}/run", content=body.model_dump_json())
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 1218, in patch
>     return self.request(
>            ^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 338, in wrapped_f
>     return copy(f, *args, **kw)
>            ^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 477, in __call__
>     do = self.iter(retry_state=retry_state)
>          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 378, in iter
>     result = action(retry_state)
>              ^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 400, in <lambda>
>     self._add_action_func(lambda rs: rs.outcome.result())
>                                      ^^^^^^^^^^^^^^^^^^^
>   File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 449, in result
>     return self.__get_result()
>            ^^^^^^^^^^^^^^^^^^^
>   File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
>     raise self._exception
>   File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 480, in __call__
>     result = fn(*args, **kwargs)
>              ^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 885, in request
>     return super().request(*args, **kwargs)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
>     return self.send(request, auth=auth, follow_redirects=follow_redirects)
>            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
>     response = self._send_handling_auth(
>                ^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
>     response = self._send_handling_redirects(
>                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 999, in _send_handling_redirects
>     raise exc
>   File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 982, in _send_handling_redirects
>     hook(response)
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 186, in raise_on_4xx_5xx_with_note
>     return get_json_error(response) or response.raise_for_status()
>            ^^^^^^^^^^^^^^^^^^^^^^^^
>   File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 176, in get_json_error
>     raise err
> airflow.sdk.api.client.ServerResponseError: Server returned error
> Correlation-id=019c3250-7e4c-7c55-b16a-2d607aceb0ef
> ```
>
> ### What you think should happen instead?
> A CRITICAL error should be displayed in the UI specifying that this process died due to OOM. I've seen this shown sometimes ([I see it also implemented here](https://github.com/apache/airflow/blob/58a28167bb7c511ead89370bbd7ff2f70d86ef58/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L822-L823)), but there are cases where this code is not reached, as shown above, even though the logs do show that a SIGKILL happened with `exit_code=-9`.
>
> ### How to reproduce
> Have many small tasks running on a Celery worker that doesn't have enough memory to process them all.
>
> ### Operating System
> Debian GNU/Linux 12 (bookworm)
>
> ### Versions of Apache Airflow Providers
> _No response_
>
> ### Deployment
> Official Apache Airflow Helm Chart
>
> ### Deployment details
> Helm chart deployed on k8s.
>
> ### Anything else?
> _No response_
>
> ### Are you willing to submit PR?
> * [ ] Yes I am willing to submit a PR!
>
> ### Code of Conduct
> * [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

Did you get a chance to see why the memory is increasing? I have the same issue here, with a lot of OOM kills:

<img width="2091" height="1088" alt="Image" src="https://github.com/user-attachments/assets/4cc58c9a-2f1b-47a3-9c32-ee2d98f21db3" />
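For anyone else debugging this: the `exit_code=-9` in the worker log follows the usual POSIX/`subprocess` convention where a negative return code means the child was terminated by the signal with that number, and signal 9 is SIGKILL, which is what the kernel OOM killer sends. A minimal sketch of that decoding (not Airflow's actual supervisor code, just the convention its `Process exited` log line relies on):

```python
import signal


def describe_exit(returncode: int) -> str:
    """Interpret a child's return code the way subprocess/os.waitpid report it:
    non-negative means a normal exit, negative means death by signal."""
    if returncode >= 0:
        return f"exited normally with code {returncode}"
    sig = signal.Signals(-returncode)  # e.g. -9 -> signal.SIGKILL
    note = ""
    if sig is signal.SIGKILL:
        # SIGKILL cannot be caught, so the task process gets no chance to
        # flush its own logs -- which is consistent with the empty task log
        # in the UI. A likely (but not certain) sender is the kernel OOM killer.
        note = " (SIGKILL: possibly killed by the kernel OOM killer)"
    return f"killed by signal {sig.name} ({-returncode}){note}"


print(describe_exit(-9))
```

On the worker node itself you can usually confirm an OOM kill from the kernel log (`dmesg` or `journalctl -k`, looking for the OOM-killer messages around the timestamp of the `exit_code=-9` line).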
