DonHaul opened a new issue, #61521:
URL: https://github.com/apache/airflow/issues/61521

   ### Apache Airflow version
   
   Other Airflow 3 version (please specify below)
   
   ### If "Other Airflow 3 version" selected, which one?
   
   3.1.6
   
   ### What happened?
   
   We have an issue where a task suddenly gets killed because it runs out of memory (`exit_code=-9`) without anything showing up in the task logs in the UI.
   In some cases we do see a CRITICAL log entry for the task saying that it indeed exited with code -9. In other cases an empty log is shown (here for attempt 2):
   
   <img width="1231" height="489" alt="Image" src="https://github.com/user-attachments/assets/45933716-9de4-4e63-9c35-ac6965216f2c" />
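
   (As a side note on how I am reading the exit code: in Python's subprocess semantics a negative return code means the child process was terminated by that signal number, so `exit_code=-9` is SIGKILL, which is what the kernel OOM killer sends. A minimal sketch of that mapping, just for illustration:)

   ```
   import signal
   import subprocess

   # Spawn a child and SIGKILL it to show where a -9 exit code comes from.
   proc = subprocess.Popen(["sleep", "60"])
   proc.send_signal(signal.SIGKILL)
   proc.wait()

   # Popen.returncode is negative when the child died from a signal:
   # -9 == -signal.SIGKILL, which the supervisor reports as exit_code=-9.
   print(proc.returncode)                   # -9
   print(signal.Signals(-proc.returncode))  # Signals.SIGKILL
   ```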
   
   If I check the log for attempt 1, I see the following, which also does not say anything about why the worker process was killed:
   ```
   airflow@airflow-worker-1:/opt/airflow/logs$ cat "/opt/airflow/logs/dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=1.log"
   {"timestamp":"2026-02-06T09:37:41.890482Z","level":"warning","event":"pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html. The pkg_resources package is slated for removal as early as 2025-11-30. Refrain from using this package or pin to Setuptools<81.","category":"UserWarning","filename":"/home/airflow/.local/lib/python3.11/site-packages/inspire_schemas/utils.py","lineno":50,"logger":"py.warnings"}
   {"timestamp":"2026-02-06T09:37:42.406828Z","level":"info","event":"DAG bundles loaded: dags-folder","logger":"airflow.dag_processing.bundles.manager.DagBundlesManager","filename":"manager.py","lineno":179}
   {"timestamp":"2026-02-06T09:37:42.407277Z","level":"info","event":"Filling up the DagBag from /opt/airflow/dags/literature/hep_create_dag.py","logger":"airflow.models.dagbag.DagBag","filename":"dagbag.py","lineno":593}
   {"timestamp":"2026-02-06T09:37:46.352446Z","level":"info","event":"AWS Connection (conn_id='s3_conn', conn_type='aws') credentials retrieved from login and password.","logger":"airflow.providers.amazon.aws.utils.connection_wrapper.AwsConnectionWrapper","filename":"connection_wrapper.py","lineno":331}
   {"timestamp":"2026-02-06T09:37:48.926710Z","level":"info","event":"Done. Returned value was: None","logger":"airflow.task.operators.airflow.providers.standard.decorators.python._PythonDecoratedOperator","filename":"python.py","lineno":217}
   {"timestamp":"2026-02-06T09:37:48.982790Z","level":"error","event":"Top level error","logger":"task","filename":"task_runner.py","lineno":1482,"error_detail":[{"exc_type":"AirflowRuntimeError","exc_value":"API_SERVER_ERROR: {'status_code': 409, 'message': 'Server returned error', 'detail': {'detail': {'reason': 'invalid_state', 'message': 'TI was not in the running state so it cannot be updated', 'previous_state': 'failed'}}}","exc_notes":[],"syntax_error":null,"is_cause":false,"frames":[{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1475,"name":"main"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/task_runner.py","lineno":1013,"name":"run"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":207,"name":"send"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":271,"name":"_get_response"},{"filename":"/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/comms.py","lineno":258,"name":"_from_frame"}],"is_group":false,"exceptions":[]}]}
   {"timestamp":"2026-02-06T09:37:49.121613Z","level":"warning","event":"Process exited abnormally","exit_code":1,"logger":"task"}
   ```
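
   The interesting part is buried in the `error_detail` of the `Top level error` record: a 409 with `reason: invalid_state` because the TI was already in state `failed`. For readability, a small throwaway script along these lines (hypothetical, not part of Airflow) can flatten such structured log lines:

   ```
   import json
   import sys

   # Hypothetical helper: flatten Airflow's structured JSON task log lines so
   # nested fields such as error_detail become easy to scan.
   with open(sys.argv[1]) as f:          # e.g. ".../attempt=1.log"
       for line in f:
           line = line.strip()
           if not line:
               continue
           record = json.loads(line)
           print(record["timestamp"], record["level"].upper(), record["event"])
           for err in record.get("error_detail", []):
               print("   ", err["exc_type"], "->", err["exc_value"])
   ```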
   
   Only by checking the full worker logs can I see that the process was indeed killed due to OOM:
   ```
   2026-02-06T09:37:41.897416Z [info     ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] received [celery.worker.strategy] loc=strategy.py:161
   2026-02-06T09:37:41.919012Z [info     ] [c20747ad-5381-4fea-8395-c3aa6a9e92cf] Executing workload in Celery: token='eyJ***' ti=TaskInstance(id=UUID('019c30bc-6c04-7dba-8588-3c592e04a708'), dag_version_id=UUID('019c299c-4324-7618-8b47-ed60ca990ba7'), task_id='halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories', dag_id='hep_create_dag', run_id='579c45ef-2c6d-471f-a893-1fd4cc26fbb2', try_number=2, map_index=-1, pool_slots=1, queue='default', priority_weight=32, executor_config=None, parent_context_carrier={}, context_carrier={}) dag_rel_path=PurePosixPath('literature/hep_create_dag.py') bundle_info=BundleInfo(name='dags-folder', version=None) log_path='dag_id=hep_create_dag/run_id=579c45ef-2c6d-471f-a893-1fd4cc26fbb2/task_id=halt_for_approval_if_new_or_reject_if_not_relevant.update_inspire_categories/attempt=2.log' type='ExecuteTask' [airflow.providers.celery.executors.celery_executor_utils] loc=celery_executor_utils.py:156
   2026-02-06T09:37:41.952184Z [info     ] Secrets backends loaded for worker [supervisor] backend_classes=['EnvironmentVariablesBackend'] count=1 loc=supervisor.py:1975
   2026-02-06T09:37:42.004915Z [info     ] Process exited                 [supervisor] exit_code=-9 loc=supervisor.py:710 pid=836631 signal_sent=SIGKILL
   2026-02-06T09:37:42.017909Z [error    ] Task execute_workload[c20747ad-5381-4fea-8395-c3aa6a9e92cf] raised unexpected: ServerResponseError('Server returned error') [celery.app.trace] loc=trace.py:285
   Traceback (most recent call last):
     File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 479, in trace_task
       R = retval = fun(*args, **kwargs)
                    ^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/celery/app/trace.py", line 779, in __protected_call__
       return self.run(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/providers/celery/executors/celery_executor_utils.py", line 164, in execute_workload
       supervise(
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 1984, in supervise
       process = ActivitySubprocess.start(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 955, in start
       proc._on_child_started(ti=what, dag_rel_path=dag_rel_path, bundle_info=bundle_info)
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/execution_time/supervisor.py", line 966, in _on_child_started
       ti_context = self.client.task_instances.start(ti.id, self.pid, start_date)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 215, in start
       resp = self.client.patch(f"task-instances/{id}/run", content=body.model_dump_json())
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 1218, in patch
       return self.request(
              ^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 338, in wrapped_f
       return copy(f, *args, **kw)
              ^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 477, in __call__
       do = self.iter(retry_state=retry_state)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 378, in iter
       result = action(retry_state)
                ^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 400, in <lambda>
       self._add_action_func(lambda rs: rs.outcome.result())
                                        ^^^^^^^^^^^^^^^^^^^
     File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 449, in result
       return self.__get_result()
              ^^^^^^^^^^^^^^^^^^^
     File "/usr/python/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
       raise self._exception
     File "/home/airflow/.local/lib/python3.11/site-packages/tenacity/__init__.py", line 480, in __call__
       result = fn(*args, **kwargs)
                ^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 885, in request
       return super().request(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 825, in request
       return self.send(request, auth=auth, follow_redirects=follow_redirects)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 914, in send
       response = self._send_handling_auth(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 942, in _send_handling_auth
       response = self._send_handling_redirects(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 999, in _send_handling_redirects
       raise exc
     File "/home/airflow/.local/lib/python3.11/site-packages/httpx/_client.py", line 982, in _send_handling_redirects
       hook(response)
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 186, in raise_on_4xx_5xx_with_note
       return get_json_error(response) or response.raise_for_status()
              ^^^^^^^^^^^^^^^^^^^^^^^^
     File "/home/airflow/.local/lib/python3.11/site-packages/airflow/sdk/api/client.py", line 176, in get_json_error
       raise err
   airflow.sdk.api.client.ServerResponseError: Server returned error
   Correlation-id=019c3250-7e4c-7c55-b16a-2d607aceb0ef
   ```
   
   ### What you think should happen instead?
   
   A CRITICAL error should be displayed in the UI specifying that the process died due to OOM. I've seen this being shown sometimes ([I see it implemented here](https://github.com/apache/airflow/blob/58a28167bb7c511ead89370bbd7ff2f70d86ef58/task-sdk/src/airflow/sdk/execution_time/supervisor.py#L822-L823)), but there are cases where this code is not reached, as shown above, even though the worker logs clearly show that a SIGKILL happened with `exit_code=-9`.
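
   Roughly what I would expect is sketched below (this is only an illustration of the behaviour I'd like, not the actual supervisor code): the OOM hint should reach the task log whenever the child exits from SIGKILL, independent of whether a later task-instance API call fails with a 409:

   ```
   import signal

   # Sketch only -- not the real supervisor implementation. The point is that
   # the "likely OOM-killed" message should be written to the *task* log for
   # any SIGKILL exit, even if subsequent API calls fail.
   def report_child_exit(log, exit_code: int) -> None:
       if exit_code < 0 and -exit_code == signal.SIGKILL:
           log.critical(
               "Task process exited with exit_code=%s (SIGKILL); it was likely "
               "killed by the kernel OOM killer or an external process.",
               exit_code,
           )
       elif exit_code != 0:
           log.error("Task process exited abnormally with exit_code=%s", exit_code)
   ```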
   
   ### How to reproduce
   
   Have many small tasks running on a Celery worker that doesn't have enough memory to process them all.
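
   A hypothetical reproduction DAG (names and sizes made up; Airflow 3 Task SDK imports assumed) that allocates enough memory across mapped tasks to get the worker process OOM-killed:

   ```
   from airflow.sdk import dag, task


   @dag(schedule=None, catchup=False)
   def oom_repro():
       # Each mapped task holds a large chunk of memory so that several of them
       # running concurrently on a memory-constrained Celery worker exceed its
       # limit and get SIGKILLed by the kernel OOM killer (exit_code=-9).
       @task
       def eat_memory(chunk_mb: int) -> int:
           blob = bytearray(chunk_mb * 1024 * 1024)  # allocate ~chunk_mb MB
           return len(blob)

       eat_memory.expand(chunk_mb=[512] * 20)


   oom_repro()
   ```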
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   Helm chart deployed on Kubernetes.
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

