GitHub user em-eman created a discussion: Intermittent Airflow 3.1.2 API Timeouts Under High Load Causing Task Failures (KubernetesExecutor)
### Apache Airflow version

Other Airflow 2/3 version (please specify below)

### If "Other Airflow 2/3 version" selected, which one?

3.1.2

### What happened?

After upgrading from Airflow 2.11 to Airflow 3.1.2 (the same issue occurred on every 3.x version we tried), we are experiencing intermittent task failures when the system is under high load. The failures are caused by timeouts and name-resolution errors against the Airflow execution API, so tasks fail during log setup (`_remote_logging_conn` → `client.connections.get()`). When load is low, or when we re-run the same DAGs individually, everything succeeds; the problem appears only when many DAGs run concurrently.

We are running Airflow on Kubernetes with the KubernetesExecutor.

### Cluster Setup

- Scheduler: 2 replicas
- Webserver / API: 2 replicas
- DAG Processor: 1 replica
- Workers: KubernetesExecutor (pods)

Worker pods fail early in task execution with repeated retries from the Airflow SDK client, eventually raising:

`httpx.ConnectError: [Errno -3] Temporary failure in name resolution`

The worker attempts to contact the Airflow webserver API at `http://airflow-web.ws-nav-8662-pr.svc.cluster.local/execution/`. During high DAG concurrency, the request repeatedly times out, then fails.

```
[2025-11-18T09:17:52.005158Z] {{configuration.py:871}} DEBUG - Could not retrieve value from section database, for key sql_alchemy_engine_args. Skipping redaction of this conf.
[2025-11-18T09:17:52.005754Z] {{configuration.py:871}} DEBUG - Could not retrieve value from section database, for key sql_alchemy_conn_async. Skipping redaction of this conf.
{"timestamp":"2025-11-18T09:17:52.168665Z","level":"info","event":"Executing workload","workload":"ExecuteTask(token='eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJzdWIiOiIwMTlhOTYzZi1lYjY1LTc0NGUtYTgzYi1kZDQ2ZTE4MjI5NDEiLCJqdGkiOiI4MTFhOTcyNzI0MzE0YjIzODRlYzQ3MTI5MjUxOTUxNSIsImlzcyI6Imh0dHBzOi8vYWlyZmxvdy13cy1uYXYtODY2Mi1wci53cy5uYXZpZ2FuY2UuY29tIiwiYXVkIjoidXJuOmFpcmZsb3cuYXBhY2hlLm9yZzp0YXNrIiwibmJmIjoxNzYzNDU3MzYzLCJleHAiOjE3NjM0NTc5NjMsImlhdCI6MTc2MzQ1NzM2M30.OB37xF8UgEjLB4FeDu6nno0RnmUOx8GWcf4Pvmj-N5Q', ti=TaskInstance(id=UUID('019a963f-eb65-744e-a83b-dd46e1822941'), dag_version_id=UUID('019a9525-3bf5-7c2f-9f58-4360d263ee9f'), task_id='unique_id_generator', dag_id='today_data_generator_sp_demo_c2_tailend_v2', run_id='manual__2025-11-18T09:16:01+00:00', try_number=1, map_index=-1, pool_slots=1, queue='kubernetes', priority_weight=5, executor_config=None, parent_context_carrier={}, context_carrier={}), dag_rel_path=PurePosixPath('plant/today_data_generator_v2.py'), bundle_info=BundleInfo(name='dags-folder', version=None), log_path='dag_id=today_data_generator_sp_demo_c2_tailend_v2/run_id=manual__2025-11-18T09:16:01+00:00/task_id=unique_id_generator/attempt=1.log', type='ExecuteTask')","logger":"__main__","filename":"execute_workload.py","lineno":56}
{"timestamp":"2025-11-18T09:17:52.169263Z","level":"info","event":"Connecting to server:","server":"http://airflow-web.ws-nav-8662-pr.svc.cluster.local/execution/","logger":"__main__","filename":"execute_workload.py","lineno":64}
{"timestamp":"2025-11-18T09:17:52.221962Z","level":"debug","event":"Connecting to execution API server","server":"http://airflow-web.ws-nav-8662-pr.svc.cluster.local/execution/","logger":"supervisor","filename":"supervisor.py","lineno":1920}
{"timestamp":"2025-11-18T09:18:12.240534Z","level":"warning","event":"Starting call to 'airflow.sdk.api.client.Client.request', this is the 1st time calling it.","logger":"airflow.sdk.api.client","filename":"before.py","lineno":42}
{"timestamp":"2025-11-18T09:18:32.959146Z","level":"warning","event":"Starting call to 'airflow.sdk.api.client.Client.request', this is the 2nd time calling it.","logger":"airflow.sdk.api.client","filename":"before.py","lineno":42}
{"timestamp":"2025-11-18T09:18:54.560692Z","level":"warning","event":"Starting call to 'airflow.sdk.api.client.Client.request', this is the 3rd time calling it.","logger":"airflow.sdk.api.client","filename":"before.py","lineno":42}
{"timestamp":"2025-11-18T09:19:16.095223Z","level":"warning","event":"Starting call to 'airflow.sdk.api.client.Client.request', this is the 4th time calling it.","logger":"airflow.sdk.api.client","filename":"before.py","lineno":42}
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 101, in map_httpcore_exceptions
    yield
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 250, in handle_request
    resp = self._pool.handle_request(req)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 256, in handle_request
    raise exc from None
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection_pool.py", line 236, in handle_request
    response = connection.handle_request(
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 101, in handle_request
    raise exc
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 78, in handle_request
    stream = self._connect(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_sync/connection.py", line 124, in _connect
    stream = self._network_backend.connect_tcp(**kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_backends/sync.py", line 207, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/python/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno -3] Temporary failure in name resolution

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/python/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/python/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/execute_workload.py", line 125, in <module>
    main()
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/execute_workload.py", line 121, in main
    execute_workload(workload)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/execute_workload.py", line 66, in execute_workload
    supervise(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/supervisor.py", line 1928, in supervise
    logger, log_file_descriptor = _configure_logging(log_path, client)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/supervisor.py", line 1843, in _configure_logging
    with _remote_logging_conn(client):
  File "/usr/python/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/supervisor.py", line 898, in _remote_logging_conn
    conn = _fetch_remote_logging_conn(conn_id, client)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/execution_time/supervisor.py", line 862, in _fetch_remote_logging_conn
    conn = client.connections.get(conn_id)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/api/client.py", line 361, in get
    resp = self.client.get(f"connections/{conn_id}")
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 1053, in get
    return self.request(
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 336, in wrapped_f
    return copy(f, *args, **kw)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 475, in __call__
    do = self.iter(retry_state=retry_state)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 376, in iter
    result = action(retry_state)
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 418, in exc_check
    raise retry_exc.reraise()
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 185, in reraise
    raise self.last_attempt.result()
  File "/usr/python/lib/python3.10/concurrent/futures/_base.py", line 451, in result
    return self.__get_result()
  File "/usr/python/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
  File "/home/airflow/.local/lib/python3.10/site-packages/tenacity/__init__.py", line 478, in __call__
    result = fn(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/sdk/api/client.py", line 885, in request
    return super().request(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 825, in request
    return self.send(request, auth=auth, follow_redirects=follow_redirects)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 914, in send
    response = self._send_handling_auth(
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 942, in _send_handling_auth
    response = self._send_handling_redirects(
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 979, in _send_handling_redirects
    response = self._send_single_request(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_client.py", line 1014, in _send_single_request
    response = transport.handle_request(request)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 249, in handle_request
    with map_httpcore_exceptions():
  File "/usr/python/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/airflow/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 118, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno -3] Temporary failure in name resolution
```

Airflow config (note that `AIRFLOW__WORKERS__MAX_FAILED_HEARTBEATS` is set twice in our manifest):

```
AIRFLOW__SDK__CONNECTION_URL: "http://airflow-web.${NEW_NS}.svc.cluster.local/execution/"
AIRFLOW__CORE__EXECUTION_API_SERVER_URL: "http://airflow-web.${NEW_NS}.svc.cluster.local/execution/"
AIRFLOW__SCHEDULER__TASK_INSTANCE_HEARTBEAT_SEC: "300"
AIRFLOW__SCHEDULER__TASK_INSTANCE_HEARTBEAT_TIMEOUT: "300"
AIRFLOW__SCHEDULER__TASK_INSTANCE_HEARTBEAT_TIMEOUT_DETECTION_INTERVAL: "5"
AIRFLOW__DAG_PROCESSOR__DAG_FILE_PROCESSOR_TIMEOUT: "600"
AIRFLOW__WORKERS__MAX_FAILED_HEARTBEATS: "10"
AIRFLOW__WORKERS__MIN_HEARTBEAT_INTERVAL: "120"
AIRFLOW__SCHEDULER__SCHEDULER_HEALTH_CHECK_THRESHOLD: "120"
AIRFLOW__SCHEDULER__SCHEDULER_HEARTBEAT_SEC: "30"
AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT: "1500"
AIRFLOW__DAG_PROCESSOR__REFRESH_INTERVAL: "60"
AIRFLOW__API__LOG_FETCH_TIMEOUT_SEC: "30"
AIRFLOW__WEBSERVER__LOG_FETCH_DELAY_SEC: "10"
AIRFLOW__WORKERS__EXECUTION_API_TIMEOUT: "15.0"
AIRFLOW__WORKERS__MAX_FAILED_HEARTBEATS: "5"
AIRFLOW__CORE__DEFAULT_TASK_RETRY_DELAY: "30"
AIRFLOW__CORE__PARALLELISM: "50"
```

### What you think should happen instead?

Tasks should run as expected, without timeout errors.

### How to reproduce

When more than about 10 task pods are created concurrently, a few of them fail with the error message above.

### Operating System

Linux

### Versions of Apache Airflow Providers

_No response_

### Deployment

Other 3rd-party Helm chart

### Deployment details

Helm chart on an EKS cluster

### Anything else?

_No response_

### Are you willing to submit PR?

- [ ] Yes I am willing to submit a PR!
### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

GitHub link: https://github.com/apache/airflow/discussions/58453
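A note for anyone debugging the same symptom: `[Errno -3]` is `EAI_AGAIN`, a *transient* DNS failure, which usually means CoreDNS/kube-dns dropped or delayed the lookup under load rather than the Service being absent. A minimal stdlib-only probe like the sketch below (the host name is a placeholder, not taken from this deployment) can be run inside a worker image to check whether resolution of the execution API host degrades while many pods are starting:

```python
# Hypothetical diagnostic sketch: measure DNS resolution of the execution API
# host from inside a worker pod. A mix of successes and gaierror failures
# under load points at DNS pressure (CoreDNS/conntrack), not at Airflow.
import socket
import time


def probe_dns(host: str, attempts: int = 5, delay: float = 1.0) -> list:
    """Return per-attempt resolution times in seconds; None marks a failure."""
    results = []
    for _ in range(attempts):
        start = time.monotonic()
        try:
            socket.getaddrinfo(host, 80)  # same syscall path httpx uses to connect
            results.append(time.monotonic() - start)
        except socket.gaierror:
            results.append(None)  # transient failure, matches [Errno -3] EAI_AGAIN
        time.sleep(delay)
    return results


if __name__ == "__main__":
    # Placeholder Service name; substitute your in-cluster execution API host.
    print(probe_dns("airflow-web.example-namespace.svc.cluster.local"))
```

If the probe shows intermittent `None` results only while many pods start, common mitigations are reducing DNS query fan-out (e.g. a fully qualified name with a trailing dot to avoid `ndots` search-suffix expansion) or deploying NodeLocal DNSCache.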
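For context on the log output: the repeated `Starting call to 'airflow.sdk.api.client.Client.request', this is the Nth time calling it` warnings come from the SDK client's tenacity retry wrapper, which re-raises the last exception once its attempts are exhausted. Conceptually the behavior is like this stdlib-only sketch (illustrative only, not the actual SDK code; the function name and defaults are assumptions):

```python
import time


def request_with_retries(fn, max_attempts: int = 4, base_delay: float = 1.0):
    """Call fn(), retrying connection errors with exponential backoff.

    After max_attempts failures the last exception is re-raised, which mirrors
    what the log shows: the 4th attempt's ConnectError propagates up through
    log setup and fails the task.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the original error
            time.sleep(base_delay * 2 ** (attempt - 1))
```

Because each failed attempt opens a fresh connection (and hence a fresh DNS lookup), a longer backoff between attempts gives transient DNS pressure more time to clear before the task is declared failed.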
