dadonnelly316 opened a new issue, #28836:
URL: https://github.com/apache/airflow/issues/28836

   ### Apache Airflow version
   
   2.5.0
   
   ### What happened
   
   The Airflow scheduler calls the Kubernetes API to create a pod for a task 
run, but the API returns an HTTP response code of 400 or higher. All subsequent 
Airflow tasks then stay stuck in the "queued" or "scheduled" state. The scheduler 
must be restarted before tasks enter the running state. 
   
   
   This is similar to #28328, but we are not seeing the ConnectionResetError 
exception when calling Executor.end.
   
   
   ```
   airflow-scheduler Exception when attempting to create Namespaced Pod
   airflow-scheduler  Traceback (most recent call last):
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 269, in run_pod_async
       resp = self.kube_client.create_namespaced_pod(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", 
line 7356, in create_namespaced_pod
       return self.create_namespaced_pod_with_http_info(namespace, body, 
**kwargs)  # noqa: E501
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", 
line 7455, in create_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
348, in call_api
       return self.__call_api(resource_path, method,
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
180, in __call_api
       response_data = self.request(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
391, in request
       return self.rest_client.POST(url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", 
line 275, in POST
       return self.request("POST", url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", 
line 234, in request
       raise ApiException(http_resp=r)
   kubernetes.client.exceptions.ApiException: (500)
   airflow-scheduler Reason: Internal Server Error
   airflow-scheduler  urllib3.exceptions.ProtocolError: ('Connection aborted.', 
RemoteDisconnected('Remote end closed connection without response'))
   Exception when executing SchedulerJob._run_scheduler_loop
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 703, in urlopen
       httplib_response = self._make_request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 449, in _make_request
       six.raise_from(e, None)
     File "<string>", line 3, in raise_from
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 444, in _make_request
       httplib_response = conn.getresponse()
     File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
       response.begin()
     File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
       version, status, reason = self._read_status()
     File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
       raise RemoteDisconnected("Remote end closed connection without"
   airflow-scheduler  http.client.RemoteDisconnected: Remote end closed 
connection without response
   airflow-scheduler  During handling of the above exception, another exception 
occurred:
   airflow-scheduler Traceback (most recent call last):
     File 
"/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 
759, in _execute
       self._run_scheduler_loop()
     File 
"/usr/local/lib/python3.9/site-packages/airflow/jobs/scheduler_job.py", line 
887, in _run_scheduler_loop
       self.executor.heartbeat()
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/base_executor.py", 
line 175, in heartbeat
       self.sync()
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 632, in sync
       self.kube_scheduler.run_next(task)
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 344, in run_next
       self.run_pod_async(pod, **self.kube_config.kube_client_request_args)
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 275, in run_pod_async
       raise e
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 269, in run_pod_async
       resp = self.kube_client.create_namespaced_pod(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", 
line 7356, in create_namespaced_pod
       return self.create_namespaced_pod_with_http_info(namespace, body, 
**kwargs)  # noqa: E501
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api/core_v1_api.py", 
line 7455, in create_namespaced_pod_with_http_info
       return self.api_client.call_api(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
348, in call_api
       return self.__call_api(resource_path, method,
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
180, in __call_api
       response_data = self.request(
     File 
"/usr/local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 
391, in request
       return self.rest_client.POST(url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", 
line 275, in POST
       return self.request("POST", url,
     File "/usr/local/lib/python3.9/site-packages/kubernetes/client/rest.py", 
line 168, in request
       r = self.pool_manager.request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 78, 
in request
       return self.request_encode_body(
     File "/usr/local/lib/python3.9/site-packages/urllib3/request.py", line 
170, in request_encode_body
       return self.urlopen(method, url, **extra_kw)
     File "/usr/local/lib/python3.9/site-packages/urllib3/poolmanager.py", line 
376, in urlopen
       response = conn.urlopen(method, u.request_uri, **kw)
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 787, in urlopen
       retries = retries.increment(
     File "/usr/local/lib/python3.9/site-packages/urllib3/util/retry.py", line 
550, in increment
       raise six.reraise(type(error), error, _stacktrace)
     File "/usr/local/lib/python3.9/site-packages/urllib3/packages/six.py", 
line 769, in reraise
       raise value.with_traceback(tb)
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 703, in urlopen
       httplib_response = self._make_request(
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 449, in _make_request
       six.raise_from(e, None)
     File "<string>", line 3, in raise_from
     File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", 
line 444, in _make_request
       httplib_response = conn.getresponse()
     File "/usr/local/lib/python3.9/http/client.py", line 1377, in getresponse
       response.begin()
     File "/usr/local/lib/python3.9/http/client.py", line 320, in begin
       version, status, reason = self._read_status()
     File "/usr/local/lib/python3.9/http/client.py", line 289, in _read_status
       raise RemoteDisconnected("Remote end closed connection without"
   airflow-scheduler urllib3.exceptions.ProtocolError: ('Connection aborted.', 
RemoteDisconnected('Remote end closed connection without response'))
   airflow-scheduler error  Unknown error in KubernetesJobWatcher. Failing
   airflow-scheduler Traceback (most recent call last):
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 104, in run
       self.resource_version = self._run(
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 166, in _run
       self.process_status(
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 218, in process_status
       self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, 
resource_version))
     File "<string>", line 2, in put
     File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in 
_callmethod
       conn.send((self._id, methodname, args, kwds))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, 
in send
       self._send_bytes(_ForkingPickler.dumps(obj))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, 
in _send_bytes
       self._send(header + buf)
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, 
in _send
       n = write(self._handle, buf)
   airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
   airflow-scheduler Process KubernetesJobWatcher-5:
   airflow-scheduler Traceback (most recent call last):
     File "/usr/local/lib/python3.9/multiprocessing/process.py", line 315, in 
_bootstrap
       self.run()
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 104, in run
       self.resource_version = self._run(
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 166, in _run
       self.process_status(
     File 
"/usr/local/lib/python3.9/site-packages/airflow/executors/kubernetes_executor.py",
 line 218, in process_status
       self.watcher_queue.put((pod_id, namespace, State.FAILED, annotations, 
resource_version))
     File "<string>", line 2, in put
     File "/usr/local/lib/python3.9/multiprocessing/managers.py", line 809, in 
_callmethod
       conn.send((self._id, methodname, args, kwds))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 206, 
in send
       self._send_bytes(_ForkingPickler.dumps(obj))
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 411, 
in _send_bytes
       self._send(header + buf)
     File "/usr/local/lib/python3.9/multiprocessing/connection.py", line 368, 
in _send
       n = write(self._handle, buf)
   airflow-scheduler BrokenPipeError: [Errno 32] Broken pipe
   ```
   
   ### What you think should happen instead
   
   The scheduler should handle ApiException. We have seen this error for 
multiple 4XX and 5XX response codes.
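
As a rough illustration of the kind of handling meant here, the sketch below requeues a task when pod creation raises an API error instead of letting the exception escape the scheduler loop. It is not Airflow's actual code: `drain_once`, `create_pod`, and `max_attempts` are hypothetical names, and the `ApiException` class is a local stand-in that only mirrors the `status` and `reason` attributes of `kubernetes.client.exceptions.ApiException`, so the example runs without a cluster.

```python
import queue
from typing import Callable, List, Tuple


class ApiException(Exception):
    """Local stand-in for kubernetes.client.exceptions.ApiException (illustration only)."""

    def __init__(self, status: int, reason: str = "") -> None:
        super().__init__(f"({status}) {reason}")
        self.status = status
        self.reason = reason


def drain_once(
    task_queue: "queue.Queue[Tuple[str, int]]",
    create_pod: Callable[[str], None],
    max_attempts: int = 3,
) -> Tuple[List[str], List[str]]:
    """Make one pass over the queue, requeueing tasks whose pod creation fails.

    Each queue item is (task_id, attempts_so_far). Hypothetical helper, not an
    Airflow API. Returns (launched, failed) task ids for this pass.
    """
    launched: List[str] = []
    failed: List[str] = []
    # Only look at items present at the start, so requeued tasks wait for the next pass.
    for _ in range(task_queue.qsize()):
        task, attempts = task_queue.get_nowait()
        try:
            create_pod(task)
        except ApiException:
            attempts += 1
            if attempts >= max_attempts:
                # Give up on this task, but keep processing the rest of the queue.
                failed.append(task)
            else:
                task_queue.put((task, attempts))
        else:
            launched.append(task)
    return launched, failed
```

In a real executor, `create_pod` would wrap the call to `create_namespaced_pod`. The traceback above also shows a `urllib3.exceptions.ProtocolError` reaching the scheduler loop, so the `except` clause would probably need to cover that as well. That is an assumption about the fix, not something confirmed here.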
   
   ### How to reproduce
   
   _No response_
   
   ### Operating System
   
   Debian GNU/Linux 11 (bullseye)
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Other
   
   ### Deployment details
   
   Kubernetes deployment
   
   ### Anything else
   
   It's difficult to tell how often this issue occurs since it can go unnoticed 
in a CI environment where the scheduler is often restarted. 
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

