edikmkoyan opened a new issue #14175:
URL: https://github.com/apache/airflow/issues/14175


   
   
   I have Airflow v2.0.0 deployed on AKS with the Kubernetes Executor enabled, 
and the KubernetesJobWatcher is failing periodically. 
   
   
   
   
   
   **Apache Airflow version**: 2.0.0
   
   
   **Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
   % kc version
   Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0", 
GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean", 
BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc", 
Platform:"darwin/amd64"}
   Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6", 
GitCommit:"1994a5495a40a663921c5ecfee7dd9a8c61704fa", GitTreeState:"clean", 
BuildDate:"2020-07-23T22:06:44Z", GoVersion:"go1.13.6", Compiler:"gc", 
Platform:"linux/amd64"}
   
   **Environment**:
   
   - **Cloud provider or hardware configuration**: AKS
   - **OS** (e.g. from /etc/os-release): 
   - **Kernel** (e.g. `uname -a`): 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec 
 2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64 x86_64
   - **Install tools**: 
   - **Others**:
   
   **What happened**:
   
   ```
   [2021-02-10 15:33:34,756] {kubernetes_executor.py:111} ERROR - Unknown error 
in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
436, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
763, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 145, in _run
       for event in list_worker_pods():
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 144, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, 
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   Process KubernetesJobWatcher-3:
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
436, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
763, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in 
_bootstrap
       self.run()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 145, in _run
       for event in list_worker_pods():
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 144, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, 
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   [2021-02-10 15:33:35,022] {kubernetes_executor.py:266} ERROR - Error while 
health checking kube watcher process. Process died for unknown reasons
   [2021-02-10 15:37:58,640] {kubernetes_executor.py:111} ERROR - Unknown error 
in KubernetesJobWatcher. Failing
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
436, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
763, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 145, in _run
       for event in list_worker_pods():
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 144, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, 
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   Process KubernetesJobWatcher-5:
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 313, in recv_into
       return self.connection.recv_into(*args, **kwargs)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1840, in recv_into
       self._raise_ssl_error(self._ssl, result)
     File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py", 
line 1663, in _raise_ssl_error
       raise SysCallError(errno, errorcode.get(errno))
   OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
436, in _error_catcher
       yield
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
763, in read_chunked
       self._update_chunk_length()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
693, in _update_chunk_length
       line = self._fp.fp.readline()
     File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
       return self._sock.recv_into(b)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
 line 318, in recv_into
       raise SocketError(str(e))
   OSError: (104, 'ECONNRESET')
   
   During handling of the above exception, another exception occurred:
   
   Traceback (most recent call last):
     File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in 
_bootstrap
       self.run()
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 103, in run
       kube_client, self.resource_version, self.scheduler_job_id, 
self.kube_config
     File 
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
 line 145, in _run
       for event in list_worker_pods():
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 144, in stream
       for line in iter_resp_lines(resp):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py", 
line 46, in iter_resp_lines
       for seg in resp.read_chunked(decode_content=False):
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
792, in read_chunked
       self._original_response.close()
     File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
       self.gen.throw(type, value, traceback)
     File 
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line 
454, in _error_catcher
       raise ProtocolError("Connection broken: %r" % e, e)
   urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104, 
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
   [2021-02-10 15:37:59,446] {kubernetes_executor.py:266} ERROR - Error while 
health checking kube watcher process. Process died for unknown reasons
   ```
   
   
   The scheduler pods are being recreated. Running 
`kc logs pod/airflow2-scheduler-84df66d96f-vphtw scheduler` shows the messages above.
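
For illustration only, here is a minimal sketch of how a watch loop could tolerate a reset connection instead of letting the watcher process die. This is not Airflow's code or its fix. `watch_with_retry`, `open_stream`, `handle_event`, and the `resource_version` event key are hypothetical names chosen for this sketch. In a real deployment, `retry_on` would presumably include `urllib3.exceptions.ProtocolError`, the exception shown in the traceback above; the default here uses the built-in `ConnectionError` so the sketch runs without extra packages.

```python
import time
from typing import Callable, Iterable, Optional, Tuple, Type


def watch_with_retry(
    open_stream: Callable[[Optional[str]], Iterable[dict]],
    handle_event: Callable[[dict], None],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_retries: int = 5,
    backoff_seconds: float = 1.0,
) -> None:
    """Consume a watch stream, reopening it when the connection breaks.

    open_stream is a hypothetical factory: given the last seen resource
    version (or None), it returns an iterable of event dicts.
    """
    resource_version: Optional[str] = None
    failures = 0
    while True:
        try:
            for event in open_stream(resource_version):
                handle_event(event)
                # Remember where we are so a reopened stream can resume.
                resource_version = event.get("resource_version", resource_version)
                # A delivered event means the connection is healthy again.
                failures = 0
            return  # The stream ended normally.
        except retry_on:
            failures += 1
            if failures > max_retries:
                raise  # Give up after too many consecutive failures.
            # Linear backoff before reopening the stream.
            time.sleep(backoff_seconds * failures)
```

With something like this, an `ECONNRESET` about twice in 30 minutes would cause a short pause and a reconnect rather than a failed watcher, while a persistently broken connection would still surface as an error after `max_retries` consecutive failures.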
   
   
   
   
   
   
   How often does this problem occur? Once? Every time etc?
   
   About 2 times in 30 minutes
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

