edikmkoyan opened a new issue #14175:
URL: https://github.com/apache/airflow/issues/14175
I have Airflow v2.0.0 deployed on AKS with the Kubernetes Executor enabled,
and the KubernetesJobWatcher fails periodically.
**Apache Airflow version**: 2.0.0
**Kubernetes version (if you are using kubernetes)** (use `kubectl version`):
% kc version
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.0",
GitCommit:"af46c47ce925f4c4ad5cc8d1fca46c7b77d13b38", GitTreeState:"clean",
BuildDate:"2020-12-08T17:59:43Z", GoVersion:"go1.15.5", Compiler:"gc",
Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"18", GitVersion:"v1.18.6",
GitCommit:"1994a5495a40a663921c5ecfee7dd9a8c61704fa", GitTreeState:"clean",
BuildDate:"2020-07-23T22:06:44Z", GoVersion:"go1.13.6", Compiler:"gc",
Platform:"linux/amd64"}
**Environment**:
- **Cloud provider or hardware configuration**: AKS
- **OS** (e.g. from /etc/os-release):
- **Kernel** (e.g. `uname -a`): 20.2.0 Darwin Kernel Version 20.2.0: Wed Dec
2 20:39:59 PST 2020; root:xnu-7195.60.75~1/RELEASE_X86_64 x86_64
- **Install tools**:
- **Others**:
**What happened**:
```
[2021-02-10 15:33:34,756] {kubernetes_executor.py:111} ERROR - Unknown error
in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
436, in _error_catcher
yield
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
763, in read_chunked
self._update_chunk_length()
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 103, in run
kube_client, self.resource_version, self.scheduler_job_id,
self.kube_config
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 145, in _run
for event in list_worker_pods():
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 144, in stream
for line in iter_resp_lines(resp):
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104,
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
Process KubernetesJobWatcher-3:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
436, in _error_catcher
yield
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
763, in read_chunked
self._update_chunk_length()
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 103, in run
kube_client, self.resource_version, self.scheduler_job_id,
self.kube_config
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 145, in _run
for event in list_worker_pods():
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 144, in stream
for line in iter_resp_lines(resp):
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104,
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
[2021-02-10 15:33:35,022] {kubernetes_executor.py:266} ERROR - Error while
health checking kube watcher process. Process died for unknown reasons
[2021-02-10 15:37:58,640] {kubernetes_executor.py:111} ERROR - Unknown error
in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
436, in _error_catcher
yield
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
763, in read_chunked
self._update_chunk_length()
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 103, in run
kube_client, self.resource_version, self.scheduler_job_id,
self.kube_config
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 145, in _run
for event in list_worker_pods():
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 144, in stream
for line in iter_resp_lines(resp):
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104,
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
Process KubernetesJobWatcher-5:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 313, in recv_into
return self.connection.recv_into(*args, **kwargs)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1840, in recv_into
self._raise_ssl_error(self._ssl, result)
File "/home/airflow/.local/lib/python3.6/site-packages/OpenSSL/SSL.py",
line 1663, in _raise_ssl_error
raise SysCallError(errno, errorcode.get(errno))
OpenSSL.SSL.SysCallError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
436, in _error_catcher
yield
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
763, in read_chunked
self._update_chunk_length()
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
693, in _update_chunk_length
line = self._fp.fp.readline()
File "/usr/local/lib/python3.6/socket.py", line 586, in readinto
return self._sock.recv_into(b)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/contrib/pyopenssl.py",
line 318, in recv_into
raise SocketError(str(e))
OSError: (104, 'ECONNRESET')
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/multiprocessing/process.py", line 258, in
_bootstrap
self.run()
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 103, in run
kube_client, self.resource_version, self.scheduler_job_id,
self.kube_config
File
"/home/airflow/.local/lib/python3.6/site-packages/airflow/executors/kubernetes_executor.py",
line 145, in _run
for event in list_worker_pods():
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 144, in stream
for line in iter_resp_lines(resp):
File
"/home/airflow/.local/lib/python3.6/site-packages/kubernetes/watch/watch.py",
line 46, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
792, in read_chunked
self._original_response.close()
File "/usr/local/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File
"/home/airflow/.local/lib/python3.6/site-packages/urllib3/response.py", line
454, in _error_catcher
raise ProtocolError("Connection broken: %r" % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: OSError("(104,
\'ECONNRESET\')",)', OSError("(104, 'ECONNRESET')",))
[2021-02-10 15:37:59,446] {kubernetes_executor.py:266} ERROR - Error while
health checking kube watcher process. Process died for unknown reasons
```
The scheduler pods are being recreated. Running
`kc logs pod/airflow2-scheduler-84df66d96f-vphtw scheduler` shows the messages above.
How often does this problem occur? Once? Every time etc?
About 2 times in 30 minutes
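
For illustration only: in the traceback above, a long-lived watch stream is reset by the peer (`ECONNRESET`), the exception propagates out of `_run`, and the watcher process dies. The sketch below contrasts that with a loop that reopens the stream when a connection error occurs. It is a minimal, hypothetical Python example using only the standard library. `resilient_stream` and `make_fake_watch` are invented names, the fake stream stands in for the Kubernetes watch, and this is not Airflow's actual code or fix. A real watcher would presumably also need to resume from the last seen resource version, which is omitted here.

```python
from typing import Callable, Iterable, Iterator, Tuple, Type


def resilient_stream(
    open_stream: Callable[[], Iterable[str]],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    max_restarts: int = 3,
) -> Iterator[str]:
    """Yield events from open_stream(), reopening it after a retryable error.

    Hypothetical helper: gives up and re-raises once max_restarts is exceeded.
    """
    restarts = 0
    while True:
        try:
            for event in open_stream():
                yield event
            return  # the stream ended normally
        except retry_on:
            restarts += 1
            if restarts > max_restarts:
                raise  # too many resets: let the caller see the error


def make_fake_watch(fail_times: int) -> Callable[[], Iterator[str]]:
    """Build a stand-in stream that is reset by the peer `fail_times` times."""
    calls = {"n": 0}

    def open_stream() -> Iterator[str]:
        calls["n"] += 1
        yield f"event-{calls['n']}"
        if calls["n"] <= fail_times:
            # Simulate the reset seen in the logs (errno 104).
            raise ConnectionResetError(104, "ECONNRESET")

    return open_stream


events = list(resilient_stream(make_fake_watch(2)))
print(events)
```

Note that the exception in the logs is `urllib3.exceptions.ProtocolError`, which is not a subclass of `ConnectionError`. With a real client, the relevant exception type would have to be passed via `retry_on`.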