mrpowerus edited a comment on issue #14974:
URL: https://github.com/apache/airflow/issues/14974#issuecomment-810187906
After debugging the TCP/IP connections, I found that the connection to the
Kube API was reset after a few minutes of complete inactivity on
`kubernetes.Watcher.stream()`. However, the watcher does not notice the reset:
for some (unknown) reason it keeps listening as if the connection were still
fine, and no error is raised.
This would also explain why log lines of the form `Event: ......` stopped
showing up at some point.
The fix seems to be to restart `watcher.stream` periodically by passing the
`timeout_seconds` argument. The watch then ends after that many seconds and a
new one is opened, so the watcher never sits on a dead connection for long.
My previous comment about the `ProtocolError` was not correct: the
KubernetesWatcher process did not raise an exception. (I only assumed so
because the error appeared when I was testing my code locally.)
This patch seems to solve the problem:
```
--- kubernetes_executor.py 2021-03-30 13:40:10.957157100 +0200
+++ kubernetes_executor.py 2021-03-30 13:45:13.836000000 +0200
@@ -142,7 +142,7 @@
         list_worker_pods = functools.partial(
             watcher.stream, kube_client.list_namespaced_pod, self.namespace, **kwargs
         )
-        for event in list_worker_pods():
+        for event in list_worker_pods(timeout_seconds=60):
             task = event['object']
             self.log.info('Event: %s had an event of type %s', task.metadata.name, event['type'])
             if event['type'] == 'ERROR':
```
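For anyone who wants to see the idea outside of the executor, here is a minimal sketch of the restart pattern the patch relies on. It is not the Airflow code: `watch_with_restarts`, `fake_stream` and `max_rounds` are hypothetical names made up for illustration, and `fake_stream` only stands in for a stream that ends once its timeout elapses. The point is that the outer loop opens a fresh stream every time the previous one ends.

```python
from typing import Any, Callable, Dict, Iterable, List, Optional

Event = Dict[str, Any]


def watch_with_restarts(
    stream: Callable[..., Iterable[Event]],
    handle_event: Callable[[Event], None],
    timeout_seconds: int = 60,
    max_rounds: Optional[int] = None,
) -> int:
    """Consume `stream` repeatedly, re-opening it each time it ends.

    Returns the number of completed rounds. `max_rounds` exists only so the
    sketch can terminate; a long-running watcher would pass None.
    """
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        # Each call opens a new stream. Passing timeout_seconds bounds how
        # long a single stream lasts, so a silently dead connection is
        # replaced on the next round instead of being listened to forever.
        for event in stream(timeout_seconds=timeout_seconds):
            handle_event(event)
        rounds += 1
    return rounds


def fake_stream(timeout_seconds: int) -> Iterable[Event]:
    """Hypothetical stand-in for a watch stream: two events, then it ends."""
    yield {"type": "ADDED", "object": "pod-a", "timeout": timeout_seconds}
    yield {"type": "MODIFIED", "object": "pod-a", "timeout": timeout_seconds}


seen: List[Event] = []
rounds = watch_with_restarts(fake_stream, seen.append, timeout_seconds=60, max_rounds=3)
```

With the real client, `stream` would presumably be the same `functools.partial(watcher.stream, kube_client.list_namespaced_pod, self.namespace, **kwargs)` used in the patch. The sketch leaves out resource version tracking and error handling, which a real watcher would still need.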