mrpowerus edited a comment on issue #14974:
URL: https://github.com/apache/airflow/issues/14974#issuecomment-810187906


   After debugging the TCP/IP connections, I found that the connection to the 
KubeAPI was reset after a few minutes of complete inactivity on the 
kubernetes.Watcher.stream(). However, for some (unknown) reason the watcher 
seems to think the connection is still fine: it keeps listening, and no error 
appears.
   
   This would also explain why log lines of the form `Event: ......` stopped 
showing up at some point.
   
   The fix seems to be to reset the watcher.stream by adding the 
`timeout_seconds` argument. This ensures that the stream ends and the 
connection is re-established at regular intervals, so it never stays idle long 
enough to be reset unnoticed.
   
   My previous comment about the `ProtocolError` was not correct: the 
KubernetesWatcher process did not raise an exception. (I only assumed so 
because the error appeared when I was testing my code locally.)
   
   This patch seems to solve the problem:
   
   ```
   --- kubernetes_executor.py   2021-03-30 13:40:10.957157100 +0200
   +++ kubernetes_executor.py   2021-03-30 13:45:13.836000000 +0200
   @@ -142,7 +142,7 @@
                list_worker_pods = functools.partial(
                    watcher.stream, kube_client.list_namespaced_pod, self.namespace, **kwargs
                )
   -        for event in list_worker_pods():
   +        for event in list_worker_pods(timeout_seconds=60):
                task = event['object']
                self.log.info('Event: %s had an event of type %s', task.metadata.name, event['type'])
                if event['type'] == 'ERROR':
   
   ```
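
   For context, here is a minimal sketch of why a short `timeout_seconds` should 
be harmless: when the stream ends, the caller opens a new one and resumes from 
the last `resource_version` it saw. The helper name `watch_with_restarts` and 
its `max_streams` parameter are hypothetical (they are not part of Airflow or 
the kubernetes client), and the stream function is passed in as an argument so 
the loop itself has no external dependency. This is an illustration of the 
idea, not the executor's actual code.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, Optional

Event = Dict[str, Any]


def watch_with_restarts(
    stream_fn: Callable[..., Iterable[Event]],
    timeout_seconds: int = 60,
    max_streams: Optional[int] = None,
) -> Iterator[Event]:
    """Yield watch events, re-opening the stream every time it ends.

    stream_fn is assumed to accept `timeout_seconds` and `resource_version`
    keyword arguments, as a watch on `list_namespaced_pod` does.
    max_streams (hypothetical) bounds how many streams are opened;
    None means keep going forever.
    """
    resource_version: Optional[str] = None
    opened = 0
    while max_streams is None or opened < max_streams:
        opened += 1
        kwargs: Dict[str, Any] = {"timeout_seconds": timeout_seconds}
        if resource_version is not None:
            # Resume where the previous stream stopped.
            kwargs["resource_version"] = resource_version
        for event in stream_fn(**kwargs):
            # Remember the position of the last event seen.
            resource_version = event["object"].metadata.resource_version
            yield event
        # Falling out of the for loop means the stream ended (for example
        # because the timeout expired); the while loop opens a fresh one.
```

   With the real client, `stream_fn` could be something like 
`functools.partial(watch.Watch().stream, kube_client.list_namespaced_pod, namespace)`, 
which has the same shape as `list_worker_pods` in the patch above.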
   
   



