guillemborrell opened a new issue #12644:
URL: https://github.com/apache/airflow/issues/12644
**Apache Airflow version**:
1.10.10 and 1.10.12 so far
**What happened**:
We are running Airflow with the KubernetesExecutor on a Kubernetes cluster
with a somewhat unreliable network. The scheduler runs smoothly until it logs
the following line:
```
WARNING - HTTPError when attempting to run task, re-queueing. Exception:
('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
```
From then on, the executor is able to schedule jobs, but the job watcher
stops processing events. The resulting behaviour is that new jobs are
scheduled, but all jobs with status `Completed` remain in the cluster instead
of being terminated. We mitigate this condition by periodically restarting the
scheduler, which cleans up all the completed tasks correctly.
**What you expected to happen**:
Either KubernetesJobWatcher opens a new connection and goes on processing
events, or it raises an exception shutting the process down so it can be
restarted.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]