guillemborrell opened a new issue #12644:
URL: https://github.com/apache/airflow/issues/12644


   **Apache Airflow version**:
   1.10.10 and 1.10.12 so far
   
   **What happened**:
   We are running Airflow with the KubernetesExecutor on a Kubernetes cluster 
with a somewhat unreliable network. The scheduler runs smoothly until it logs 
the following line:
   
   ```
   WARNING - HTTPError when attempting to run task, re-queueing. Exception: 
('Connection aborted.', OSError("(110, 'ETIMEDOUT')"))
   ```
   From then on, the executor is able to schedule jobs, but the job watcher 
stops processing events. The resulting behaviour is that new jobs are 
scheduled, but all jobs with status `Completed` remain in the cluster instead 
of being terminated. We mitigate this condition by periodically restarting the 
scheduler, which cleans up all the completed tasks correctly.
   
   **What you expected to happen**:
   Either KubernetesJobWatcher opens a new connection and goes on processing 
events, or it raises an exception shutting the process down so it can be 
restarted.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to