schattian opened a new issue, #23496: URL: https://github.com/apache/airflow/issues/23496
I observed that some workers stopped randomly after being running. After some investigation, the issue is in the new kubernetes pod operator and is dependant of a current issue in the kubernetes api. When a log rotate event occurs in kubernetes, the stream we consume on `fetch_container_logs(follow=True,...)` is no longer being feeded. Therefore, the k8s pod operator hangs indefinetly at the middle of the log. Only a sigterm could terminate it as logs consumption is blocking `execute()` to finish. Ref to the issue in kubernetes: https://github.com/kubernetes/kubernetes/issues/59902 However, I think there are many possibilities to walk-around this from airflow-side (like making them not-blocking and block until `status.phase.completed` as it's currently done when `get_logs` is not true). Linking https://github.com/apache/airflow/issues/12103 for ref, as the result is more or less the same (although the root cause is different) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
