schattian opened a new issue, #23496:
URL: https://github.com/apache/airflow/issues/23496

   I observed that some workers stopped randomly after being running. 
   After some investigation, the issue is in the new kubernetes pod operator 
and is dependant of a current issue in the kubernetes api.
   
   When a log rotate event occurs in kubernetes, the stream we consume on 
`fetch_container_logs(follow=True,...)` is no longer being feeded.
   
   Therefore, the k8s pod operator hangs indefinetly at the middle of the log. 
Only a sigterm could terminate it as logs consumption is blocking `execute()` 
to finish.
   
   Ref to the issue in kubernetes: 
https://github.com/kubernetes/kubernetes/issues/59902 
   
   However, I think there are many possibilities to walk-around this from 
airflow-side (like making them not-blocking and block until 
`status.phase.completed` as it's currently done when `get_logs` is not true).
   
   
   Linking https://github.com/apache/airflow/issues/12103 for ref, as the 
result is more or less the same (although the root cause is different)
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to