moiseenkov commented on issue #23497:
URL: https://github.com/apache/airflow/issues/23497#issuecomment-1346520028

   I've finally managed to reproduce this bug with the following DAG on 
composer-2.0.29-airflow-2.3.3:
   ```python
   import datetime
   
   from airflow import models
   from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator
   
   YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)
   
   with models.DAG(
       dag_id="composer_sample_kubernetes_pod",
       schedule_interval=datetime.timedelta(days=1),
       start_date=YESTERDAY,
   ) as dag:
       timeout = 240
       iterations = 600 * 1000
       arguments = \
           'for i in {1..%(iterations)s}; do echo "$i of %(iterations)s"; done' % {'iterations': iterations}
   
       kubernetes_min_pod_0 = KubernetesPodOperator(
           task_id="pod-ex-minimum-0",
           name="pod-ex-minimum-0",
           cmds=["/bin/bash", "-c"],
           arguments=[arguments],
           namespace="default",
           image="gcr.io/gcp-runtimes/ubuntu_18_0_4",
           startup_timeout_seconds=timeout
       )
   ```
   With this example the container prints 600K log messages and terminates very quickly, while the Kubernetes API client is still pulling chunks of the container's logs from a stream. The pulling is much slower, so eventually the container has terminated but we are still pulling logs; this keeps going for about 2-3 minutes after termination. It looks to me as if the logs are buffered somewhere at a lower level, and once that buffer is exhausted the stream simply hangs. Ideally it would check the socket or connection status at that point, but in practice it just blocks forever.
   
   Here's the line of code that hangs on Airflow's side: 
https://github.com/apache/airflow/blob/395a34b960c73118a732d371e93aeab8dcd76275/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L232
   
   And here's the underlying line of code that hangs on urllib3's side: 
https://github.com/urllib3/urllib3/blob/d393b4a5091c27d2e158074f81feb264c5c175af/src/urllib3/response.py#L999
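
   For context, here's roughly the shape of that read path, as a minimal sketch using the kubernetes Python client directly (pod/container names below are placeholders; this is not Airflow's actual code). With `_preload_content=False` the client returns the raw `urllib3.response.HTTPResponse`, and each iteration of the loop ends in a blocking socket read, which is where it hangs once the buffered data is drained:
   ```python
   # Rough sketch (not Airflow's implementation) of how the log stream is consumed.
   from kubernetes import client, config

   config.load_kube_config()  # assumes local kubeconfig access to the cluster
   core_v1 = client.CoreV1Api()

   resp = core_v1.read_namespaced_pod_log(
       name="pod-ex-minimum-0",   # placeholder pod name
       namespace="default",
       container="base",          # placeholder container name
       follow=True,
       _preload_content=False,    # keep the raw urllib3.response.HTTPResponse
   )

   # Equivalent in spirit to the `for raw_line in logs:` loop linked above:
   # once the container has terminated and the buffered data is drained,
   # the next read blocks indefinitely instead of returning.
   for raw_line in resp:
       print(raw_line.decode("utf-8", errors="replace"), end="")
   ```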
   
   If I'm right, then the root cause lies in a third-party library (the Kubernetes API client or urllib3). In that case the easiest fix would be to check the container status before pulling each chunk of logs from the `urllib3.response.HTTPResponse`. A more robust solution would be to cache the logs in temporary storage and fetch them from that new source independently of the container's life cycle, but I'm not sure that's possible.
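
   To make the "check the container status before each chunk" idea concrete, here's an illustrative continuation of the sketch above (hypothetical helper names, not a proposed patch; a real change would presumably reuse the provider's existing pod-status helpers and throttle the extra API calls):
   ```python
   # Illustrative guard for the read loop sketched earlier: stop reading once
   # the container has terminated, instead of blocking on a dead stream.
   # `core_v1` and `resp` are the objects created in the previous sketch.
   def container_is_terminated(core_v1, pod_name, namespace, container_name):
       """Return True if the named container has reached a terminated state."""
       pod = core_v1.read_namespaced_pod(name=pod_name, namespace=namespace)
       for status in (pod.status.container_statuses or []):
           if status.name == container_name:
               return status.state.terminated is not None
       return False

   for raw_line in resp:
       print(raw_line.decode("utf-8", errors="replace"), end="")
       # In practice this check would be throttled (e.g. every N lines or every
       # few seconds) rather than issued once per log line.
       if container_is_terminated(core_v1, "pod-ex-minimum-0", "default", "base"):
           break
   ```
   Note that breaking out like this can still drop lines that remain buffered after termination, which is part of why the temporary-storage approach mentioned above would be more robust.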

