andycai-sift opened a new issue, #41163:
URL: https://github.com/apache/airflow/issues/41163

   ### Official Helm Chart version
   
   1.15.0 (latest released)
   
   ### Apache Airflow version
   
   2.7.1
   
   ### Kubernetes Version
   
   1.27.2
   
   ### Helm Chart configuration
   
   Since the official Helm chart now supports HPA, here are my HPA settings.
   ```
   apiVersion: autoscaling/v2
   kind: HorizontalPodAutoscaler
   metadata:
     name: "airflow-worker"
     namespace: airflow
     labels:
       app: airflow
       component: worker
       release: airflow
   spec:
     scaleTargetRef:
       apiVersion: apps/v1
       kind: StatefulSet
       name: "airflow-worker"
     minReplicas: 2
     maxReplicas: 16
     metrics:
       - type: Resource
         resource:
           name: cpu
           target:
             type: Utilization
             averageUtilization: 70
       - type: Resource
         resource:
           name: memory
           target:
             type: Utilization
             averageUtilization: 80
   ```
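   For context, if the HPA is driven through the chart itself rather than applied by hand, the corresponding values would look roughly like this — a sketch assuming the chart's `workers.hpa` keys (`enabled`, `minReplicaCount`, `maxReplicaCount`, `metrics`; names may differ by chart version):
   ```
   workers:
     hpa:
       # Let the chart render the HorizontalPodAutoscaler for the worker StatefulSet.
       enabled: true
       minReplicaCount: 2
       maxReplicaCount: 16
       metrics:
         - type: Resource
           resource:
             name: cpu
             target:
               type: Utilization
               averageUtilization: 70
   ```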
   
   ### Docker Image customizations
   
   No special packages; just added openjdk, the mongo client, and some other Python-related dependencies.
   
   ### What happened
   
   When I enabled HPA in the Airflow cluster, some long-running tasks (which can run for over 15hrs) failed during scale-down and lost their logs. More details:
   - A typical long-running task calls the Python API to spin up a Google Dataproc cluster whose job runs for over 15hrs. When the HPA started scaling down the pods, these tasks failed, apparently because the pod was terminated once `terminationGracePeriodSeconds` was reached, even though the tasks were still waiting for the final response status from the Dataproc jobs.
   - After the pod terminated, the tasks' logs were not published to the GCS bucket, even though remote logging to GCS is already configured. The logs were still on the persistent disk, and I had to spin up the same pod again before they could be uploaded to the GCS bucket.
   - `terminationGracePeriodSeconds` is currently set to one hour, but given how long the tasks run I am not sure how the HPA settings can support them (see the sketch after this list).
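   For reference, a minimal sketch of how the grace period is set in my values file, assuming the chart exposes `workers.terminationGracePeriodSeconds`:
   ```
   workers:
     # Kubernetes sends SIGTERM, waits this long, then sends SIGKILL.
     # 3600s = 1 hour, which is still far shorter than a 15hr Dataproc job.
     terminationGracePeriodSeconds: 3600
   ```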
   
   ### What you think should happen instead
   
   During HPA scale-down, either the Celery worker pod should shut down gracefully and wait until all of its tasks have completed, or Airflow should provide a config (or similar mechanism) to kill the task process before the final termination deadline, finish the cleanup, and upload the logs to the remote destination such as the GCS bucket.
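   One partial mitigation I can think of is the `autoscaling/v2` `behavior` block, which only slows scale-down rather than waiting for task completion, so it does not really fix the problem for 15hr tasks. A sketch:
   ```
   spec:
     behavior:
       scaleDown:
         # Require 30 minutes of consistently low load before removing pods.
         stabilizationWindowSeconds: 1800
         policies:
           # Remove at most one worker pod every 10 minutes.
           - type: Pods
             value: 1
             periodSeconds: 600
   ```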
   
   ### How to reproduce
   
   - Deploy Airflow to Kubernetes using the official Helm chart with HPA enabled on CPU/memory metrics.
   - Set up remote logging (see the values sketch after this list):
   ```
   logging:
       delete_local_logs: 'False'
       remote_base_log_folder: "gs://xxx"
       remote_logging: 'True'
   ```
   - Run a long-running task; 30 minutes or a few hours is enough, but make sure `terminationGracePeriodSeconds` is set to a small number.
   - Trigger an HPA scale-down of the pods; in the UI you should then see the failed task without logs.
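   For completeness, the logging section above sits under the chart's top-level `config` key (which the chart maps into `airflow.cfg`). A sketch, with the connection id being an assumption on my side:
   ```
   config:
     logging:
       remote_logging: 'True'
       remote_base_log_folder: "gs://xxx"
       delete_local_logs: 'False'
       # Assumed connection id; use whatever GCP connection handles log uploads.
       remote_log_conn_id: "google_cloud_default"
   ```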
   
   ### Anything else
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

