andycai-sift opened a new issue, #41163:
URL: https://github.com/apache/airflow/issues/41163
### Official Helm Chart version
1.15.0 (latest released)
### Apache Airflow version
2.7.1
### Kubernetes Version
1.27.2
### Helm Chart configuration
Since the official Helm chart now supports HPA, these are my HPA settings.
```
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: "airflow-worker"
  namespace: airflow
  labels:
    app: airflow
    component: worker
    release: airflow
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: "airflow-worker"
  minReplicas: 2
  maxReplicas: 16
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
```
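For reference, the manifest above is roughly what the chart renders from a `workers.hpa` block in `values.yaml`. The following is only a sketch of that values structure as I understand it; key names such as `minReplicaCount` and `maxReplicaCount` may differ between chart versions:
```
# Sketch of the values.yaml that should render the HPA above.
# The workers.hpa key names are my assumption of the chart's schema and
# may need adjusting for a given chart version.
workers:
  hpa:
    enabled: true
    minReplicaCount: 2
    maxReplicaCount: 16
    metrics:
      - type: Resource
        resource:
          name: cpu
          target:
            type: Utilization
            averageUtilization: 70
      - type: Resource
        resource:
          name: memory
          target:
            type: Utilization
            averageUtilization: 80
```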
### Docker Image customizations
Nothing too special: just added openjdk, a mongo client, and some other Python-related dependencies.
### What happened
When I enabled HPA in the Airflow cluster, some long-running tasks (some over 15 hours) failed during scale-down and lost their logs. More details:
- A typical long-running task calls a Python API to spin up a Google Dataproc cluster and then waits for a job that runs for over 15 hours. When the HPA started scaling the pods down, these tasks failed, apparently because the pod was terminated once `terminationGracePeriodSeconds` was reached, even though the tasks were still waiting for the final response status from the Dataproc jobs.
- After the pod was terminated, the tasks' logs were not published to the GCS bucket, even though remote logging to GCS is already configured. The logs were still on the persistent disk; I had to spin up the same pod again to upload them to the GCS bucket.
- I currently set `terminationGracePeriodSeconds` to 1 hour, but given how long these tasks run I am not sure how the HPA settings can support them (see the sketch below).
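For completeness, this is roughly where the 1-hour grace period comes from. The `workers.terminationGracePeriodSeconds` key is my understanding of the chart's values schema and should be verified against the chart version in use:
```
# Sketch: the 1h grace period in values.yaml.
# workers.terminationGracePeriodSeconds is assumed to map onto the worker
# StatefulSet's pod spec; verify the key against your chart version.
workers:
  terminationGracePeriodSeconds: 3600  # 1 hour
```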
### What you think should happen instead
During an HPA scale-down, either the Celery worker pod should shut down gracefully and wait until all of its tasks are completed, or Airflow should provide some configuration to stop the task process before the final termination deadline, run the cleanup, and upload the logs to the remote destination (e.g. the GCS bucket).
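As a partial mitigation (not a real fix for task-aware draining), the autoscaling/v2 `behavior` block can at least slow scale-in down. This sketch uses standard Kubernetes HPA fields; whether and how the chart exposes them (e.g. via `workers.hpa.behavior`) is an assumption on my part:
```
# Sketch: slow down scale-in with standard autoscaling/v2 behavior fields.
# This only delays pod removal; it does not make the HPA task-aware.
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 3600   # wait 1h of low load before scaling in
      policies:
        - type: Pods
          value: 1            # remove at most one worker pod
          periodSeconds: 600  # per 10 minutes
```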
### How to reproduce
- Deploy Airflow to Kubernetes using the official Helm chart with HPA enabled on CPU/memory metrics.
- Set up remote logging (see the values.yaml sketch after this list):
```
logging:
  delete_local_logs: 'False'
  remote_base_log_folder: "gs://xxx"
  remote_logging: 'True'
```
- Run a long-running task (30 minutes or a few hours is enough), and make sure `terminationGracePeriodSeconds` is set to a small value.
- Trigger an HPA scale-down of the worker pods; in the UI you should then see the task failed without logs.
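For the remote logging step, this is roughly how the settings above would be passed through the chart. The `config:` mapping into `airflow.cfg` sections reflects my understanding of the chart's values and should be checked against the chart documentation:
```
# Sketch: passing the remote logging settings via the chart's config mapping.
# Section/key names mirror airflow.cfg; the config: block is assumed to be
# rendered into the Airflow configuration by the chart.
config:
  logging:
    remote_logging: 'True'
    remote_base_log_folder: "gs://xxx"
    delete_local_logs: 'False'
```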
### Anything else
_No response_
### Are you willing to submit PR?
- [X] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)