mschueler opened a new issue, #32533:
URL: https://github.com/apache/airflow/issues/32533
### Apache Airflow version
Other Airflow 2 version (please specify below)
### What happened
We are seeing intermittent SIGTERMs on DAGs. There seems to be no rhyme or
reason to the SIGTERMs: they happen to all our DAGs at one time or another,
with no pattern to the timing.
The deployment is via the Helm chart to an EKS cluster, and it's
happening in both our nonprod and prod clusters. We've tried different things
in our nonprod environment to fix it, basically following ideas we found from
Google searches: increasing resources, upgrading the Airflow version, checking logs
(we've found nothing useful in the logs, but will post as much info as I can
here), increasing timeouts, and trying some settings we found mentioned in
other GitHub issues:
```
- name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
  value: "3600"
- name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
  value: "false"
```
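For reference, Airflow maps environment variables of the form `AIRFLOW__SECTION__KEY` onto `airflow.cfg` sections, which is how the two settings above override the config. A minimal illustrative sketch of that mapping (not Airflow's actual parser):

```python
# Illustrative sketch of how AIRFLOW__SECTION__KEY environment variables
# map onto airflow.cfg sections; not Airflow's actual implementation.

def env_overrides(environ):
    """Collect AIRFLOW__* variables into {section: {option: value}}."""
    out = {}
    for key, value in environ.items():
        if not key.startswith("AIRFLOW__"):
            continue
        # AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME -> ("CORE", "KILLED_TASK_CLEANUP_TIME")
        _, section, option = key.split("__", 2)
        out.setdefault(section.lower(), {})[option.lower()] = value
    return out

overrides = env_overrides({
    "AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME": "3600",
    "AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION": "false",
})
# overrides == {"core": {"killed_task_cleanup_time": "3600"},
#               "scheduler": {"schedule_after_task_execution": "false"}}
```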
Nonprod: EKS 1.26 / Airflow 2.5.1.
Prod: EKS 1.25 / Airflow 2.2.4.
We're focusing efforts on nonprod, but I wanted to mention we're seeing the
issue on multiple versions. I believe the original version we started on
was 2.0.x, and we've been struggling with this issue since January
(when we first set up Airflow 2.0 on k8s). As a workaround we are
doing a retry where possible.
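For concreteness, the retry workaround is just task-level `retries` in `default_args` — a hypothetical minimal DAG (the `dag_id`, schedule, and task here are illustrative, not one of our real DAGs):

```python
# Hypothetical DAG showing the retry workaround; names are illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="example_with_retries",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={
        # Re-run tasks that die from the stray SIGTERMs described above.
        "retries": 2,
        "retry_delay": timedelta(minutes=5),
    },
) as dag:
    BashOperator(task_id="do_work", bash_command="echo work")
```

This only papers over the problem for idempotent tasks; it doesn't address the underlying SIGTERMs.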
This is the exact error:
`airflow.exceptions.AirflowException: Task received SIGTERM signal`
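For context on what produces that message: the task process installs a SIGTERM handler that converts the signal into an exception, so any external kill (pod eviction, scheduler-initiated kill, node scale-down) surfaces as this same error. A minimal stdlib sketch of the mechanism (illustrative only, not Airflow's actual code):

```python
import os
import signal


class TaskTerminated(Exception):
    """Stand-in for airflow.exceptions.AirflowException."""


def on_sigterm(signum, frame):
    # The task runner converts the signal into an exception, so the task
    # fails with "Task received SIGTERM signal" regardless of who sent it.
    raise TaskTerminated("Task received SIGTERM signal")


signal.signal(signal.SIGTERM, on_sigterm)

try:
    os.kill(os.getpid(), signal.SIGTERM)  # simulate an external kill
except TaskTerminated as exc:
    print(exc)
```

The practical consequence: the traceback identifies the symptom, not the sender, which is why we can't tell from task logs alone what is killing the pods.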
Would truly appreciate any help or insight into what we're doing wrong.
I've tried to put as much information below as possible but if I'm missing
something, please let me know.
Helm values file:
```
# User and group of airflow user
airflowHome: /opt/airflow
airflowPodAnnotations:
  ad.datadoghq.com/tolerate-unready: "true"
  ad.datadoghq.com/webserver.check_names: '["airflow"]'
  ad.datadoghq.com/webserver.init_configs: "[{}]"
  ad.datadoghq.com/webserver.instances: '[{"url": "https://airflow.dev.eks.xxxx.com"}]'
  ad.datadoghq.com/webserver.logs: '[{"source":"airflow", "service": "airflow"}]'
defaultAirflowRepository: apache/airflow
defaultAirflowTag: 2.5.2
airflowVersion: 2.5.2
##########################################
## COMPONENT | Airflow images and gitsync
##########################################
images:
  airflow:
    pullPolicy: IfNotPresent
    repository: 007601687147.dkr.ecr.us-east-1.amazonaws.com/airflow
    tag: "33-dev"
  gitSync:
    pullPolicy: IfNotPresent
    repository: k8s.gcr.io/git-sync/git-sync
    tag: v3.3.0
  pgbouncer:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-pgbouncer-2021.04.28-1.14.0
  pgbouncerExporter:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-pgbouncer-exporter-2021.09.22-0.12.0
  pod_template:
    pullPolicy: IfNotPresent
    repository: null
    tag: null
  statsd:
    pullPolicy: IfNotPresent
    repository: apache/airflow
    tag: airflow-statsd-exporter-2021.04.28-v0.17.0
##########################################
## COMPONENT | Load balancer configs
##########################################
ingress:
  enabled: true
  web:
    annotations:
      kubernetes.io/ingress.class: alb
      alb.ingress.kubernetes.io/scheme: internal
      alb.ingress.kubernetes.io/target-type: ip
      alb.ingress.kubernetes.io/success-codes: 200,302
      alb.ingress.kubernetes.io/inbound-cidrs: 10.0.0.0/8,172.16.0.0/12,192.168.0.0/16
      alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS":443}]'
      alb.ingress.kubernetes.io/actions.ssl-redirect: '{"Type": "redirect", "RedirectConfig": { "Protocol": "HTTPS", "Port": "443", "StatusCode": "HTTP_301"}}'
      alb.ingress.kubernetes.io/tags: environment=dev,[email protected],business_app=eks-cluster,Name=airflow-ingress
      alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxx:certificate/a26e6adb-9e75-4048-baa1-8ae08e2f8dd4
    path: "/*"
    pathType: "ImplementationSpecific"
    hosts:
      - airflow.dev.eks.xxxxx.com
    precedingPaths:
      - path: "/*"
        serviceName: "ssl-redirect"
        servicePort: "use-annotation"
        pathType: "ImplementationSpecific"
    succeedingPaths: []
    tls:
      enabled: false
      secretName: ""
# `airflow_local_settings` file as a string (can be templated).
airflowLocalSettings: null
# Enable RBAC (default on most clusters these days)
rbac:
  create: true
# Airflow executor
executor: KubernetesExecutor
# Environment variables for all airflow containers
env:
  - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
    value: DEBUG
allowPodLaunching: true
# Custom secrets
extraSecrets:
  airflow-ssh-secret:
    data: |
      gitSshKey: 'xxxx'
  airflow-db:
    data: |
      connection: 'xxxxxx'
# Extra env 'items' that will be added to the definition of airflow containers
extraEnv: |-
  - name: AIRFLOW__METRICS__STATSD_HOST
    valueFrom:
      fieldRef:
        fieldPath: status.hostIP
  - name: AWS_DEFAULT_REGION
    value: us-east-1
  - name: AIRFLOW__LOGGING__FAB_LOGGING_LEVEL
    value: INFO
  - name: AIRFLOW__CORE__KILLED_TASK_CLEANUP_TIME
    value: "3600"
  - name: AIRFLOW__SCHEDULER__SCHEDULE_AFTER_TASK_EXECUTION
    value: "false"
# Airflow database config
data:
  metadataSecretName: airflow-db
# Fernet key settings
# Note: fernetKey can only be set during install, not upgrade
fernetKey: null
fernetKeySecretName: null
###################################
## COMPONENT | Airflow Workers
###################################
workers:
  persistence:
    enabled: false
    fixPermissions: false
  nodeSelector:
    node.kubernetes.io/instance-type: c5d.2xlarge
  tolerations:
    - effect: NoSchedule
      key: allowed_jobs
      value: airflow
      operator: Equal
  resources:
    limits:
      memory: 4000Mi
    requests:
      memory: 4000Mi
  replicas: 1
  safeToEvict: true
  serviceAccount:
    create: true
    name: null
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::xxxxx:role/airflow-eks-devops-dev-s3-irsa
  strategy:
    rollingUpdate:
      maxSurge: 100%
      maxUnavailable: 50%
  updateStrategy: null
  extraVolumes:
    - name: temp-worker-data
      persistentVolumeClaim:
        claimName: airflow-temp-workers-efs-claim
  extraVolumeMounts:
    - name: temp-worker-data
      mountPath: /opt/airflow/worker_data/
###################################
## COMPONENT | Airflow Scheduler
###################################
scheduler:
  livenessProbe:
    failureThreshold: 5
    initialDelaySeconds: 10
    periodSeconds: 60
    timeoutSeconds: 20
  nodeSelector:
    node.kubernetes.io/instance-type: t3.2xlarge
  podDisruptionBudget:
    config:
      maxUnavailable: 1
    enabled: false
  replicas: 1
  safeToEvict: true
  serviceAccount:
    create: true
    name: null
    annotations:
      eks.amazonaws.com/role-arn: arn:aws:iam::xxxxx:role/airflow-eks-devops-dev-s3-irsa
  extraVolumes:
    - name: temp-worker-data
      persistentVolumeClaim:
        claimName: airflow-temp-workers-efs-claim
  extraVolumeMounts:
    - name: temp-worker-data
      mountPath: /opt/airflow/worker_data/
###################################
## COMPONENT | Airflow Webserver
###################################
webserver:
  allowPodLogReading: true
  defaultUser:
    email: [email protected]
    enabled: true
    firstName: admin1
    lastName: user
    password: xxxx
    role: Admin
    username: admin
  livenessProbe:
    initialDelaySeconds: 15
    timeoutSeconds: 30
    failureThreshold: 20
    periodSeconds: 5
  readinessProbe:
    failureThreshold: 20
    initialDelaySeconds: 15
    periodSeconds: 5
    timeoutSeconds: 30
  serviceAccount:
    create: true
    name: ~
    annotations:
  replicas: 2
  service:
    type: ClusterIP
    ports:
      - name: airflow-ui
        port: 80
        targetPort: airflow-ui
  strategy: null
  webserverConfig: |-
    import os
    from airflow import configuration as conf
    from flask_appbuilder.security.manager import AUTH_LDAP
    CSRF_ENABLED = True
    AUTH_TYPE = AUTH_LDAP
    AUTH_LDAP_SERVER = "ldap://ldap-aws.xxxx.com:389"
    AUTH_LDAP_USE_TLS = False
    AUTH_USER_REGISTRATION = True
    AUTH_USER_REGISTRATION_ROLE = "Admin"
    AUTH_LDAP_FIRSTNAME_FIELD = "givenName"
    AUTH_LDAP_LASTNAME_FIELD = "sn"
    AUTH_LDAP_EMAIL_FIELD = "mail"
    AUTH_LDAP_USERNAME_FORMAT = "xxx"
    AUTH_LDAP_SEARCH = "xxxx"
    AUTH_LDAP_UID_FIELD = "SamAccountName"
    AUTH_LDAP_SEARCH_FILTER = "xxxxxxx"
    AUTH_LDAP_GROUP_FIELD = "memberOf"
    AUTH_ROLES_SYNC_AT_LOGIN = True
    PERMANENT_SESSION_LIFETIME = 600
# Overriding airflow flower
flower:
  enabled: false
# Overriding airflow statsd
statsd:
  enabled: false
##########################################
## COMPONENT | PgBouncer
##########################################
pgbouncer:
  ciphers: normal
  configSecretName: null
  enabled: true
  logConnections: 1
  logDisconnections: 1
  maxClientConn: 100
  metadataPoolSize: 10
  podDisruptionBudget:
    config:
      maxUnavailable: 1
    enabled: false
  resultBackendPoolSize: 5
  serviceAccount:
    create: true
    name: null
  ssl:
    ca: null
    cert: null
    key: null
  sslmode: prefer
# Overriding redis config
redis:
  enabled: false
# All ports used by chart
ports:
  airflowUI: 8080
  pgbouncer: 6543
  pgbouncerScrape: 9127
  statsdIngest: 8125
  workerLogs: 8793
# This runs as a CronJob to cleanup old pods
cleanup:
  enabled: true
  schedule: "*/15 * * * *"
  serviceAccount:
    create: true
    name: airflow
# Overriding postgres config
postgresql:
  enabled: false
# Config settings to go into the mounted airflow.cfg
config:
  core:
    load_examples: "False"
    load_default_connections: "False"
    parallelism: 300
    default_pool_task_slot_count: 300
    max_active_tasks_per_dag: 100
    max_active_runs_per_dag: 1
    executor: "{{ .Values.executor }}"
    remote_logging: "True"
    dagbag_import_timeout: 60
  email:
    email_backend: airflow.utils.email.send_email_smtp
  smtp:
    smtp_host: mailhost.dynata.com
    smtp_starttls: False
    smtp_ssl: False
    smtp_port: 25
    smtp_mail_from: [email protected]
  logging:
    colored_console_log: "False"
    remote_logging: "True"
    remote_base_log_folder: s3://airflow-dev-eks
    remote_log_conn_id: aws_default
  metrics:
    statsd_on: true
    statsd_port: 8125
    statsd_prefix: airflow
  webserver:
    base_url: https://airflow.dev.eks.dynata.com
  scheduler:
    enable_health_check: "True"
  kubernetes:
    worker_pods_creation_batch_size: 100
# Overriding pod template
podTemplate: null
###################################
## COMPONENT | Airflow dags
###################################
dags:
  gitSync:
    # branch: airflow-v2-testin
    branch: main
    containerName: git-sync
    depth: 1
    enabled: true
    env: []
    extraVolumeMounts: []
    maxFailures: 0
    repo: [email protected]:dynata/airflow.git
    rev: HEAD
    sshKeySecret: airflow-ssh-secret
    # subPath: dags_v2
    subPath: dags
    uid: 50000
    wait: 30
  persistence:
    enabled: false
# Overriding logs
logs:
  persistence:
    enabled: false
```
Resulting airflow.cfg (rendered into the ConfigMap):
```
[celery]
flower_url_prefix = /
worker_concurrency = 16
[celery_kubernetes_executor]
kubernetes_queue = kubernetes
[core]
colored_console_log = False
dagbag_import_timeout = 60
dags_folder = /opt/airflow/dags/repo/dags
default_pool_task_slot_count = 300
executor = KubernetesExecutor
load_default_connections = False
load_examples = False
max_active_runs_per_dag = 1
max_active_tasks_per_dag = 100
parallelism = 300
remote_logging = True
[elasticsearch]
json_format = True
log_id_template = {dag_id}_{task_id}_{execution_date}_{try_number}
[elasticsearch_configs]
max_retries = 3
retry_timeout = True
timeout = 30
[email]
email_backend = airflow.utils.email.send_email_smtp
[kerberos]
ccache = /var/kerberos-ccache/cache
keytab = /etc/airflow.keytab
principal = [email protected]
reinit_frequency = 3600
[kubernetes]
airflow_configmap = airflow-airflow-config
airflow_local_settings_configmap = airflow-airflow-config
multi_namespace_mode = False
namespace = data-platform
pod_template_file = /opt/airflow/pod_templates/pod_template_file.yaml
worker_container_repository = xxxxx.dkr.ecr.us-east-1.amazonaws.com/airflow
worker_container_tag = 33-dev
worker_pods_creation_batch_size = 100
[logging]
colored_console_log = False
remote_base_log_folder = s3://airflow-dev-xxxx
remote_log_conn_id = aws_default
remote_logging = True
[metrics]
statsd_host = airflow-statsd
statsd_on = true
statsd_port = 8125
statsd_prefix = airflow
[scheduler]
enable_health_check = True
run_duration = 41460
standalone_dag_processor = False
statsd_host = airflow-statsd
statsd_on = False
statsd_port = 9125
statsd_prefix = airflow
[smtp]
smtp_host = mailhost.xxx.com
smtp_mail_from = [email protected]
smtp_port = 25
smtp_ssl = false
smtp_starttls = false
[webserver]
base_url = https://airflow.dev.eks.xxxx.com
enable_proxy_fix = True
rbac = True
```
### What you think should happen instead
_No response_
### How to reproduce
Intermittent. Schedule a DAG run.
### Operating System
Kubernetes -- DAGs running on image based on Debian Bullseye
### Versions of Apache Airflow Providers
Provider | Version | Description
-- | -- | --
apache-airflow-providers-amazon | 7.3.0 | Amazon integration (including Amazon Web Services (AWS))
apache-airflow-providers-celery | 3.1.0 | Celery
apache-airflow-providers-cncf-kubernetes | 5.2.2 | Kubernetes
apache-airflow-providers-common-sql | 1.3.4 | Common SQL Provider
apache-airflow-providers-datadog | 2.0.4 | Datadog
apache-airflow-providers-docker | 3.5.1 | Docker
apache-airflow-providers-elasticsearch | 4.4.0 | Elasticsearch
apache-airflow-providers-ftp | 3.3.1 | File Transfer Protocol (FTP)
apache-airflow-providers-google | 8.11.0 | Google services including: Google Ads, Google Cloud (GCP), Google Firebase, Google LevelDB, Google Marketing Platform, Google Workspace (formerly Google Suite)
apache-airflow-providers-grpc | 3.1.0 | gRPC
apache-airflow-providers-hashicorp | 3.3.0 | Hashicorp including Hashicorp Vault
apache-airflow-providers-http | 4.2.0 | Hypertext Transfer Protocol (HTTP)
apache-airflow-providers-imap | 3.1.1 | Internet Message Access Protocol (IMAP)
apache-airflow-providers-microsoft-azure | 5.2.1 | Microsoft Azure
apache-airflow-providers-microsoft-mssql | 2.1.3 | Microsoft SQL Server (MSSQL)
apache-airflow-providers-mysql | 2.2.3 | MySQL
apache-airflow-providers-odbc | 3.2.1 | ODBC
apache-airflow-providers-oracle | 2.2.3 | Oracle
apache-airflow-providers-postgres | 5.4.0 | PostgreSQL
apache-airflow-providers-redis | 3.1.0 | Redis
apache-airflow-providers-sendgrid | 3.1.0 | Sendgrid
apache-airflow-providers-sftp | 2.6.0 | SSH File Transfer Protocol (SFTP)
apache-airflow-providers-slack | 7.2.0 | Slack
apache-airflow-providers-snowflake | 2.1.1 | Snowflake
apache-airflow-providers-sqlite | 3.3.1 | SQLite
apache-airflow-providers-ssh | 3.5.0 | Secure Shell (SSH)
apache-airflow-providers-tableau | 2.1.8 | Tableau
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
EKS 1.25 running Karpenter (cluster autoscaler replacement)
### Anything else
Intermittent -- anywhere from 5 to 50 times a day.
### Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)