lopezvit opened a new issue, #38186: URL: https://github.com/apache/airflow/issues/38186
### Apache Airflow version

Other Airflow 2 version (please specify below)

### If "Other Airflow 2 version" selected, which one?

2.6.3

### What happened?

Quite often (but not always!), a task that (I guess) Airflow detects as a zombie is not retried. I don't understand why, and to me this is clearly a bug. The task is memory intensive, and I guess that is the underlying problem. I have increased the worker memory from 4 GB to 6.5 GB and it hasn't failed since. But this doesn't look like a sustainable solution: memory is expensive, and when the task is retried it always succeeds (probably because the worker is no longer under so much memory pressure), as can be seen in another execution of the same task in the _Anything else?_ section. I have gone through the documentation, the troubleshooting guide, and the known issues, and the only related issue I found was #37041, but it is hard to tell whether it applies, since Composer uses the Celery executor.

What is the business impact you are facing? Failing tasks force a human to go and retry them manually (they are currently left in the failed state to allow better troubleshooting of the issue). Increasing the memory seems an expensive solution, as the issue is not in our code but in the infrastructure.

### What you think should happen instead?

Since this happens at moments of high demand, simply retrying the task, **as it should**, would solve the problem without any human intervention.

### How to reproduce

We have a fairly memory-intensive task (around 200 MB) that has to run around 7 times on every DAG run, as it fetches the past 7 days' worth of data, because it can take that many days for the data to become golden. When all these tasks run in parallel (possibly together with tasks from other DAGs), they use all the memory of the VM, which causes the task to be killed.
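
To make the fan-out described above concrete, here is a minimal sketch in plain Python (no Airflow import) of how one DAG run maps to seven per-day fetches. The task-id pattern is taken from the logs in the _Anything else?_ section; the offset range 0..6, the helper names `fetch_plan` and `estimated_peak_mb`, and the flat 200 MB per task are assumptions for illustration, not the actual DAG code.

```python
from datetime import date, datetime, timedelta, timezone

# Assumption: seven tasks with day offsets 0..6. Only "minus-6" is visible
# in the logs; the other task ids are extrapolated from that pattern.
DAYS_BACK = 7


def fetch_plan(logical_date: datetime, days_back: int = DAYS_BACK) -> dict[str, date]:
    """Map each (hypothetical) task_id to the calendar day it would fetch."""
    return {
        f"read_Orders_endpoint_minus-{offset}": (logical_date - timedelta(days=offset)).date()
        for offset in range(days_back)
    }


def estimated_peak_mb(parallel_tasks: int, per_task_mb: int = 200) -> int:
    """Rough memory figure if all tasks run at once.

    Assumption: roughly 200 MB per task, as stated in the report; any
    per-process overhead of the worker itself is ignored.
    """
    return parallel_tasks * per_task_mb


if __name__ == "__main__":
    run = datetime(2024, 2, 18, 16, 48, tzinfo=timezone.utc)
    for task_id, day in fetch_plan(run).items():
        print(task_id, "->", day.isoformat())
    print("approx. MB if all run in parallel:", estimated_peak_mb(DAYS_BACK))
```

For the run `scheduled__2024-02-18T16:48:00+00:00`, this maps `read_Orders_endpoint_minus-6` to 2024-02-12, which matches the `Fetching Orders on 2024-02-12` line in the logs.
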
This is a rare occurrence anyway: the DAG is scheduled twice an hour, 16 hours a day, and it happened only 18 times over a 3-day period.

### Operating System

composer-2.5.2-airflow-2.6.3

### Versions of Apache Airflow Providers

Directly from the [documentation](https://cloud.google.com/composer/docs/concepts/versioning/composer-versions#images):

absl-py==2.0.0 agate==1.6.3 aiodebug==2.3.0 aiofiles==23.2.1 aiohttp==3.8.6 aiosignal==1.3.1 alembic==1.11.1 amqp==5.1.1 anyio==3.7.1 apache-airflow==2.6.3+composer apache-airflow-providers-apache-beam==5.3.0 apache-airflow-providers-cncf-kubernetes==7.9.0 apache-airflow-providers-common-sql==1.8.0 apache-airflow-providers-dbt-cloud==3.4.0 apache-airflow-providers-ftp==3.6.1 apache-airflow-providers-google==10.11.1 apache-airflow-providers-hashicorp==3.5.0 apache-airflow-providers-http==4.6.0 apache-airflow-providers-imap==3.4.0 apache-airflow-providers-mysql==5.2.0 apache-airflow-providers-postgres==5.8.0 apache-airflow-providers-sendgrid==3.3.0 apache-airflow-providers-sqlite==3.5.0 apache-airflow-providers-ssh==3.8.1 apache-beam==2.51.0 apispec==5.2.2 appdirs==1.4.4 argcomplete==3.1.1 asgiref==3.7.2 astunparse==1.6.3 async-timeout==4.0.2 attrs==23.1.0 Babel==2.12.1 backoff==2.2.1 backports.zoneinfo==0.2.1 bcrypt==4.0.1 billiard==4.1.0 blinker==1.6.2 cachecontrol==0.13.1 cachelib==0.9.0 cachetools==5.3.1 cattrs==23.1.2 celery==5.3.1 certifi==2023.7.22 cffi==1.15.1 chardet==5.2.0 charset-normalizer==3.1.0 click==8.1.3 click-didyoumean==0.3.0 click-plugins==1.1.1 click-repl==0.3.0 clickclick==20.10.2 cloudpickle==2.2.1 colorama==0.4.6 colorlog==4.8.0 ConfigUpdater==3.1.1 connexion==2.14.2 crcmod==1.7 cron-descriptor==1.4.0 croniter==1.4.1 cryptography==41.0.5 db-dtypes==1.1.1 dbt-bigquery==1.5.4 dbt-core==1.5.4 dbt-extractor==0.4.1 decorator==5.1.1 Deprecated==1.2.14 diff-cover==8.0.0 dill==0.3.1.1 distlib==0.3.6 dnspython==2.3.0 docopt==0.6.2 docutils==0.20.1 email-validator==1.3.1 exceptiongroup==1.1.2 fastavro==1.9.0
fasteners==0.19 filelock==3.12.2 firebase-admin==6.2.0 Flask==2.2.5 Flask-AppBuilder==4.3.1 Flask-Babel==2.0.0 Flask-Bcrypt==1.0.1 Flask-Caching==2.0.2 Flask-JWT-Extended==4.5.2 Flask-Limiter==3.3.1 Flask-Login==0.6.2 flask-session==0.5.0 Flask-SQLAlchemy==2.5.1 Flask-WTF==1.1.1 flatbuffers==23.5.26 flower==2.0.0 frozenlist==1.3.3 fsspec==2023.10.0 future==0.18.3 gast==0.4.0 gcloud-aio-auth==4.2.3 gcloud-aio-bigquery==7.0.0 gcloud-aio-storage==9.0.0 gcsfs==2023.10.0 google-ads==22.1.0 google-api-core==2.14.0 google-api-python-client==2.107.0 google-apitools==0.5.32 google-auth==2.23.4 google-auth-httplib2==0.1.1 google-auth-oauthlib==1.0.0 google-cloud-access-context-manager==0.1.16 google-cloud-aiplatform==1.36.2 google-cloud-appengine-logging==1.3.2 google-cloud-asset==3.20.0 google-cloud-audit-log==0.2.5 google-cloud-automl==2.11.3 google-cloud-batch==0.17.3 google-cloud-bigquery==3.13.0 google-cloud-bigquery-datatransfer==3.12.1 google-cloud-bigquery-storage==2.22.0 google-cloud-bigtable==2.21.0 google-cloud-build==3.21.0 google-cloud-common==1.2.0 google-cloud-compute==1.14.1 google-cloud-container==2.33.0 google-cloud-core==2.3.3 google-cloud-datacatalog==3.16.0 google-cloud-datacatalog-lineage==0.3.1 google-cloud-datacatalog-lineage-producer-client==0.1.0 google-cloud-dataflow-client==0.8.5 google-cloud-dataform==0.5.4 google-cloud-dataplex==1.8.1 google-cloud-dataproc==5.7.0 google-cloud-dataproc-metastore==1.13.0 google-cloud-datastore==2.18.0 google-cloud-dlp==3.13.0 google-cloud-documentai==2.20.2 google-cloud-filestore==1.6.2 google-cloud-firestore==2.13.1 google-cloud-kms==2.19.2 google-cloud-language==2.11.1 google-cloud-logging==3.8.0 google-cloud-memcache==1.7.3 google-cloud-monitoring==2.16.0 google-cloud-orchestration-airflow==1.9.2 google-cloud-org-policy==1.8.3 google-cloud-os-config==1.15.3 google-cloud-os-login==2.11.0 google-cloud-pubsub==2.18.4 google-cloud-pubsublite==0.6.1 google-cloud-redis==2.13.2 google-cloud-resource-manager==1.10.4 
google-cloud-run==0.10.0 google-cloud-secret-manager==2.16.4 google-cloud-spanner==3.40.1 google-cloud-speech==2.22.0 google-cloud-storage==2.13.0 google-cloud-storage-transfer==1.9.2 google-cloud-tasks==2.14.2 google-cloud-texttospeech==2.14.2 google-cloud-translate==3.12.1 google-cloud-videointelligence==2.11.4 google-cloud-vision==3.4.5 google-cloud-workflows==1.12.1 google-crc32c==1.5.0 google-pasta==0.2.0 google-re2==1.1 google-resumable-media==2.6.0 googleapis-common-protos==1.60.0 graphviz==0.20.1 greenlet==2.0.2 grpc-google-iam-v1==0.12.7 grpcio==1.59.2 grpcio-gcp==0.2.2 grpcio-status==1.59.2 gunicorn==20.1.0 h11==0.14.0 h5py==3.10.0 hdfs==2.7.3 hologram==0.0.16 httpcore==0.17.3 httplib2==0.22.0 httpx==0.24.1 humanize==4.7.0 hvac==2.0.0 idna==3.4 importlib-metadata==4.13.0 importlib-resources==5.12.0 inflection==0.5.1 iniconfig==2.0.0 isodate==0.6.1 itsdangerous==2.1.2 jaraco.classes==3.3.0 jeepney==0.8.0 Jinja2==3.1.2 Js2Py==0.74 json-merge-patch==0.2 jsonschema==4.18.6 jsonschema-specifications==2023.7.1 keras==2.13.1 keyring==24.3.0 keyrings.google-artifactregistry-auth==1.1.2 kombu==5.3.1 kubernetes==23.6.0 kubernetes-asyncio==24.2.3 lazy-object-proxy==1.9.0 leather==0.3.4 libclang==16.0.6 limits==3.5.0 linkify-it-py==2.0.2 lockfile==0.12.2 Logbook==1.5.3 looker-sdk==23.16.0 Mako==1.2.4 Markdown==3.4.3 markdown-it-py==3.0.0 MarkupSafe==2.1.3 marshmallow==3.19.0 marshmallow-enum==1.5.1 marshmallow-oneofschema==3.0.1 marshmallow-sqlalchemy==0.26.1 mashumaro==3.6 mdit-py-plugins==0.4.0 mdurl==0.1.2 minimal-snowplow-tracker==0.0.2 more-itertools==10.1.0 msgpack==1.0.5 multidict==6.0.4 mysqlclient==2.2.0 networkx==2.8.8 numpy==1.24.3 oauth2client==4.1.3 oauthlib==3.2.2 objsize==0.6.1 opt-einsum==3.3.0 ordered-set==4.1.0 orjson==3.9.10 overrides==6.5.0 packaging==23.1 pandas==2.0.3 pandas-gbq==0.19.2 paramiko==3.3.1 parsedatetime==2.4 pathspec==0.9.0 pendulum==2.1.2 pip==20.2.4 pipdeptree==2.13.1 pkgutil-resolve-name==1.3.10 platformdirs==3.8.1 pluggy==1.2.0 
prison==0.2.1 prometheus-client==0.17.0 prompt-toolkit==3.0.39 proto-plus==1.22.3 protobuf==4.24.4 psutil==5.9.5 psycopg2-binary==2.9.9 pyarrow==11.0.0 pyasn1==0.5.0 pyasn1-modules==0.3.0 pycparser==2.21 pydantic==1.10.12 pydata-google-auth==1.8.2 pydot==1.4.2 Pygments==2.16.1 pyjsparser==2.7.1 PyJWT==2.7.0 pymongo==4.6.0 PyNaCl==1.5.0 pyOpenSSL==23.3.0 pyparsing==3.1.1 pytest==7.4.3 python-daemon==3.0.1 python-dateutil==2.8.2 python-http-client==3.3.7 python-nvd3==0.15.0 python-slugify==8.0.1 pytimeparse==1.1.8 pytz==2023.3 pytzdata==2020.1 PyYAML==6.0 redis==3.5.3 referencing==0.30.2 regex==2023.10.3 requests==2.31.0 requests-oauthlib==1.3.1 requests-toolbelt==1.0.0 rfc3339-validator==0.1.4 rich==13.4.2 rich-argparse==1.2.0 rpds-py==0.10.0 rsa==4.9 SecretStorage==3.3.3 sendgrid==6.10.0 setproctitle==1.3.2 setuptools==66.1.1 shapely==2.0.2 six==1.16.0 sniffio==1.3.0 SQLAlchemy==1.4.49 sqlalchemy-bigquery==1.8.0 SQLAlchemy-JSONField==1.0.1.post0 sqlalchemy-spanner==1.6.2 SQLAlchemy-Utils==0.41.1 sqlfluff==2.3.3 sqllineage==1.4.8 sqlparse==0.4.4 sshtunnel==0.4.0 starkbank-ecdsa==2.2.0 statsd==4.0.1 tabulate==0.9.0 tblib==2.0.0 tenacity==8.2.2 tensorboard==2.13.0 tensorboard-data-server==0.7.2 tensorflow==2.13.1 tensorflow-estimator==2.13.0 tensorflow-io-gcs-filesystem==0.34.0 termcolor==2.3.0 text-unidecode==1.3 toml==0.10.2 tomli==2.0.1 tornado==6.3.2 tqdm==4.66.1 typing-extensions==4.5.0 tzdata==2023.3 tzlocal==5.2 uc-micro-py==1.0.2 unicodecsv==0.14.1 uritemplate==4.1.1 urllib3==1.26.18 vine==5.0.0 virtualenv==20.23.1 wcwidth==0.2.6 websocket-client==1.6.1 Werkzeug==2.2.3 wheel==0.41.3 wrapt==1.15.0 WTForms==3.0.1 yarl==1.9.2 zipp==3.15.0 zstandard==0.22.0

### Deployment

Google Cloud Composer

### Deployment details

### Version

`composer-2.5.2-airflow-2.6.3`

### Airflow Configuration Overrides

- scheduler
  - scheduler_heartbeat_sec: 45
  - dag_dir_list_interval: 10
  - scheduler_zombie_task_threshold: 500
  - catchup_by_default: False
- core
  - max_active_tasks_per_dag: 25
  - max_active_runs_per_dag: 5
  - dagbag_import_timeout: 60
  - dag_concurrency: 25
  - dags_are_paused_at_creation: True
- webserver
  - dag_orientation: TB
  - instance_name: <REDACTED>
  - navbar_color: #009DE0
- secrets
  - backend: airflow.providers.google.cloud.secrets.secret_manager.CloudSecretManagerBackend

### Environment Configuration:

- Resources
  - Workloads configuration
    - Scheduler: One scheduler with 0.5 vCPU, 2 GB memory, 1 GB storage
    - Triggerer: Disabled
    - Web server: 0.5 vCPU, 2 GB memory, 1 GB storage
    - Worker: Auto-scaling between 1 and 5 workers, with 1 vCPU, 4 GB memory, 1 GB storage each
  - Core infrastructure
    - Environment size: Small

### Pypi Packages

Name | Version
------ | --------
apache-airflow-providers-http | ==4.7.0
Django | ==3.1.14
djangorestframework | ==3.13.1

### Anything else? ``` *** Reading remote log from gs://<REDACTED>/logs/dag_id=<REDACTED>_integration/run_id=scheduled__2024-02-18T16:48:00+00:00/task_id=read_Orders_endpoint_minus-6/attempt=1.log. [2024-02-18, 19:19:14 EET] {taskinstance.py:1104} INFO - Dependencies all met for dep_context=non-requeueable deps ti=<TaskInstance: <REDACTED>_integration.read_Orders_endpoint_minus-6 scheduled__2024-02-18T16:48:00+00:00 [queued]> [2024-02-18, 19:19:14 EET] {taskinstance.py:1104} INFO - Dependencies all met for dep_context=requeueable deps ti=<TaskInstance: <REDACTED>_integration.read_Orders_endpoint_minus-6 scheduled__2024-02-18T16:48:00+00:00 [queued]> [2024-02-18, 19:19:14 EET] {taskinstance.py:1309} INFO - Starting attempt 1 of 2 [2024-02-18, 19:19:15 EET] {taskinstance.py:1328} INFO - Executing <Task(PythonOperator): read_Orders_endpoint_minus-6> on 2024-02-18 16:48:00+00:00 [2024-02-18, 19:19:15 EET] {standard_task_runner.py:57} INFO - Started process 30752 to run task [2024-02-18, 19:19:15 EET] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', '<REDACTED>_integration', 'read_Orders_endpoint_minus-6', 'scheduled__2024-02-18T16:48:00+00:00', '--job-id', '114249', 
'--raw', '--subdir', 'DAGS_FOLDER/<REDACTED>/<REDACTED>_dag.py', '--cfg-path', '/tmp/tmpcv7_sbw4'] [2024-02-18, 19:19:15 EET] {standard_task_runner.py:85} INFO - Job 114249: Subtask read_Orders_endpoint_minus-6 [2024-02-18, 19:19:17 EET] {task_command.py:414} INFO - Running <TaskInstance: <REDACTED>_integration.read_Orders_endpoint_minus-6 scheduled__2024-02-18T16:48:00+00:00 [running]> on host airflow-worker-w5f47 [2024-02-18, 19:19:19 EET] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_EMAIL='[email protected]' AIRFLOW_CTX_DAG_OWNER='<REDACTED>' AIRFLOW_CTX_DAG_ID='<REDACTED>_integration' AIRFLOW_CTX_TASK_ID='read_Orders_endpoint_minus-6' AIRFLOW_CTX_EXECUTION_DATE='2024-02-18T16:48:00+00:00' AIRFLOW_CTX_TRY_NUMBER='1' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-02-18T16:48:00+00:00' [2024-02-18, 19:19:20 EET] {base.py:73} INFO - Using connection ID '<REDACTED>' for task execution. [2024-02-18, 19:19:29 EET] {logging_mixin.py:150} INFO - Fetching Orders on 2024-02-12 [2024-02-18, 19:19:43 EET] {local_task_job_runner.py:225} INFO - Task exited with return code Negsignal.SIGKILL [2024-02-18, 19:19:44 EET] {taskinstance.py:2656} INFO - 0 downstream tasks scheduled from follow-on schedule check ``` ### Example of the previously correct execution (it did retry): ``` *** Reading remote log from gs://<REDACTED>/logs/dag_id=<REDACTED>_integration/run_id=scheduled__2024-02-18T16:18:00+00:00/task_id=read_Orders_endpoint_minus-6/attempt=2.log. 
[2024-02-18, 18:54:54 EET] {taskinstance.py:1104} INFO - Dependencies all met for dep_context=non-requeueable deps ti= [2024-02-18, 18:54:54 EET] {taskinstance.py:1104} INFO - Dependencies all met for dep_context=requeueable deps ti= [2024-02-18, 18:54:54 EET] {taskinstance.py:1309} INFO - Starting attempt 2 of 2 [2024-02-18, 18:54:54 EET] {taskinstance.py:1328} INFO - Executing on 2024-02-18 16:18:00+00:00 [2024-02-18, 18:54:54 EET] {standard_task_runner.py:57} INFO - Started process 30071 to run task [2024-02-18, 18:54:54 EET] {standard_task_runner.py:84} INFO - Running: ['airflow', 'tasks', 'run', '<REDACTED>_integration', 'read_Orders_endpoint_minus-6', 'scheduled__2024-02-18T16:18:00+00:00', '--job-id', '114236', '--raw', '--subdir', 'DAGS_FOLDER/<REDACTED>/<REDACTED>_dag.py', '--cfg-path', '/tmp/tmpqa1q7bn5'] [2024-02-18, 18:54:54 EET] {standard_task_runner.py:85} INFO - Job 114236: Subtask read_Orders_endpoint_minus-6 [2024-02-18, 18:54:54 EET] {task_command.py:414} INFO - Running on host airflow-worker-w5f47 [2024-02-18, 18:54:55 EET] {taskinstance.py:1547} INFO - Exporting env vars: AIRFLOW_CTX_DAG_EMAIL='[email protected]' AIRFLOW_CTX_DAG_OWNER='<REDACTED>' AIRFLOW_CTX_DAG_ID='<REDACTED>_integration' AIRFLOW_CTX_TASK_ID='read_Orders_endpoint_minus-6' AIRFLOW_CTX_EXECUTION_DATE='2024-02-18T16:18:00+00:00' AIRFLOW_CTX_TRY_NUMBER='2' AIRFLOW_CTX_DAG_RUN_ID='scheduled__2024-02-18T16:18:00+00:00' [2024-02-18, 18:54:55 EET] {base.py:73} INFO - Using connection ID '<REDACTED>' for task execution. [2024-02-18, 18:54:56 EET] {logging_mixin.py:150} INFO - Fetching Orders on 2024-02-12 [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:161} INFO - Executing query [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:180} INFO - Writing local data files [2024-02-18, 18:54:57 EET] {sql_to_gcs.py:185} INFO - Uploading chunk file #0 to GCS. [2024-02-18, 18:54:58 EET] {base.py:73} INFO - Using connection ID 'google_cloud_default' for task execution. 
[2024-02-18, 18:54:58 EET] {credentials_provider.py:353} INFO - Getting connection using `google.auth.default()` since no explicit credentials are provided. [2024-02-18, 18:54:58 EET] {gcs.py:562} INFO - File /tmp/tmpo0xnp84a uploaded to Orders/2024/02/12/Orders_20240212.json in <REDACTED>_datafiles_datalake-207612 bucket [2024-02-18, 18:54:58 EET] {sql_to_gcs.py:188} INFO - Removing local file [2024-02-18, 18:54:58 EET] {python.py:183} INFO - Done. Returned value was: None [2024-02-18, 18:54:58 EET] {taskinstance.py:1346} INFO - Marking task as SUCCESS. dag_id=<REDACTED>_integration, task_id=read_Orders_endpoint_minus-6, execution_date=20240218T161800, start_date=20240218T165454, end_date=20240218T165458 [2024-02-18, 18:54:58 EET] {local_task_job_runner.py:225} INFO - Task exited with return code 0 [2024-02-18, 18:54:58 EET] {taskinstance.py:2656} INFO - 1 downstream tasks scheduled from follow-on schedule check ``` ### Are you willing to submit PR? - [ ] Yes I am willing to submit a PR! ### Code of Conduct - [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
