wjddn279 opened a new issue, #56641: URL: https://github.com/apache/airflow/issues/56641
### Apache Airflow version

3.1.0

### If "Other Airflow 2/3 version" selected, which one?

_No response_

### What happened?

Related to: https://github.com/apache/airflow/issues/55768#issuecomment-3402928673

Memory usage in the scheduler container continues to rise.

### What you think should happen instead?

Memory usage should not keep rising in an Airflow deployment run with Docker Compose.

### How to reproduce

Test environment:

- Airflow 3.1.0 official Docker image
- deployed with Docker Compose (api-server, scheduler, dag-processor)
- 100 DAGs, each with 5 `PythonOperator` tasks, running every minute

### Operating System

Docker container running on macOS

### Versions of Apache Airflow Providers

_No response_

### Deployment

Docker-Compose

### Deployment details

````yaml
x-airflow-common: &airflow-common
  image: apache/airflow:3.1.0
  environment: &airflow-common-env
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
    AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
    AIRFLOW__CORE__FERNET_KEY: ''
    AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
    AIRFLOW__CORE__EXECUTION_API_SERVER_URL: 'http://airflow-apiserver:8080/execution/'
    AIRFLOW__API__SECRET_KEY: 'abc'
    AIRFLOW__API_AUTH__JWT_SECRET: 'asdasd'
    AIRFLOW__SCHEDULER__ENABLE_TRACEMALLOC: 'false'
  volumes:
    - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
  depends_on: &airflow-common-depends-on
    postgres:
      condition: service_healthy

services:
  postgres:
    image: postgres:13
    ports:
      - "5432:5432"
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow
    healthcheck:
      test: ["CMD", "pg_isready", "-U", "airflow"]
      interval: 10s
      retries: 5
      start_period: 5s
    restart: always

  airflow-init:
    <<: *airflow-common
    entrypoint: /bin/bash
    command:
      - -c
      - |
        echo "Creating missing opt dirs if missing:"
        mkdir -v -p /opt/airflow/{logs,dags,plugins,config}
        echo "Airflow version:"
        /entrypoint airflow version
        echo "Running airflow config list to create default config file if missing."
        /entrypoint airflow config list >/dev/null
        echo "Change ownership of files in /opt/airflow to ${AIRFLOW_UID}:0"
        chown -R "${AIRFLOW_UID}:0" /opt/airflow/
    environment:
      <<: *airflow-common-env
      _AIRFLOW_DB_MIGRATE: 'true'
      _AIRFLOW_WWW_USER_CREATE: 'true'
      _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
      _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
      _PIP_ADDITIONAL_REQUIREMENTS: ''
    user: "0:0"
    depends_on:
      <<: *airflow-common-depends-on

  airflow-apiserver:
    <<: *airflow-common
    command: api-server
    ports:
      - "8080:8080"
    healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler
    healthcheck:
      test: ["CMD", "airflow", "jobs", "check", "--job-type", "SchedulerJob"]
      interval: 30s
      timeout: 10s
      retries: 5
    restart: always
    depends_on:
      postgres:
        condition: service_healthy
      airflow-init:
        condition: service_completed_successfully

  airflow-dag-processor:
    <<: *airflow-common
    command: dag-processor
    healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type DagProcessorJob --hostname "$${HOSTNAME}"']
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 30s
    restart: always
    depends_on:
      <<: *airflow-common-depends-on
      airflow-init:
        condition: service_completed_successfully
````

### Anything else?

As mentioned earlier in https://github.com/apache/airflow/issues/55768#issuecomment-3374894204, the observed memory increase originates from both the scheduler process and its subprocesses, the LocalExecutor workers. The scheduler's own memory growth has already been analyzed and discussed by @kaxil in https://github.com/apache/airflow/issues/55768#issuecomment-3353598174, so I will not cover it here.

When running with the LocalExecutor, the default number of worker processes is 32.
Since any memory increase per worker is multiplied across all 32 workers, even small leaks can have a critical impact on overall memory usage. I used Memray to profile the worker processes (child processes of the scheduler) and identified three main causes of excessive memory allocation within them.

### 1. Importing the k8s client object

First, here is the result of profiling a single worker process:

[memray-flamegraph-output-111.html](https://github.com/user-attachments/files/22916972/memray-flamegraph-output-111.html)

[In this section](https://github.com/apache/airflow/blob/main/shared/secrets_masker/src/airflow_shared/secrets_masker/secrets_masker.py#L164C10-L164C47), I confirmed that approximately 32 MB of memory is allocated per worker. Although the code only appears to reference the object's type, it actually triggers imports of all the underlying submodules. Since each worker imports these modules independently, this adds roughly 1 GB of total memory across all 32 workers.

### 2. Increasing memory from client SSL objects

After modifying the problematic code in (1) to avoid the import, I ran the memory profiling again. The initial footprint per worker dropped significantly, but I still observed gradual growth over time ("0928" in a filename means the snapshot was taken at 09:28):

[remove-k8s-0928.html](https://github.com/user-attachments/files/22916986/remove-k8s-0928.html)
[remove-k8s-1001.html](https://github.com/user-attachments/files/22916988/remove-k8s-1001.html)
[remove-k8s-1035.html](https://github.com/user-attachments/files/22916990/remove-k8s-1035.html)

[In the following section](https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/api/client.py#L828), the SSL initialization does not appear to release memory properly. Within about 30 minutes, a single worker's memory grew from 8 MB to 23 MB, later exceeded 50 MB, and continued to climb steadily after that.
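A minimal sketch of how the eager import in (1) could be deferred. `is_v1_env_var` is a hypothetical helper, not the actual Airflow code; it relies on the fact that an object can only be an instance of `kubernetes.client.V1EnvVar` if that module was already imported somewhere else in the process:

```python
import sys


def is_v1_env_var(obj) -> bool:
    """Check for a kubernetes V1EnvVar without importing kubernetes.client.

    If kubernetes.client is not in sys.modules, obj cannot possibly be an
    instance of one of its classes, so we can return False without paying
    the ~32 MB import cost in every worker.
    """
    k8s_client = sys.modules.get("kubernetes.client")
    return k8s_client is not None and isinstance(obj, k8s_client.V1EnvVar)
```

The check is a no-op for processes that never touch Kubernetes objects, which is exactly the situation of most LocalExecutor workers.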
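One general way to keep the per-call SSL allocations in (2) bounded is to build the `ssl.SSLContext` once per process and reuse it for every client. This is a sketch of the pattern, not the actual Task SDK fix:

```python
import ssl
from functools import lru_cache


@lru_cache(maxsize=None)
def shared_ssl_context() -> ssl.SSLContext:
    # ssl.create_default_context() loads the CA bundle into fresh C-level
    # buffers on every call, and that memory is not reliably returned to
    # the OS. Caching a single context per process keeps the allocation
    # constant no matter how many clients are created.
    return ssl.create_default_context()
```

An HTTP client created per task attempt could then be passed `verify=shared_ssl_context()` instead of building its own context each time.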
### 3. Memory inheritance from the parent process due to lazy forking

After addressing issues (1) and (2), I verified that overall memory consumption remained stable and no longer grew continuously. However, I noticed that while the initial PSS values were low, they gradually rose to relatively high levels over time:

[memory_smem.txt](https://github.com/user-attachments/files/22917005/memory_smem.txt)

It was difficult to track the exact distribution with Memray because of the extensive shared-memory usage: very little heap memory remained in the workers themselves. My hypothesis is as follows: unlike Airflow 2.x, version 3.x introduced lazy worker initialization. As a result, when the scheduler (already holding significant memory) forks a new worker, copy-on-write (CoW) causes the shared pages to be duplicated into each worker as they are written to, increasing per-process memory consumption.

### Conclusion

To verify this hypothesis, I modified the code to eagerly spawn the worker processes before the scheduler enters its scheduling loop, effectively disabling lazy forking. In this experiment, worker memory usage stayed stable and the previous pattern of gradual growth disappeared:

[memory_smem_2.txt](https://github.com/user-attachments/files/22917012/memory_smem_2.txt)

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
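The eager-spawn experiment from the conclusion can be sketched with the standard library alone; the function names here are illustrative, not the scheduler's actual API. The key point is that the fork happens before the parent accumulates its large scheduling state, so the copy-on-write pages each worker inherits stay small:

```python
import multiprocessing as mp


def worker_loop(task_queue):
    # Drain callables from the queue until the None sentinel arrives.
    for fn in iter(task_queue.get, None):
        fn()


def start_workers_eagerly(n: int):
    # Use the "fork" start method explicitly (the mechanism the CoW
    # hypothesis is about). Forking here, before the parent builds its
    # scheduling state, means the pages shared with the parent are still
    # small, so per-worker PSS stays low for the life of the process.
    ctx = mp.get_context("fork")
    queue = ctx.Queue()
    workers = [ctx.Process(target=worker_loop, args=(queue,)) for _ in range(n)]
    for w in workers:
        w.start()
    return queue, workers
```

With lazy forking, the same `ctx.Process(...)` calls would instead run deep inside the scheduling loop, after the parent's heap has grown, and every page the worker subsequently writes to gets duplicated from that larger heap.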
