wjddn279 opened a new issue, #56641:
URL: https://github.com/apache/airflow/issues/56641

   ### Apache Airflow version
   
   3.1.0
   
   ### If "Other Airflow 2/3 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   Related to:
https://github.com/apache/airflow/issues/55768#issuecomment-3402928673
   Memory usage continues to rise in the scheduler container.
   
   ### What you think should happen instead?
   
   Memory usage should remain stable in an Airflow deployment running under Docker Compose.
   
   
   ### How to reproduce
   
   Test environment:
   - Airflow 3.1.0 official Docker image
   - deployed via Docker Compose (api-server, scheduler, dag-processor)
   - 100 DAGs, each with 5 PythonOperator tasks, running every minute
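The test DAGs above can be produced with a small generator script. This is a minimal sketch of how I set up the load, not the exact files used; the `airflow.sdk` / standard-provider import paths in the template are assumptions based on Airflow 3 conventions:

```python
from pathlib import Path

# Template for one test DAG: 5 no-op PythonOperator tasks on a 1-minute schedule.
# The import paths inside the template are Airflow 3 assumptions.
DAG_TEMPLATE = '''\
from airflow.providers.standard.operators.python import PythonOperator
from airflow.sdk import DAG

def noop():
    pass

with DAG(dag_id="load_test_{i}", schedule="* * * * *", catchup=False) as dag:
    for t in range(5):
        PythonOperator(task_id=f"task_{{t}}", python_callable=noop)
'''


def generate_dags(target_dir: str, count: int = 100) -> list[str]:
    """Write `count` DAG files into target_dir and return their paths."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    files = []
    for i in range(count):
        path = out / f"load_test_{i}.py"
        path.write_text(DAG_TEMPLATE.format(i=i))
        files.append(str(path))
    return files
```

Dropping the output of `generate_dags("./dags")` into the mounted `dags/` volume reproduces the load described above.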
   
   ### Operating System
   
   Docker containers running on macOS
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Docker-Compose
   
   ### Deployment details
   
   ````
   x-airflow-common:
     &airflow-common
     image: apache/airflow:3.1.0
     environment:
       &airflow-common-env
       AIRFLOW__CORE__EXECUTOR: LocalExecutor
        AIRFLOW__CORE__AUTH_MANAGER: airflow.providers.fab.auth_manager.fab_auth_manager.FabAuthManager
        AIRFLOW__DATABASE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
       AIRFLOW__CORE__FERNET_KEY: ''
       AIRFLOW__CORE__LOAD_EXAMPLES: 'false'
        AIRFLOW__CORE__EXECUTION_API_SERVER_URL: 'http://airflow-apiserver:8080/execution/'
       AIRFLOW__API__SECRET_KEY: 'abc'
       AIRFLOW__API_AUTH__JWT_SECRET: 'asdasd'
       AIRFLOW__SCHEDULER__ENABLE_TRACEMALLOC: 'false'
     volumes:
       - ${AIRFLOW_PROJ_DIR:-.}/dags:/opt/airflow/dags
     depends_on:
       &airflow-common-depends-on
       postgres:
         condition: service_healthy
   
   services:
     postgres:
       image: postgres:13
       ports:
         - "5432:5432"
       environment:
         POSTGRES_USER: airflow
         POSTGRES_PASSWORD: airflow
         POSTGRES_DB: airflow
       healthcheck:
         test: ["CMD", "pg_isready", "-U", "airflow"]
         interval: 10s
         retries: 5
         start_period: 5s
       restart: always
   
     airflow-init:
       <<: *airflow-common
       entrypoint: /bin/bash
       command:
         - -c
         - |
           echo "Creating missing opt dirs if missing:"
           mkdir -v -p /opt/airflow/{logs,dags,plugins,config}
           echo "Airflow version:"
           /entrypoint airflow version
        echo "Running airflow config list to create default config file if missing."
        /entrypoint airflow config list >/dev/null
           echo "Change ownership of files in /opt/airflow to ${AIRFLOW_UID}:0"
           chown -R "${AIRFLOW_UID}:0" /opt/airflow/
       environment:
         <<: *airflow-common-env
         _AIRFLOW_DB_MIGRATE: 'true'
         _AIRFLOW_WWW_USER_CREATE: 'true'
         _AIRFLOW_WWW_USER_USERNAME: ${_AIRFLOW_WWW_USER_USERNAME:-airflow}
         _AIRFLOW_WWW_USER_PASSWORD: ${_AIRFLOW_WWW_USER_PASSWORD:-airflow}
         _PIP_ADDITIONAL_REQUIREMENTS: ''
       user: "0:0"
       depends_on:
         <<: *airflow-common-depends-on
   
     airflow-apiserver:
       <<: *airflow-common
       command: api-server
       ports:
         - "8080:8080"
       healthcheck:
      test: ["CMD", "curl", "--fail", "http://localhost:8080/api/v2/version"]
         interval: 30s
         timeout: 10s
         retries: 5
         start_period: 30s
       restart: always
       depends_on:
         <<: *airflow-common-depends-on
         airflow-init:
           condition: service_completed_successfully
   
     airflow-scheduler:
       <<: *airflow-common
       command: scheduler
       healthcheck:
         test: ["CMD", "airflow", "jobs", "check", "--job-type", "SchedulerJob"]
         interval: 30s
         timeout: 10s
         retries: 5
       restart: always
       depends_on:
         postgres:
           condition: service_healthy
         airflow-init:
           condition: service_completed_successfully
   
   
     airflow-dag-processor:
       <<: *airflow-common
       command: dag-processor
       healthcheck:
      test: ["CMD-SHELL", 'airflow jobs check --job-type DagProcessorJob --hostname "$${HOSTNAME}"']
         interval: 30s
         timeout: 10s
         retries: 5
         start_period: 30s
       restart: always
       depends_on:
         <<: *airflow-common-depends-on
         airflow-init:
           condition: service_completed_successfully
   ````
   
   ### Anything else?
   
   As mentioned earlier in
https://github.com/apache/airflow/issues/55768#issuecomment-3374894204, the
observed memory increase originates from both the scheduler process and its
subprocesses, the LocalExecutor workers.
   The scheduler's own memory growth has already been analyzed and discussed by
@kaxil in https://github.com/apache/airflow/issues/55768#issuecomment-3353598174,
so I will not cover it here.
   
   When running with the LocalExecutor, the default number of worker processes 
is 32.
   Since any memory increase per worker is multiplied across all 32 workers, 
even small leaks can have a critical impact on overall memory usage.
   
   I used Memray to analyze the worker processes (which are child processes of 
the scheduler) and identified three main causes of excessive memory allocation 
within them.
   
   ### 1. Importing the k8s client object
   
   First, here is the result of analyzing a single worker process:
   
[memray-flamegraph-output-111.html](https://github.com/user-attachments/files/22916972/memray-flamegraph-output-111.html)
   
   [In this 
section](https://github.com/apache/airflow/blob/main/shared/secrets_masker/src/airflow_shared/secrets_masker/secrets_masker.py#L164C10-L164C47),
 I confirmed that approximately 32 MB of memory is allocated per worker.
   Although the code only appears to reference the object's type, evaluating that
reference actually imports the entire kubernetes client and all of its generated
model submodules.
   Since each worker imports these modules independently, this results in an 
additional ~1 GB of total memory allocation across all workers.
   
   ### 2. Increasing memory from client SSL objects
   
   After modifying the problematic code in (1) to prevent the import, I ran 
memory profiling again.
   While the initial memory footprint per worker was significantly reduced, I
still observed gradual memory growth over time. (The suffix 0928 means the
snapshot was captured at 09:28.)
   
[remove-k8s-0928.html](https://github.com/user-attachments/files/22916986/remove-k8s-0928.html)
   
[remove-k8s-1001.html](https://github.com/user-attachments/files/22916988/remove-k8s-1001.html)
   
[remove-k8s-1035.html](https://github.com/user-attachments/files/22916990/remove-k8s-1035.html)
   
   [In the following 
section](https://github.com/apache/airflow/blob/main/task-sdk/src/airflow/sdk/api/client.py#L828),
 the SSL initialization appears not to properly release memory.
   Within about 30 minutes, a single worker’s memory grew from 8 MB → 23 MB, 
later exceeding 50 MB, and continued to increase steadily thereafter.
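If the growth really comes from repeatedly constructing SSL state, one mitigation is to build the `ssl.SSLContext` once per process and reuse it for every API client (httpx accepts a pre-built context via its `verify=` argument). This is a sketch of that idea under my assumption about the cause, not the actual fix:

```python
import ssl
from functools import lru_cache


@lru_cache(maxsize=1)
def shared_ssl_context() -> ssl.SSLContext:
    """Build the TLS verification context once per process.

    Creating an ssl.SSLContext loads the system CA bundle each time, so
    constructing one per task-execution API client accumulates memory.
    Caching it means every new client reuses the same context object.
    """
    return ssl.create_default_context()
```

A new client would then be created as e.g. `httpx.Client(verify=shared_ssl_context())`, so repeated client construction no longer allocates fresh SSL state.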
   
   ### 3. Memory inheritance from the parent process due to lazy forking
   
   After addressing issues (1) and (2), I verified that the overall memory 
consumption remained stable and did not exhibit continuous growth.
   However, I noticed that while initial PSS values were low, they gradually 
increased to relatively high levels over time.  
   
[memory_smem.txt](https://github.com/user-attachments/files/22917005/memory_smem.txt)
   
   It was difficult to track the exact distribution using Memray due to 
extensive shared memory usage — very little heap memory remained in the workers 
themselves.
   
   My hypothesis is as follows:
   Unlike Airflow 2.x, version 3.x introduced lazy worker initialization.
   As a result, when the scheduler (already holding significant memory) forks a
new worker, the worker initially shares the scheduler's pages via Copy-on-Write
(CoW); as either process subsequently writes to those pages they are duplicated,
so each worker's private memory (PSS) grows over time.
   
   ### Conclusion
   
   To verify this hypothesis, I modified the code to eagerly spawn worker 
processes before the scheduler enters its scheduling loop — effectively 
disabling lazy forking.
   The experiment showed that worker memory usage remained stable and no longer 
exhibited the previous pattern of gradual growth.
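The eager-spawn experiment can be sketched as follows. This is an illustration of the fork-ordering idea with a toy worker pool, not the actual Airflow modification; all names here are mine, and it assumes fork-based start semantics as used by the LocalExecutor on Linux:

```python
import multiprocessing as mp


def _worker(task_q, done_q):
    """Pull task IDs until a None sentinel arrives, echoing them back."""
    while True:
        item = task_q.get()
        if item is None:
            return
        done_q.put(item)


def run_with_eager_workers(tasks, num_workers=4):
    """Fork all workers up front, before the 'scheduler loop' grows the heap."""
    ctx = mp.get_context("fork")
    task_q, done_q = ctx.Queue(), ctx.Queue()
    # Eager spawn: children are forked while the parent image is still small,
    # so later scheduler allocations are never duplicated into them via CoW.
    workers = [ctx.Process(target=_worker, args=(task_q, done_q))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    # ... the scheduling loop would run here; its allocations now stay
    # private to the parent instead of inflating each worker's PSS ...
    for t in tasks:
        task_q.put(t)
    results = [done_q.get() for _ in tasks]
    for _ in workers:
        task_q.put(None)  # one stop sentinel per worker
    for w in workers:
        w.join()
    return results
```

The key difference from lazy forking is only the ordering: workers are created before, not after, the parent accumulates its working set.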
   
   
[memory_smem_2.txt](https://github.com/user-attachments/files/22917012/memory_smem_2.txt)
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   