GitHub user erl987 edited a discussion: Airflow with CeleryExecutor becomes 
extremely slow when executing a long-running task

# Task to solve with Airflow
Provide a pipeline that is triggered by external events (through the Airflow 
REST-API), then a main simulation task is executed taking 1–5 hours. Afterward 
a number of post-processing steps are performed on the simulation results and 
in a final task the results are uploaded to a database. The pipeline will be 
triggered several times per hour. It should be as standard as possible to allow 
for operation and maintenance by non-simulation domain staff, and the 
post-processing tasks should be flexible to be changed without bothering about 
the simulation task details. The number of workers should be easy to change, 
possibly using multiple machines.

For me, `Airflow` with `Celery` seems to fulfill all these requirements, and 
therefore I am trying to set up a demonstration system.

# Setup
* latest `Airflow==2.10.3` with `CeleryExecutor`
* `docker-compose` stack (from here: 
https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html#fetching-docker-compose-yaml)
* Ubuntu 24.04 LTS VM in the Azure cloud (16 GB memory, ca. 100 GB disk)

# Problem
The problem I experience is the follows:
* trigger a single execution of a long-running simulation task (> 1 hour, some 
GB memory consumption)
* first, the task is executed normally, only the worker process constantly 
takes 100% single CPU load
* **after ca. one hour, the `triggerer` container suddenly logs a lot of 
messages like this: `INFO - Triggerer's async thread was blocked for 0.27 
seconds, likely a badly-written trigger.`**
* the memory of the machine is still at only ca. 80% (of 16 GB) at that time, 
the CPU load at ca. 50%
* no problems are reported on the webpage, the logs from the task still 
progressing
* there are a lot of `celery` and `airflow` processes that take a CPU high 
load, **the actual worker process only has a low CPU load and continues 
extremely slowly**, also the webpage is very sluggish 

## DAG to reproduce the problem:
```python
from datetime import timedelta

from airflow import DAG
from airflow.decorators import task

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=1),
}

dag = DAG(
    'task_mock',
    default_args=default_args,
    description='A mock for resource intensive calculations',
    schedule_interval=None,
    start_date=None,
    tags=['mock'],
)


@task(task_id="resource_intensive_calculation_mock", dag=dag)
def task_mock():
    size_in_gb = 8
    large_array = allocate_large_memory(size_in_gb)
    if large_array is not None:
        process_indefinitely(large_array)


def allocate_large_memory(size_in_gb):
    import numpy as np

    # Convert GB to number of float64 elements (1 float64 is 8 bytes)
    num_elements = size_in_gb * (1024 ** 3) // 8
    try:
        # Allocate a large numpy array
        large_array = np.zeros(num_elements, dtype=np.float64)
        print(f"Allocated array with {num_elements} elements (~{size_in_gb} 
GB).")
        return large_array
    except MemoryError:
        print("Memory allocation failed. Reduce the size.")
        return None


def process_indefinitely(arr):
    import numpy as np

    iteration = 0
    while True:
        # Perform some operations on the array
        arr += 1
        arr *= 2
        arr /= 2
        arr -= 1

        iteration += 1
        if iteration % 10 == 0:
            print(f"Iteration {iteration}: Array sum = {np.sum(arr)}")


task_mock()
```

This is a mock task reproducing the behavior of the real-world simulation code 
in terms of memory (here 8 GB, but it also happens with less) and 
CPU-consumption.

# Questions
* Is it wise to use Airflow in such a use-case?
* Should I consider an alternative, if so which?
* This behavior feels for me like a bug?
* Any workarounds to mitigate the problem?

I am happy to provide more details if required, but I think that this DAG 
should be reproducible with the standard Docker Compose stack.
<img width="844" alt="Screenshot 2024-11-27 at 19 28 20" 
src="https://github.com/user-attachments/assets/232480f1-6b0c-4521-80cc-313034791ada";>
<img width="837" alt="Screenshot 2024-11-27 at 19 28 31" 
src="https://github.com/user-attachments/assets/814899f6-51fe-4c13-baf2-5a5301fd59b1";>



GitHub link: https://github.com/apache/airflow/discussions/44430

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to