Hi Airflow Community,

We (Airbnb) would like to open source an internally developed feature:
streaming deserialization of the target task in the `airflow run` process
to avoid DAG parsing.

This has been running in our production for around three quarters and has
been very stable. It brought average task start time down from 24 seconds
to 3 seconds (across all 3 of our Airflow clusters) and reduced peak
memory usage by around 45%.

I will create an AIP after the discussion.

Thanks,

Ping

Motivation

To run a task, Airflow parses the DAG file twice: once in the `airflow run
local` process and once in `airflow run raw`. This wastes memory, doubling
the RAM spent on DAG parsing. During peak hours the CPU can spike for long
stretches due to the double parsing, slowing down task start times and
even causing DAG parsing timeouts. This is especially true when the
cluster contains large, complex DAGs.

What change do you propose to make?

The DAG parsing in `airflow run local` can be removed by reading the DAG
from the serialized_dag table. Memory usage can be reduced further by
deserializing only the target task in a streaming way, instead of
deserializing the whole DAG object, as sketched below.
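
To make the idea concrete, here is a minimal sketch of what streaming task
deserialization could look like. It is not our production code: it assumes
the serialized DAG is the JSON blob stored in serialized_dag.data with the
tasks under a dag.tasks array (the exact layout varies across Airflow
versions), and it uses the third-party ijson library for incremental JSON
parsing.

    import io

    import ijson  # third-party incremental JSON parser


    def load_single_task(serialized_dag_blob: bytes, task_id: str) -> dict:
        """Extract one task's dict from a serialized DAG JSON blob
        without materializing the whole DAG object in memory."""
        # ijson yields each element of dag.tasks as it is parsed, so
        # peak memory stays proportional to one task, not the whole DAG.
        for task in ijson.items(io.BytesIO(serialized_dag_blob),
                                "dag.tasks.item"):
            if task.get("task_id") == task_id:
                return task
        raise ValueError(f"task {task_id!r} not found in serialized DAG")

The returned dict could then be handed to the existing operator
deserialization logic, so only one task object is ever built instead of
the full DAG.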
