Hi Airflow Community,

This is Ping Zhang from the Airbnb Airflow team. We would like to open source one of our internal features: Docker runtime isolation for Airflow tasks. It has been running in our production for close to one year and has been very stable.
I will create an AIP after the discussion.

Thanks,
Ping

Motivation

An Airflow worker host is a shared resource among all tasks running on it, so the host has to provision the dependencies of every task, including system-level and Python application-level dependencies. This leads to a very fat runtime, and therefore long host provisioning times and low elasticity in the worker fleet, which makes it hard to absorb unexpected burst load such as a large backfill or a rerun of large DAGs.

The lack of runtime isolation also makes operations challenging and risky: adding or upgrading system and Python dependencies is hard, and removing any dependency is almost impossible. It also incurs significant additional operating cost for the team, because users do not have permission to add or upgrade Python dependencies themselves and have to coordinate with us. When packages have version conflicts, they cannot be installed directly on the host, so users have to fall back on PythonVirtualenvOperator, which slows down their development cycle.

What change do you propose to make?

To solve these problems, we propose introducing runtime isolation for Airflow tasks, using Docker as the task runtime environment. There are several benefits:
1. Runtime isolation at the task level
2. Customized runtimes for parsing DAG files
3. A lean runtime on the Airflow host, which enables high worker resource elasticity
4. An immutable and portable task execution runtime
5. Process isolation, which ensures that all subprocesses of a task are cleaned up after the Docker container exits (we have seen orphaned Hive and Spark subprocesses left behind after the airflow run process exits)

Changes

Airflow Worker

In the new design, the `airflow run --local` and `airflow run --raw` processes run inside a Docker container launched by the Airflow worker (a rough sketch is in the P.S. below). This way, the Airflow worker runtime only needs the minimum requirements to run Airflow core and Docker.

Airflow Scheduler

Instead of processing a DAG file directly, the DagFileProcessor process:
1. launches the Docker container required by that DAG file to process it, and persists the serializable DAGs (SimpleDags) to a file so that the result can be read outside the Docker container
2. reads the file persisted by the Docker container, deserializes it, and puts the result onto the multiprocessing queue

This ensures that the DAG parsing runtime is exactly the same as the DAG execution runtime. It requires a way for a DAG definition file to tell the DAG file processing loop which Docker image to use; we can achieve this with a metadata file that sits alongside the DAG definition file and defines the Docker runtime (see the sketches in the P.S. below). To ease the burden on users, a default Docker image is used when a DAG definition file does not need a customized runtime.

As a Whole

[Overall architecture diagram]

Best wishes,
Ping Zhang
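P.S. To make the discussion more concrete, here are a few rough sketches. None of this is final API or actual implementation; all names, paths, images, and flags below are illustrative assumptions only.

First, the worker side. The sketch below shows one way a worker could wrap the task command in `docker run` instead of executing it directly on the host; the default image name, volume mount, and environment passthrough are assumptions for illustration.

    import shlex

    DEFAULT_IMAGE = "airflow-task-runtime:latest"  # assumed default task image

    def build_docker_command(run_command, image=DEFAULT_IMAGE,
                             dags_folder="/usr/local/airflow/dags"):
        """Wrap an `airflow run --local ...` command in `docker run`."""
        return [
            "docker", "run", "--rm",
            # Mount the DAG folder read-only and pass the metadata DB connection
            # through the environment so the containerized process can reach it.
            "-v", "{d}:{d}:ro".format(d=dags_folder),
            "-e", "AIRFLOW__CORE__SQL_ALCHEMY_CONN",
            image,
        ] + list(run_command)

    if __name__ == "__main__":
        run_cmd = ["airflow", "run", "--local",
                   "example_dag", "example_task", "2021-01-01"]
        docker_cmd = build_docker_command(run_cmd)
        print(" ".join(shlex.quote(c) for c in docker_cmd))
        # The worker would launch this in place of the bare command, e.g.
        # subprocess.Popen(docker_cmd)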
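Second, the scheduler side. The sketch below illustrates the two-step flow described under "Airflow Scheduler": parse inside the container, persist the result to a shared path, then read and deserialize it outside. The container entrypoint name (`parse_and_serialize`) and the use of pickle are placeholders; the real serialization would be whatever SimpleDag supports.

    import os
    import pickle
    import subprocess

    def process_dag_file_in_docker(dag_file, image,
                                   shared_dir="/tmp/airflow-dag-parsing"):
        """Parse one DAG file inside its required image and return the deserialized result."""
        os.makedirs(shared_dir, exist_ok=True)
        output_path = os.path.join(shared_dir, os.path.basename(dag_file) + ".pkl")

        # Step 1: parse inside the image required by this DAG file and persist the
        # serializable DAGs to a file that is visible outside the container.
        subprocess.run([
            "docker", "run", "--rm",
            "-v", "{d}:{d}".format(d=shared_dir),
            "-v", "{f}:{f}:ro".format(f=dag_file),
            image,
            "python", "-m", "parse_and_serialize",  # hypothetical entrypoint inside the image
            dag_file, output_path,
        ], check=True)

        # Step 2: read the persisted file, deserialize it, and hand the result to the
        # multiprocessing queue used by the DAG file processing loop (not shown here).
        with open(output_path, "rb") as f:
            return pickle.load(f)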
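Finally, the per-DAG runtime metadata. One possible (purely illustrative) convention is a small sidecar file next to the DAG definition file, e.g. `my_dag.runtime.json` containing `{"docker_image": "team-x/etl-runtime:1.4"}`, with the default image used when no sidecar exists:

    import json
    import os

    DEFAULT_IMAGE = "airflow-task-runtime:latest"  # assumed default image

    def docker_image_for(dag_file):
        """Return the Docker image declared by the DAG file's sidecar metadata, if any."""
        sidecar = os.path.splitext(dag_file)[0] + ".runtime.json"
        if os.path.exists(sidecar):
            with open(sidecar) as f:
                return json.load(f).get("docker_image", DEFAULT_IMAGE)
        return DEFAULT_IMAGE

The same lookup could be shared by the worker (to pick the image for `docker run`) and by the DAG file processing loop (to pick the image used for parsing), which is what keeps the parsing and execution runtimes identical.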
