Hi Ash,

Thanks for the inputs. I should have specifically called out that the docker
runtime is an add-on feature that is controlled by a feature flag.

Users/infra teams can choose whether to enable it. When it is not enabled,
Airflow keeps the current behavior.
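
As a rough illustration only (the config section and flag name below are
placeholders, not the actual setting), the gating logic looks roughly like
this:

    # Hypothetical flag name, purely for illustration.
    from airflow.configuration import conf

    docker_runtime_enabled = conf.getboolean(
        "core", "docker_runtime_isolation", fallback=False
    )

    if docker_runtime_enabled:
        ...  # launch the task inside its docker image
    else:
        ...  # existing behavior: run the task directly on the worker host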

This docker runtime feature helped a lot during our py3 upgrade project: we
simply built a py3 docker image to run tasks and parse DAGs, without needing
to spin up a new Airflow cluster.

Best wishes

Ping Zhang


On Fri, Dec 17, 2021 at 2:31 AM Ash Berlin-Taylor <[email protected]> wrote:

> Hi Ping,
>
> (The dev list doesn't allow attachments, so we can't see any of the images
> you've posted, so some of my questions might have been addressed by those
> images.)
>
> It seems that a lot of the goals here are overlapping with the AIP-1 and
> proposed separation of dag processor from scheduler and multi-tenancy work
> in general.
> Your description of how the scheduler and DAG parsing process operate is
> based on the 1.10 mode of operation, but that has changed in 2.0 -- the
> scheduler _only_ operates on the serialized representation and doesn't need
> the result of the DAG parsing process. Breaking this tight coupling was one
> of the major speed-ups I achieved.
>
> The exact details aren't clear from your email yet, but here are my initial
> comments:
>
> 1. Runtime isolation of task execution is already possible by using the
> KubernetesExecutor
>
> 2. Running short-lived processes (such as what I think you are proposing
> for dag parsing) in a Kube cluster isn't really practical, as the spin-up
> time of pods is highly variable and can be on the order of minutes
>
> 3. Not everyone has docker available or is comfortable running it -- we
> 100% need to support running without Docker or containers still.
>
> 4. Many of our users are Data Scientists or Engineers, and so aren't happy
> with building containers.
>
> On Thu, Dec 16 2021 at 15:52:02 -0800, Ping Zhang <[email protected]>
> wrote:
>
> Hi Airflow Community,
>
> This is Ping Zhang from the Airbnb Airflow team. We would like to open
> source our internal feature: docker runtime isolation for airflow tasks. It
> has been running in our production for close to a year and is very stable.
>
> I will create an AIP after the discussion.
>
> Thanks,
>
> Ping
>
>
> Motivation
>
> An Airflow worker host is a shared resource among all the tasks running on
> it. Each host therefore has to provision the dependencies for all tasks,
> including system-level and Python application-level dependencies. This
> leads to a very fat runtime, and therefore long host provisioning times and
> low elasticity in the worker resources. That makes it challenging to
> prepare for unexpected burst load, such as a large backfill or a rerun of
> large DAGs.
>
> The lack of runtime isolation also makes operations such as adding or
> upgrading system and Python dependencies challenging and risky, and
> removing any dependency is almost impossible. It incurs a lot of additional
> operating cost for the team as well: users do not have permission to add or
> upgrade Python dependencies themselves, so they have to coordinate with us.
> When there are package version conflicts, the conflicting packages cannot
> be installed directly on the host, and users have to fall back to
> PythonVirtualenvOperator, which slows down their development cycle.
>
> What change do you propose to make?
>
> To solve those problems, we propose introducing runtime isolation for
> Airflow tasks. It leverages docker as the task runtime environment. There
> are several benefits:
>
>    1. Provide runtime isolation at the task level
>    2. Customize the runtime used to parse dag files
>    3. Lean runtime on the airflow host, which enables high worker resource
>    elasticity
>    4. Immutable and portable task execution runtime
>    5. Process isolation ensures that all subprocesses of a task are cleaned
>    up after docker exits (we have seen some orphaned hive and spark
>    subprocesses after the airflow run process exits)
> Changes
>
> Airflow Worker
>
> In the new design, the `airflow run local` and `airflow run raw` processes
> run inside a docker container, which is launched by the airflow worker. In
> this way, the airflow worker runtime only needs the minimum requirements to
> run airflow core and docker.
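>
> As a rough sketch (illustrative only; the mounts, image handling and exact
> command are placeholders rather than the actual implementation), the worker
> would wrap the task command roughly like this:
>
>     import subprocess
>
>     def run_task_in_docker(image, task_cmd):
>         # Illustrative only: volume mounts and environment are placeholders.
>         docker_cmd = [
>             "docker", "run", "--rm",
>             "-v", "/usr/local/airflow/dags:/usr/local/airflow/dags",
>             "-v", "/usr/local/airflow/logs:/usr/local/airflow/logs",
>             image,
>         ] + task_cmd
>         # task_cmd is the usual `airflow run ...` command for the task,
>         # in local or raw mode.
>         subprocess.run(docker_cmd, check=True)
>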
> Airflow Scheduler
>
> Instead of processing the DAG file directly, the DagFileProcessor process
>
>    1. launches the docker container required by that DAG file to process it
>    and persists the serializable DAGs (SimpleDags) to a file so that the
>    result can be read outside the docker container
>    2. reads the file persisted from the docker container, deserializes it,
>    and puts the result into the multiprocess queue
>
>
> This ensures the DAG parsing runtime is exactly the same as DAG execution
> runtime.
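>
> A rough sketch of those two steps (illustrative only; the entry point, file
> format and paths below are placeholders):
>
>     import json
>     import subprocess
>
>     def process_dag_file_in_docker(image, dag_file_path, result_path):
>         # Step 1: parse the DAG file inside its own docker image and persist
>         # the serialized DAGs to a file on a volume shared with the host.
>         subprocess.run(
>             ["docker", "run", "--rm",
>              "-v", "/shared:/shared",
>              image,
>              "python", "-m", "serialize_dags",  # hypothetical entry point
>              dag_file_path, result_path],
>             check=True,
>         )
>         # Step 2: read the persisted file back on the host and deserialize
>         # it before putting the result into the multiprocess queue.
>         with open(result_path) as f:
>             return json.load(f)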
>
> This requires a DAG definition file to tell the DAG file processing loop
> which docker image to use to process it. We can easily achieve this by
> having a metadata file alongside the DAG definition file that defines the
> docker runtime. To ease the burden on users, a default docker image is used
> when a DAG definition file does not require a customized runtime.
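>
> For example (illustrative only; the sidecar file name, format and keys are
> placeholders, since the metadata format is not finalized here), resolving
> the image could look like:
>
>     import os
>     import yaml  # assuming a YAML sidecar file, purely for illustration
>
>     DEFAULT_IMAGE = "airflow-runtime:default"  # placeholder default image
>
>     def resolve_docker_image(dag_file_path):
>         # Look for a metadata file next to the DAG definition file,
>         # e.g. my_dag.py -> my_dag.yaml (naming is hypothetical).
>         meta_path = os.path.splitext(dag_file_path)[0] + ".yaml"
>         if os.path.exists(meta_path):
>             with open(meta_path) as f:
>                 meta = yaml.safe_load(f) or {}
>             return meta.get("docker_image", DEFAULT_IMAGE)
>         # No metadata file: fall back to the default image.
>         return DEFAULT_IMAGE
>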
> As a Whole
>
> [Overall architecture diagram was attached here; the dev list strips
> attachments.]
>
> Best wishes
>
> Ping Zhang
>
>
