Yeah. I think "Docker" as a "common execution environment" is certainly
convenient in some situations. But as mentioned before, it should not be the
default. As much as I love containers, I also know - from the surveys we
run, but also from interacting with many users of Airflow - that for many of
our users containers are not the "default" way of doing things, and we
should embrace that.

I can see some multi-tenancy deployments in the future benefiting from
having different sets of dependencies - both for parsing and for execution
(say, one environment per team). I think the current AIP-43 proposal
already handles a big part of it. You could have - dockerised or not - a
different environment to parse your dags per "subfolder", so each
DagProcessor for a sub-folder could have its own set of dependencies
(coming either from a virtualenv or from Docker). All this without putting
a "requirement" on using Docker. I think we are well aligned on the goal.
I see the choice as:

a) whether Airflow should be able to choose the runtime "environment" in
which to parse a Dag based on metadata (which is what your proposal is about)
b) whether each DagProcessor (different per team) should be started in the
right "environment" to begin with (which is already made possible by
AIP-43) - this being part of the Deployment rather than the Airflow code.

I think we should go with b) first, and if b) turns out not to be enough, a
future AIP to implement a) is also possible.

When it comes to task execution - this is something we can definitely
discuss in the future, as the next AIP. We should think about how to
seamlessly map the execution of a task onto different environments. We can
make it a point of discussion at the January meeting I plan to hold.

As Ash mentioned, we already have some ways to do it (Celery Queues, the
K8S executor, but also the Airflow 2 @task.docker decorator). I am sure we
can look at those once AIP-43/44 are discussed/approved and see whether it
makes sense to add another way.
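
For illustration, a minimal sketch of the decorator approach in Airflow 2
(the image name and the task body below are just placeholders, and it needs
the docker provider installed):

    from datetime import datetime
    from airflow.decorators import dag, task

    @dag(schedule_interval=None, start_date=datetime(2021, 12, 1), catchup=False)
    def isolated_runtime_example():

        # The callable runs inside the given image, so the task gets its own
        # dependencies without installing them on the worker host.
        @task.docker(image="my-team/etl-runtime:latest")  # placeholder image
        def transform():
            import pandas as pd  # available only inside the image
            return int(pd.Series([1, 2, 3]).sum())

        transform()

    isolated_runtime_example()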

J.


On Fri, Dec 17, 2021 at 11:53 AM Alexander Shorin <[email protected]> wrote:

> How would your idea work on systems without docker, like FreeBSD? And why
> did you make tasks so leaky that they could not be isolated with common
> tools like system packages, venv, etc.?
>
> --
> ,,,^..^,,,
>
>
> On Fri, Dec 17, 2021 at 2:53 AM Ping Zhang <[email protected]> wrote:
>
>> Hi Airflow Community,
>>
>> This is Ping Zhang from the Airbnb Airflow team. We would like to open
>> source our internal feature: docker runtime isolation for airflow tasks.
>> It has been running in our production for close to a year and is very stable.
>>
>> I will create an AIP after the discussion.
>>
>> Thanks,
>>
>> Ping
>>
>>
>> Motivation
>>
>> An Airflow worker host is a shared resource among all tasks running on it.
>> The host therefore has to provision the dependencies of every task,
>> including system-level and python application-level dependencies. This
>> leads to a very fat runtime, and thus long host provision times and low
>> elasticity of worker resources. It makes it challenging to prepare for
>> unexpected burst load, such as a large backfill or a rerun of large DAGs.
>>
>> The lack of runtime isolation also makes operations such as adding or
>> upgrading system and python dependencies challenging and risky, and it is
>> almost impossible to remove any dependency. It incurs a lot of additional
>> operating cost for the team, since users do not have permission to
>> add/upgrade python dependencies themselves and have to coordinate with us.
>> When there are package version conflicts, the packages cannot be installed
>> directly on the host at all; users have to fall back to
>> PythonVirtualenvOperator, which slows down their development cycle.
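>>
>> For context, in Airflow 2 terms, that workaround looks roughly like this
>> (the pinned requirement and the callable body are placeholders):
>>
>>     from datetime import datetime
>>     from airflow import DAG
>>     from airflow.operators.python import PythonVirtualenvOperator
>>
>>     def callable_with_conflicting_deps():
>>         # imported inside the callable: the package only exists in the venv
>>         import pandas as pd
>>         print(pd.__version__)
>>
>>     with DAG("venv_workaround_example", start_date=datetime(2021, 12, 1),
>>              schedule_interval=None, catchup=False):
>>         # Builds a throwaway virtualenv for every run - slow, but it keeps
>>         # the conflicting package off the shared worker host.
>>         PythonVirtualenvOperator(
>>             task_id="transform_in_venv",
>>             python_callable=callable_with_conflicting_deps,
>>             requirements=["pandas==1.3.5"],  # placeholder pin
>>             system_site_packages=False,
>>         )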
>>
>> What change do you propose to make?
>>
>> To solve those problems, we propose introducing runtime isolation for
>> Airflow tasks. It leverages docker as the task runtime environment. There
>> are several benefits:
>>
>>    1. Provide runtime isolation at the task level
>>    2. Customize the runtime used to parse dag files
>>    3. Lean runtime on the airflow host, which enables high worker resource
>>       elasticity
>>    4. Immutable and portable task execution runtime
>>    5. Process isolation ensures that all subprocesses of a task are cleaned
>>       up after docker exits (we have seen orphaned hive and spark
>>       subprocesses left behind after the airflow run process exits)
>>
>> Changes
>>
>> Airflow Worker
>>
>> In the new design, the `airflow run local` and `airflow run raw` processes
>> run inside a docker container launched by the airflow worker. This way,
>> the airflow worker runtime only needs the minimum requirements to run
>> airflow core and docker.
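>>
>> Purely as an illustration (not the actual implementation - the CLI form,
>> environment and error handling are simplified placeholders, shown in
>> Airflow 2 terms with the docker SDK for Python):
>>
>>     import docker  # docker SDK for Python
>>
>>     def run_task_in_container(image: str, dag_id: str, task_id: str,
>>                               execution_date: str) -> int:
>>         """Launch `airflow tasks run ... --local` inside the task's image
>>         and return the container's exit code to the worker."""
>>         client = docker.from_env()
>>         container = client.containers.run(
>>             image=image,
>>             command=["airflow", "tasks", "run", dag_id, task_id,
>>                      execution_date, "--local"],
>>             environment={
>>                 # the container must point at the same metadata DB as the worker
>>                 "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "<same connection as the worker>",
>>             },
>>             detach=True,
>>         )
>>         # When the container exits, every subprocess of the task dies with
>>         # it, so nothing is left orphaned on the worker host.
>>         return container.wait()["StatusCode"]
>>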
>> Airflow Scheduler
>>
>> Instead of processing the DAG file directly, the DagFileProcessor process
>>
>>    1. launches the docker container required by that DAG file, processes
>>       the file inside it, and persists the serializable DAGs (SimpleDags)
>>       to a file so that the result can be read outside the docker container
>>    2. reads the file persisted by the docker container, deserializes it,
>>       and puts the result into the multiprocessing queue
>>
>>
>> This ensures that the DAG parsing runtime is exactly the same as the DAG
>> execution runtime.
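>>
>> A rough sketch of that two-step hand-off (the in-container entrypoint, the
>> file name and the JSON format are hypothetical):
>>
>>     import json
>>     import tempfile
>>     from pathlib import Path
>>
>>     import docker
>>
>>     def parse_dag_file_in_container(image: str, dag_file: str) -> list:
>>         """Step 1: parse the DAG file inside its image and persist the
>>         result to a shared volume. Step 2: read it back on the host side."""
>>         client = docker.from_env()
>>         with tempfile.TemporaryDirectory() as shared_dir:
>>             client.containers.run(
>>                 image=image,
>>                 # hypothetical entrypoint that imports the DAG file and dumps
>>                 # the serialized DAGs (SimpleDags) to the shared volume
>>                 command=["python", "-m", "dag_parser", dag_file,
>>                          "--output", "/shared/parsed_dags.json"],
>>                 volumes={shared_dir: {"bind": "/shared", "mode": "rw"}},
>>                 remove=True,
>>             )
>>             with (Path(shared_dir) / "parsed_dags.json").open() as f:
>>                 return json.load(f)["dags"]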
>>
>> This requires a DAG definition file to tell the DAG file processing loop
>> which docker image to use to process it. We can easily achieve this by
>> having a metadata file alongside the DAG definition file that defines the
>> docker runtime. To ease the burden on users, a default docker image is
>> used when a DAG definition file does not require a customized runtime.
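>>
>> As an illustration only (the sidecar file name, its format and the default
>> image below are hypothetical, not part of this proposal):
>>
>>     import json
>>     from pathlib import Path
>>
>>     DEFAULT_IMAGE = "airflow-default-runtime:latest"  # hypothetical default
>>
>>     def resolve_runtime_image(dag_file: str) -> str:
>>         """Pick the docker image for a DAG file from a sidecar metadata
>>         file, e.g. my_dag.py -> my_dag.runtime.json, else the default."""
>>         meta_file = Path(dag_file).with_suffix(".runtime.json")
>>         if meta_file.exists():
>>             return json.loads(meta_file.read_text()).get("image", DEFAULT_IMAGE)
>>         return DEFAULT_IMAGE
>>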
>> As a Whole
>>
>> Best wishes
>>
>> Ping Zhang
>>
>
