First pass done, especially around the security aspects of it. Looks great.

On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> I’ve written up a lot more of the implementation details into an AIP:
> https://cwiki.apache.org/confluence/x/xgmTEg
>
> It’s still marked as Draft/Work In Progress for now, as there are a few
> details we know we need to cover before the doc is complete.
>
> (There was also some discussion in the dev call about a different name
> for this AIP.)
>
> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> >> IMHO - if we do not want to support DB access at all from workers,
> >> triggerers and DAG file processors, we should replace the current
> >> "DB"-bound interface with a new one specifically designed for this
> >> bi-directional direct communication Executor <-> Workers,
> >
> > That is exactly what I was thinking too (both that "no DB" should be
> > the only option in v3, and that we need a bidirectional,
> > purpose-designed interface), and I am working up the details.
> >
> > One of the key features of this will be giving each task try a
> > "strong identity" that the API server can use to identify and trust
> > the requests, likely some form of signed JWT.
> >
> > I just need to finish off some other work before I can move over to
> > focus fully on Airflow.
> >
> > -a
> >
> > On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
> >> I added some comments here, and I think there is one big thing that
> >> should be clarified when we get to "task isolation" - mainly its
> >> dependence on AIP-44.
> >>
> >> The Internal gRPC API (AIP-44) was only designed the way it was to
> >> allow the same codebase to be used with or without the DB. It's
> >> based on the assumption that a limited set of changes would be
> >> needed (that was underestimated) in order to support both DB and
> >> gRPC ways of communication between workers/triggerers/DAG file
> >> processors at the same time.
> >> That was a basic assumption for AIP-44 - that we would want to keep
> >> both ways and maximum backwards compatibility (including the "pull"
> >> model of the worker getting connections and variables, and updating
> >> task state in the Airflow DB). We are still using the DB as the way
> >> to communicate between those components, and this does not change
> >> with AIP-44.
> >>
> >> But for Airflow 3 the whole context changes. If we go with the
> >> assumption that Airflow 3 will only have isolated tasks and no DB
> >> "option", I personally think using AIP-44 for that is a mistake.
> >> AIP-44 is merely a wrapper over existing DB calls, designed to be
> >> kept updated together with the DB code; the whole synchronisation
> >> of state, heartbeats, variables and connection access still uses
> >> the same "DB communication" model, and there is basically no way we
> >> can make it more scalable this way. We will still have the same
> >> limitations on the DB - a number of DB connections will simply be
> >> replaced with a number of gRPC connections. Essentially, more
> >> scalability and performance have never been the goal of AIP-44 -
> >> the assumption is that it only brings isolation and nothing more
> >> changes. So I think it does not address some of the fundamental
> >> problems stated in this "isolation" document.
> >>
> >> Essentially, AIP-44 merely exposes a small-ish number of methods
> >> (bigger than initially anticipated), but it only wraps around the
> >> existing DB mechanism. From the performance and scalability point
> >> of view, we do not get much more than we currently get with
> >> pgbouncer, which essentially turns a big number of connections
> >> coming from workers into a smaller number of pooled connections
> >> that pgbouncer manages internally and multiplexes the calls over.
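[Editorial note: for readers unfamiliar with the pooling model referenced here, a minimal pgbouncer configuration along these lines shows how a large number of worker connections is multiplexed over a small server-side pool. Hostnames and sizes are illustrative only, not taken from this thread.]

```ini
; Illustrative pgbouncer setup: many client connections share a small pool.
[databases]
; clients connecting to "airflow" are routed to the real metadata DB
; (example hostname, not a real deployment)
airflow = host=airflow-db.example.internal port=5432 dbname=airflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is handed back between transactions
pool_mode = transaction
; up to 1000 worker connections are funneled into 20 real DB connections
max_client_conn = 1000
default_pool_size = 20
```

This illustrates the multiplexing point only; as the email goes on to say, such a pooler cannot restrict *which* operations a worker performs, which is the key difference from a purpose-built API server.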
> >> With the difference that, unlike the AIP-44 Internal API server,
> >> pgbouncer does not limit the operations you can perform from the
> >> worker/triggerer/DAG file processor - that's the main difference
> >> between using pgbouncer and using our own Internal API server.
> >>
> >> IMHO - if we do not want to support DB access at all from workers,
> >> triggerers and DAG file processors, we should replace the current
> >> "DB"-bound interface with a new one specifically designed for this
> >> bi-directional direct communication Executor <-> Workers, more in
> >> line with what Jens described in AIP-69 (for example, WebSocket and
> >> asynchronous communication come immediately to my mind if I did not
> >> have to use the DB for that communication). This is also why I put
> >> AIP-67 on hold: IF we go in the direction of a "new" interface
> >> between worker, triggerer and DAG file processor, it might be way
> >> easier (and safer) to introduce multi-team in Airflow 3 rather than
> >> 2 (or we can implement it differently in Airflow 2 and in
> >> Airflow 3).
> >>
> >> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka
> >> <vik...@astronomer.io.invalid> wrote:
> >>
> >>> Fellow Airflowers,
> >>>
> >>> I am following up on some of the proposed changes in the Airflow 3
> >>> proposal
> >>> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
> >>> where more information was requested by the community, specifically
> >>> around the injection of Task Execution Secrets. This topic has been
> >>> discussed at various times under a variety of names, but here is a
> >>> holistic proposal around the whole task context mechanism.
> >>>
> >>> This is not yet a full-fledged AIP, but is intended to facilitate a
> >>> structured discussion, which will then be followed up with a formal
> >>> AIP within the next two weeks.
> >>> I have included most of the text here, but please give detailed
> >>> feedback in the attached document
> >>> <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>,
> >>> so that we can have a contextual discussion around specific points
> >>> which may need more detail.
> >>>
> >>> ---
> >>> Motivation
> >>>
> >>> Historically, Airflow’s task execution context has been oriented
> >>> around local execution within a relatively trusted networking
> >>> cluster.
> >>>
> >>> This includes:
> >>>
> >>> - the interaction between the Executor and the process of launching
> >>> a task on Airflow Workers,
> >>> - the interaction between the Workers and the Airflow meta-database
> >>> for connection and environment information as part of initial task
> >>> startup,
> >>> - the interaction between the Airflow Workers and the rest of
> >>> Airflow for heartbeat information, and so on.
> >>>
> >>> This has been accomplished by colocating all of the Airflow task
> >>> execution code with the user task code in the same container and
> >>> process.
> >>>
> >>> For Airflow users at scale, i.e. those supporting multiple data
> >>> teams, this has posed many operational challenges:
> >>>
> >>> - Dependency conflicts for administrators supporting data teams
> >>> using different versions of providers, libraries, or Python
> >>> packages.
> >>> - Security challenges in running customer-defined code (task code
> >>> within the DAGs) for multiple customers within the same operating
> >>> environment and service accounts.
> >>> - Scalability of Airflow, since one of the core Airflow scalability
> >>> limitations has been the number of concurrent database connections
> >>> supported by the underlying database instance.
> >>> To alleviate this problem, we have consistently, as an Airflow
> >>> community, recommended the use of PgBouncer for connection pooling
> >>> as part of an Airflow deployment.
> >>> - Operational issues caused by unintentional reliance on internal
> >>> Airflow constructs within the DAG/Task code, which unexpectedly
> >>> shows up only as part of Airflow production operations, coincident
> >>> with, but not limited to, upgrades and migrations.
> >>> - Operational management based on the above for Airflow platform
> >>> teams at scale, because different data teams naturally operate at
> >>> different velocities. Attempting to support these different teams
> >>> within a common Airflow environment is unnecessarily challenging.
> >>>
> >>> The internal API to reduce the need for interaction between the
> >>> Airflow Workers and the meta-database is a big and necessary step
> >>> forward. However, it doesn’t fully address the above challenges.
> >>> The proposal below builds on the internal API proposal and goes
> >>> significantly further, to not only address the challenges above but
> >>> also enable the following key use cases:
> >>>
> >>> 1. Ensure that this interface reduces the interaction between the
> >>> code running within the Task and the rest of Airflow. This is to
> >>> address unintended ripple effects from core Airflow changes, which
> >>> have caused numerous Airflow upgrade issues because Task (i.e. DAG)
> >>> code relied on core Airflow abstractions. This has been a common
> >>> problem pointed out by numerous Airflow users, including early
> >>> adopters.
> >>> 2. Enable quick, performant execution of tasks on local, trusted
> >>> networks, without requiring the Airflow workers/tasks to connect
> >>> to the Airflow database to obtain all the information required for
> >>> task startup.
> >>> 3.
> >>> Enable remote execution of Airflow tasks across network
> >>> boundaries, by establishing a clean interface for Airflow workers
> >>> on remote networks to be able to connect back to a central Airflow
> >>> service and access all the information needed for task execution.
> >>> This is foundational work for remote execution.
> >>> 4. Enable a clean, language-agnostic interface for task execution,
> >>> with support for multiple language bindings, so that Airflow tasks
> >>> can be written in languages beyond Python.
> >>>
> >>> Proposal
> >>>
> >>> The proposal here has multiple parts, as detailed below.
> >>>
> >>> 1. Formally split out the Task Execution Interface as the Airflow
> >>> Task SDK (possibly named the Airflow SDK), which would be the only
> >>> interface to and from Airflow Task user code to the Airflow system
> >>> components, including the meta-database, Airflow Executor, etc.
> >>> 2. Disable all direct database interaction between the Airflow
> >>> Workers, including Tasks being run on those Airflow Workers, and
> >>> the Airflow meta-database.
> >>> 3. The Airflow Task SDK will include interfaces to:
> >>> - access needed Airflow Connections, Variables, and XCom values,
> >>> - report heartbeats,
> >>> - record logs,
> >>> - report metrics.
> >>> 4. The Airflow Task SDK will support a Push mechanism for speedy
> >>> local execution in trusted environments.
> >>> 5. The Airflow Task SDK will also support a Pull mechanism for
> >>> remote task execution environments to access information from an
> >>> Airflow instance across network boundaries.
> >>> 6. The Airflow Task SDK will be designed to support multiple
> >>> language bindings, with the first language binding of course being
> >>> Python.
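[Editorial note: as an illustration of the SDK surface sketched in the proposal above, a minimal Python mock-up follows. Every class and method name here is a hypothetical assumption for discussion, not the actual proposed API.]

```python
# Hypothetical sketch of a Task SDK surface, per the proposal's points 1-3:
# user task code talks only to this object, never to the metadata DB.
# All names are illustrative assumptions, not the real Airflow 3 interface.
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    """Everything a task try needs at startup, supplied by Airflow."""
    dag_id: str
    task_id: str
    try_number: int
    connections: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)


class TaskSDKClient:
    """Single interface between user task code and the Airflow system."""

    def __init__(self, context: TaskContext):
        self._ctx = context

    def get_connection(self, conn_id: str):
        # No direct DB access: values come from the pushed/pulled context
        return self._ctx.connections[conn_id]

    def get_variable(self, key: str):
        return self._ctx.variables[key]

    def heartbeat(self) -> dict:
        # A real implementation would report liveness to the API server;
        # here we just return the payload it might send.
        return {"task_id": self._ctx.task_id, "try": self._ctx.try_number}


ctx = TaskContext(
    "example_dag", "extract", 1,
    connections={"pg": "postgresql://example"},
    variables={"env": "prod"},
)
client = TaskSDKClient(ctx)
print(client.get_variable("env"))  # -> prod
```

In this sketch the same TaskContext could be pushed to the worker by the executor (point 4, trusted local execution) or pulled over the network from a central Airflow service (point 5, remote execution); the user code is unaware of which path supplied it.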
> >>> Assumption: The existing AIP for the Internal API covers the
> >>> interaction between the Airflow workers and the Airflow
> >>> meta-database for heartbeat information, persisting XComs, and so
> >>> on.
> >>>
> >>> --
> >>> Best regards,
> >>>
> >>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau