First pass done, especially around the security aspects of it. Looks great.

On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> I’ve written up a lot more of the implementation details into an AIP:
> https://cwiki.apache.org/confluence/x/xgmTEg
>
> It’s still marked as Draft/Work In Progress for now, as there are a few
> details we know we need to cover before the doc is complete.
>
> (There was also some discussion in the dev call about a different name
> for this AIP.)
>
> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> >> IMHO - if we do not want to support DB access at all from workers,
> >> triggerers and DAG file processors, we should replace the current
> >> "DB"-bound interface with a new one specifically designed for this
> >> bi-directional direct communication Executor <-> Workers,
> >
> > That is exactly what I was thinking too (both that "no DB" should be
> > the only option in v3, and that we need a bidirectional,
> > purpose-designed interface), and I am working up the details.
> >
> > One of the key features of this will be giving each task try a
> > "strong identity" that the API server can use to identify and trust
> > the requests, likely some form of signed JWT.
> >
> > I just need to finish off some other work before I can move over to
> > focus fully on Airflow.
> >
> > -a
> >
> > On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
> >> I added some comments here, and I think there is one big thing that
> >> should be clarified when we get to "task isolation" - mainly its
> >> dependence on AIP-44.
> >>
> >> The Internal gRPC API (AIP-44) was only designed the way it was to
> >> allow the same codebase to be used with or without the DB. It's
> >> based on the assumption that a limited set of changes would be
> >> needed (that was underestimated) in order to support both DB and
> >> gRPC ways of communication between workers/triggerers/DAG file
> >> processors at the same time.
> >> That was a basic assumption for AIP-44 - that we would want to keep
> >> both ways and maximum backwards compatibility (including the "pull"
> >> model of the worker getting connections and variables, and updating
> >> task state in the Airflow DB). We are still using the DB as the way
> >> to communicate between those components, and this does not change
> >> with AIP-44.
> >>
> >> But for Airflow 3 the whole context changes. If we go with the
> >> assumption that Airflow 3 will only have isolated tasks and no DB
> >> "option", I personally think using AIP-44 for that is a mistake.
> >> AIP-44 is merely a wrapper over existing DB calls, designed to be
> >> kept updated together with the DB code; the whole synchronisation
> >> of state, heartbeats, variables and connection access still uses
> >> the same "DB communication" model, and there is basically no way we
> >> can make it more scalable this way. We will still have the same
> >> limitations on the DB - a number of DB connections will simply be
> >> replaced with a number of gRPC connections. Essentially, more
> >> scalability and performance have never been the goal of AIP-44 -
> >> the assumption is that it only brings isolation and nothing more
> >> changes. So I think it does not address some of the fundamental
> >> problems stated in this "isolation" document.
> >>
> >> Essentially, AIP-44 merely exposes a small-ish number of methods
> >> (bigger than initially anticipated), but it only wraps around the
> >> existing DB mechanism. From the performance and scalability point
> >> of view, we do not get much more than we currently get with
> >> pgbouncer, which essentially turns a big number of connections
> >> coming from workers into a smaller number of pooled connections
> >> that pgbouncer manages internally and multiplexes the calls over.
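[Editorial note: for readers unfamiliar with the pooling model referenced here, a minimal pgbouncer configuration along these lines shows how a large number of worker connections is multiplexed over a small server-side pool. Hostnames and sizes are illustrative only, not taken from this thread.]

```ini
; Illustrative pgbouncer setup: many client connections share a small pool.
[databases]
; clients connecting to "airflow" are routed to the real metadata DB
; (example hostname, not a real deployment)
airflow = host=airflow-db.example.internal port=5432 dbname=airflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is handed back between transactions
pool_mode = transaction
; up to 1000 worker connections are funneled into 20 real DB connections
max_client_conn = 1000
default_pool_size = 20
```

This illustrates the multiplexing point only; as the email goes on to say, such a pooler cannot restrict *which* operations a worker performs, which is the key difference from a purpose-built API server.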
> >> With the difference that, unlike the AIP-44 Internal API server,
> >> pgbouncer does not limit the operations you can perform from the
> >> worker/triggerer/DAG file processor - that's the main difference
> >> between using pgbouncer and using our own Internal API server.
> >>
> >> IMHO - if we do not want to support DB access at all from workers,
> >> triggerers and DAG file processors, we should replace the current
> >> "DB"-bound interface with a new one specifically designed for this
> >> bi-directional direct communication Executor <-> Workers, more in
> >> line with what Jens described in AIP-69 (for example, WebSocket and
> >> asynchronous communication come immediately to my mind if I did not
> >> have to use the DB for that communication). This is also why I put
> >> AIP-67 on hold: IF we go in the direction of a "new" interface
> >> between worker, triggerer and DAG file processor, it might be way
> >> easier (and safer) to introduce multi-team in Airflow 3 rather than
> >> 2 (or we can implement it differently in Airflow 2 and in
> >> Airflow 3).
> >>
> >> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka
> >> <vik...@astronomer.io.invalid> wrote:
> >>
> >>> Fellow Airflowers,
> >>>
> >>> I am following up on some of the proposed changes in the Airflow 3
> >>> proposal
> >>> <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>,
> >>> where more information was requested by the community, specifically
> >>> around the injection of Task Execution Secrets. This topic has been
> >>> discussed at various times under a variety of names, but here is a
> >>> holistic proposal around the whole task context mechanism.
> >>>
> >>> This is not yet a full-fledged AIP, but is intended to facilitate a
> >>> structured discussion, which will then be followed up with a formal
> >>> AIP within the next two weeks.
> >>> I have included most of the text here, but please give detailed
> >>> feedback in the attached document
> >>> <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>,
> >>> so that we can have a contextual discussion around specific points
> >>> which may need more detail.
> >>>
> >>> ---
> >>> Motivation
> >>>
> >>> Historically, Airflow’s task execution context has been oriented
> >>> around local execution within a relatively trusted networking
> >>> cluster.
> >>>
> >>> This includes:
> >>>
> >>> - the interaction between the Executor and the process of launching
> >>> a task on Airflow Workers,
> >>> - the interaction between the Workers and the Airflow meta-database
> >>> for connection and environment information as part of initial task
> >>> startup,
> >>> - the interaction between the Airflow Workers and the rest of
> >>> Airflow for heartbeat information, and so on.
> >>>
> >>> This has been accomplished by colocating all of the Airflow task
> >>> execution code with the user task code in the same container and
> >>> process.
> >>>
> >>> For Airflow users at scale, i.e. those supporting multiple data
> >>> teams, this has posed many operational challenges:
> >>>
> >>> - Dependency conflicts for administrators supporting data teams
> >>> using different versions of providers, libraries, or Python
> >>> packages.
> >>> - Security challenges in running customer-defined code (task code
> >>> within the DAGs) for multiple customers within the same operating
> >>> environment and service accounts.
> >>> - Scalability of Airflow, since one of the core Airflow scalability
> >>> limitations has been the number of concurrent database connections
> >>> supported by the underlying database instance.
> >>> To alleviate this problem, we have consistently, as an Airflow
> >>> community, recommended the use of PgBouncer for connection pooling
> >>> as part of an Airflow deployment.
> >>> - Operational issues caused by unintentional reliance on internal
> >>> Airflow constructs within the DAG/Task code, which unexpectedly
> >>> shows up only as part of Airflow production operations, coincident
> >>> with, but not limited to, upgrades and migrations.
> >>> - Operational management based on the above for Airflow platform
> >>> teams at scale, because different data teams naturally operate at
> >>> different velocities. Attempting to support these different teams
> >>> within a common Airflow environment is unnecessarily challenging.
> >>>
> >>> The internal API to reduce the need for interaction between the
> >>> Airflow Workers and the meta-database is a big and necessary step
> >>> forward. However, it doesn’t fully address the above challenges.
> >>> The proposal below builds on the internal API proposal and goes
> >>> significantly further, to not only address the challenges above but
> >>> also enable the following key use cases:
> >>>
> >>> 1. Ensure that this interface reduces the interaction between the
> >>> code running within the Task and the rest of Airflow. This is to
> >>> address unintended ripple effects from core Airflow changes, which
> >>> have caused numerous Airflow upgrade issues because Task (i.e. DAG)
> >>> code relied on core Airflow abstractions. This has been a common
> >>> problem pointed out by numerous Airflow users, including early
> >>> adopters.
> >>> 2. Enable quick, performant execution of tasks on local, trusted
> >>> networks, without requiring the Airflow workers/tasks to connect
> >>> to the Airflow database to obtain all the information required for
> >>> task startup.
> >>> 3.
> >>> Enable remote execution of Airflow tasks across network
> >>> boundaries, by establishing a clean interface for Airflow workers
> >>> on remote networks to be able to connect back to a central Airflow
> >>> service and access all the information needed for task execution.
> >>> This is foundational work for remote execution.
> >>> 4. Enable a clean, language-agnostic interface for task execution,
> >>> with support for multiple language bindings, so that Airflow tasks
> >>> can be written in languages beyond Python.
> >>>
> >>> Proposal
> >>>
> >>> The proposal here has multiple parts, as detailed below.
> >>>
> >>> 1. Formally split out the Task Execution Interface as the Airflow
> >>> Task SDK (possibly named the Airflow SDK), which would be the only
> >>> interface to and from Airflow Task user code to the Airflow system
> >>> components, including the meta-database, Airflow Executor, etc.
> >>> 2. Disable all direct database interaction between the Airflow
> >>> Workers, including Tasks being run on those Airflow Workers, and
> >>> the Airflow meta-database.
> >>> 3. The Airflow Task SDK will include interfaces to:
> >>> - access needed Airflow Connections, Variables, and XCom values,
> >>> - report heartbeats,
> >>> - record logs,
> >>> - report metrics.
> >>> 4. The Airflow Task SDK will support a Push mechanism for speedy
> >>> local execution in trusted environments.
> >>> 5. The Airflow Task SDK will also support a Pull mechanism for
> >>> remote task execution environments to access information from an
> >>> Airflow instance across network boundaries.
> >>> 6. The Airflow Task SDK will be designed to support multiple
> >>> language bindings, with the first language binding of course being
> >>> Python.
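[Editorial note: as an illustration of the SDK surface sketched in the proposal above, a minimal Python mock-up follows. Every class and method name here is a hypothetical assumption for discussion, not the actual proposed API.]

```python
# Hypothetical sketch of a Task SDK surface, per the proposal's points 1-3:
# user task code talks only to this object, never to the metadata DB.
# All names are illustrative assumptions, not the real Airflow 3 interface.
from dataclasses import dataclass, field


@dataclass
class TaskContext:
    """Everything a task try needs at startup, supplied by Airflow."""
    dag_id: str
    task_id: str
    try_number: int
    connections: dict = field(default_factory=dict)
    variables: dict = field(default_factory=dict)


class TaskSDKClient:
    """Single interface between user task code and the Airflow system."""

    def __init__(self, context: TaskContext):
        self._ctx = context

    def get_connection(self, conn_id: str):
        # No direct DB access: values come from the pushed/pulled context
        return self._ctx.connections[conn_id]

    def get_variable(self, key: str):
        return self._ctx.variables[key]

    def heartbeat(self) -> dict:
        # A real implementation would report liveness to the API server;
        # here we just return the payload it might send.
        return {"task_id": self._ctx.task_id, "try": self._ctx.try_number}


ctx = TaskContext(
    "example_dag", "extract", 1,
    connections={"pg": "postgresql://example"},
    variables={"env": "prod"},
)
client = TaskSDKClient(ctx)
print(client.get_variable("env"))  # -> prod
```

In this sketch the same TaskContext could be pushed to the worker by the executor (point 4, trusted local execution) or pulled over the network from a central Airflow service (point 5, remote execution); the user code is unaware of which path supplied it.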
> >>> Assumption: The existing AIP for the Internal API covers the
> >>> interaction between the Airflow workers and the Airflow
> >>> meta-database for heartbeat information, persisting XComs, and so
> >>> on.
> >>>
> >>> --
> >>> Best regards,
> >>>
> >>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau