Thanks for your work on this, Ash. Great to see the proposal getting more substantial. Overall I'm positive about the general direction; I've left a few comments (and questions) on some of the sections.
On Mon, Jul 8, 2024 at 6:11 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process) so I don’t want to tie this AIP on to needing that.
>
> Fair.
>
> On Mon, Jul 8, 2024 at 6:08 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > My view is that async/websockets or not is (largely) an implementation detail and not something we have to worry about at this stage.
> >
> > By that I mean that the messages that flow back and forth between server and client are unchanged (whether it’s JSON/msgspec/gRPC, etc.). In the specific case of a task termination request, I’d say that the task can simply be cancelled the next time it heartbeats, as all task status changes are asynchronous in action by definition.
> >
> > I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process) so I don’t want to tie this AIP on to needing that.
> >
> > Ash
> >
> > > On 5 Jul 2024, at 17:31, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > I have a comment there - originating from Jens' question in the document - related to some basic setup of the API, specifically the async vs. sync approach. I have a feeling that the API for tasks would benefit a lot from using websockets and an async-first approach. Previously we've been doing heartbeating on our own, while websockets have a built-in capability of heartbeating open connections, and because websocket communication is bi-directional, it would allow for things like almost instantaneous killing of running tasks rather than waiting for heartbeats.
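To make the two models concrete, here is a minimal asyncio sketch (an editor's illustration, not code from the AIP) of the heartbeat-based cancellation Ash describes: the server can only act on a task the next time that task heartbeats. The two queues stand in for a connection; a websocket would make the channel genuinely bi-directional, so the server could push the cancel immediately instead of piggybacking it on a heartbeat reply.

```python
import asyncio

class Channel:
    """Duplex channel standing in for a websocket: one queue per direction."""
    def __init__(self):
        self.to_server = asyncio.Queue()
        self.to_worker = asyncio.Queue()

async def server(ch, cancel_after):
    """Accept heartbeats; after `cancel_after` beats, tell the worker to stop."""
    beats = 0
    while True:
        msg = await ch.to_server.get()
        if msg["type"] == "heartbeat":
            beats += 1
            # The cancel can only ride on a heartbeat reply in this model.
            reply = "cancel" if beats >= cancel_after else "ok"
            await ch.to_worker.put({"type": reply})
        elif msg["type"] == "terminated":
            return beats

async def worker(ch, interval=0.01):
    """Run a 'task', heartbeating until the server says cancel."""
    while True:
        await ch.to_server.put({"type": "heartbeat"})
        reply = await ch.to_worker.get()
        if reply["type"] == "cancel":
            await ch.to_server.put({"type": "terminated"})
            return "cancelled"
        await asyncio.sleep(interval)

async def main():
    ch = Channel()
    server_task = asyncio.create_task(server(ch, cancel_after=3))
    outcome = await worker(ch)
    beats = await server_task
    return outcome, beats

outcome, beats = asyncio.run(main())
print(outcome, beats)  # cancelled 3
```

The worst-case cancellation latency here is one heartbeat interval, which is the trade-off the thread is weighing against the operational cost of a broadcast/message-bus mechanism.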
> > > I'd say now is a good time to think about it - and maybe some of us have more experience with async APIs / websockets and can share it. But since we are moving to an HTTP interface for tasks, and async communication via websockets has been around for quite a while, has some undeniable advantages, and is supported by a number of frameworks (including FastAPI), this is possibly the best time to consider it.
> > >
> > > J.
> > >
> > > On Wed, Jul 3, 2024 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I’ve made some small changes to this AIP and I’m now happy with the state of it.
> > >>
> > >> First, a general point: I’ve tried not to over-specify the details on this one — for instance, how viewing in-progress logs will be handled is TBD, but we know the constraints and the final details can shake out during implementation.
> > >>
> > >> A summary of changes since the previous link:
> > >>
> > >> - Added a section on "Extend Executor interface" to tidy up the executor interface and move us away from the "run this command string" approach. I’ve named this new thing "Activity". (In the past we have thrown around the name "workload", but that is too close to "workflow", which is analogous to a DAG, so I’ve picked a different name.)
> > >> - Added an example of how Celery tasks might get context injected
> > >> - Noted that triggerers won’t be allowed direct DB access anymore either; they run user code, so they are all just workers
> > >> - Added some simple version/feature introspection to the API so that it’s easier to build forward/backwards compatibility into workers if needed.
> > >>
> > >> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=311626182&originalVersion=16&revisedVersion=20
> > >>
> > >> I’d like to start a vote on this soon, but given the 4th of July holiday in the US I suspect we will have a somewhat reduced presence, so I’ll give people until Tuesday 9th July to comment and will start the vote then if there are no major items outstanding.
> > >>
> > >> Thanks,
> > >> Ash
> > >>
> > >>> On 14 Jun 2024, at 19:55, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>
> > >>> First pass done - especially around the security aspects of it. Looks great.
> > >>>
> > >>> On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>>
> > >>>> I’ve written up a lot more of the implementation details into an AIP:
> > >>>> https://cwiki.apache.org/confluence/x/xgmTEg
> > >>>>
> > >>>> It’s still marked as Draft/Work In Progress for now, as there are a few details we know we need to cover before the doc is complete.
> > >>>>
> > >>>> (There was also some discussion in the dev call about a different name for this AIP.)
> > >>>>
> > >>>>> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>>>>
> > >>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers.
> > >>>>>
> > >>>>> That is exactly what I was thinking too (both that "no DB" should be the only option in v3, and that we need a bidirectional, purpose-designed interface), and I am working up the details.
> > >>>>>
> > >>>>> One of the key features of this will be giving each task try a "strong identity" that the API server can use to identify and trust the requests, likely some form of signed JWT.
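Ash's "strong identity" idea might look roughly like the following stdlib-only sketch (an editor's illustration; a real implementation would use a vetted JWT library and likely asymmetric keys, and every name below is an assumption, not taken from the AIP): the API server mints an HS256-signed token scoped to one specific task try, and verifies it on every request from the worker.

```python
import base64, hashlib, hmac, json, time

SECRET = b"api-server-signing-key"  # illustrative; held by the API server

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign_task_token(dag_id: str, task_id: str, try_number: int, ttl: int = 600) -> str:
    """Mint a JWT identifying one specific task try (the 'strong identity')."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "sub": f"{dag_id}/{task_id}/try-{try_number}",
        "exp": int(time.time()) + ttl,
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"
    )
    sig = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

def verify_task_token(token: str) -> dict:
    """Check signature and expiry; return the claims only if trusted."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(signing_input.split(".")[1]))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    return claims

token = sign_task_token("my_dag", "extract", try_number=2)
print(verify_task_token(token)["sub"])  # my_dag/extract/try-2
```

Because the identity names a single try, the API server can reject requests from stale or retried task instances without any DB round trip.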
> > >>>>>
> > >>>>> I just need to finish off some other work before I can move over to focus on Airflow fully.
> > >>>>>
> > >>>>> -a
> > >>>>>
> > >>>>> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>>>> I added some comments here, and I think there is one big thing that should be clarified when we get to "task isolation" - mainly its dependence on AIP-44.
> > >>>>>>
> > >>>>>> The Internal gRPC API (AIP-44) was only designed the way it was to allow the same codebase to be used with/without the DB. It's based on the assumption that a limited set of changes would be needed (that was underestimated) in order to support both DB and gRPC ways of communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we would want to keep both ways and maximum backwards compatibility (including the "pull" model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the "DB" as the way to communicate between those components, and this does not change with AIP-44.
> > >>>>>>
> > >>>>>> But for Airflow 3 the whole context changes. If we go with the assumption that Airflow 3 will only have isolated tasks and no "DB" option, I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code, and the whole synchronisation of state, heartbeats, variables and connection access still uses the same "DB communication" model, and there is basically no way we can make it more scalable this way.
> > >>>>>> We will still have the same limitations on the DB - a number of DB connections will simply be replaced with a number of gRPC connections. More scalability and performance have never been the goal of AIP-44 - the assumption all along has been that it only brings isolation, and nothing more changes. So I think it does not address some of the fundamental problems stated in this "isolation" document.
> > >>>>>>
> > >>>>>> Essentially AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated), but it only wraps around the existing DB mechanism. From the performance and scalability point of view, we do not get much more than we currently get when using PgBouncer, which essentially turns a big number of connections coming from workers into a smaller number of pooled connections that PgBouncer manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, PgBouncer does not limit the operations you can do from the worker/triggerer/DAG file processor - that's the main difference between using PgBouncer and using our own Internal API server.
> > >>>>>>
> > >>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (and, for example, WebSocket and asynchronous communication comes immediately to my mind if I did not have to use the DB for that communication).
> > >>>>>> This is also why I put AIP-67 on hold - because IF we go in that direction and have a "new" interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
> > >>>>>>
> > >>>>>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
> > >>>>>>
> > >>>>>>> Fellow Airflowers,
> > >>>>>>>
> > >>>>>>> I am following up on some of the proposed changes in the Airflow 3 proposal <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>, where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
> > >>>>>>>
> > >>>>>>> This is not yet a full-fledged AIP, but it is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>, so that we can have a contextual discussion around specific points which may need more detail.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>> Motivation
> > >>>>>>>
> > >>>>>>> Historically, Airflow’s task execution context has been oriented around local execution within a relatively trusted networking cluster.
> > >>>>>>>
> > >>>>>>> This includes:
> > >>>>>>>
> > >>>>>>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
> > >>>>>>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
> > >>>>>>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
> > >>>>>>>
> > >>>>>>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
> > >>>>>>>
> > >>>>>>> For Airflow users at scale, i.e. those supporting multiple data teams, this has posed many operational challenges:
> > >>>>>>>
> > >>>>>>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages
> > >>>>>>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts
> > >>>>>>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem, we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
> > >>>>>>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which unexpectedly shows up only during Airflow production operations, coincidentally with, but not limited to, upgrades and migrations.
> > >>>>>>> - Operational management based on the above for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
> > >>>>>>>
> > >>>>>>> The internal API to reduce the need for interaction between the Airflow Workers and the metadatabase is a big and necessary step forward. However, it doesn’t fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, to not only address the challenges above but also enable the following key use cases:
> > >>>>>>>
> > >>>>>>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
> > >>>>>>> 2.
> > >>>>>>> Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
> > >>>>>>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to be able to connect back to a central Airflow service to access all information needed for task execution. This is foundational work for remote execution.
> > >>>>>>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
> > >>>>>>>
> > >>>>>>> Proposal
> > >>>>>>>
> > >>>>>>> The proposal here has multiple parts, as detailed below.
> > >>>>>>>
> > >>>>>>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface to and from Airflow Task user code to the Airflow system components, including the meta-database, Airflow Executor, etc.
> > >>>>>>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
> > >>>>>>> 3.
> > >>>>>>> The Airflow Task SDK will include interfaces for:
> > >>>>>>> - Access to needed Airflow Connections, Variables, and XCom values
> > >>>>>>> - Reporting heartbeats
> > >>>>>>> - Recording logs
> > >>>>>>> - Reporting metrics
> > >>>>>>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
> > >>>>>>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance over network boundaries.
> > >>>>>>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
> > >>>>>>>
> > >>>>>>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
> > >>>>>>> --
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>>
> > >>>>>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
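As a closing illustration of the Task SDK interfaces enumerated in Vikram's proposal above (connections, variables, XComs, heartbeats, logs, metrics), one hypothetical Python shape could be the following. All names are invented for illustration; nothing here is taken from the AIP or from Airflow's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any

class TaskExecutionInterface(ABC):
    """Hypothetical Task SDK surface: the only channel between user task
    code and the rest of Airflow (no direct meta-database access)."""

    @abstractmethod
    def get_connection(self, conn_id: str) -> dict: ...
    @abstractmethod
    def get_variable(self, key: str) -> str: ...
    @abstractmethod
    def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...
    @abstractmethod
    def xcom_push(self, key: str, value: Any) -> None: ...
    @abstractmethod
    def heartbeat(self) -> None: ...
    @abstractmethod
    def log(self, line: str) -> None: ...
    @abstractmethod
    def emit_metric(self, name: str, value: float) -> None: ...

class InMemoryStub(TaskExecutionInterface):
    """Trivial in-process stand-in, e.g. for tests. A 'push' deployment
    would be handed these values up front; a 'pull' one would fetch them
    from the API server over the network on demand."""

    def __init__(self, connections=None, variables=None):
        self._connections = connections or {}
        self._variables = variables or {}
        self._xcoms = {}
        self.heartbeats = 0

    def get_connection(self, conn_id): return self._connections[conn_id]
    def get_variable(self, key): return self._variables[key]
    def xcom_pull(self, task_id, key="return_value"): return self._xcoms[key]  # stub ignores task_id
    def xcom_push(self, key, value): self._xcoms[key] = value
    def heartbeat(self): self.heartbeats += 1
    def log(self, line): print(line)
    def emit_metric(self, name, value): pass

sdk = InMemoryStub(variables={"env": "prod"})
sdk.heartbeat()
print(sdk.get_variable("env"))  # prod
```

Keeping the interface this narrow is what would allow non-Python language bindings later: each binding only has to implement these few calls against the same wire protocol.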