Thanks for your work on this, Ash. Great to see the proposal getting more substantial. Overall I'm positive about the general direction; I've left a few comments (and questions) on some of the sections.
On Mon, Jul 8, 2024 at 6:11 PM Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process) so I don’t want to tie this AIP on to needing that.
>
> Fair.
>
> On Mon, Jul 8, 2024 at 6:08 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> >
> > My view is that async/websockets or not is (largely) an implementation detail and not something we have to worry about at this stage.
> >
> > By that I mean that the messages that flow back and forth between server and client are unchanged (whether it’s JSON/msgspec/gRPC, etc.). In the specific case of a task termination request, I’d say that the task can simply be cancelled the next time it heartbeats, as all task status changes are asynchronous in action by definition.
> >
> > I’d love to support websockets/async but having a UI action “instantly” (or even near instantly) be sent to the task on the worker is a very difficult thing to achieve (you need some broadcast or message bus type process) so I don’t want to tie this AIP on to needing that.
> >
> > Ash
> >
> > > On 5 Jul 2024, at 17:31, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > I have a comment there - originating from Jens' question in the document - related to some basic setup of the API, specifically the async vs. sync approach. I have a feeling that the API for tasks would benefit a lot from using websockets and an async-first approach. Previously we've been doing heartbeating on our own, while websockets have a built-in capability of heartbeating open connections, and because websocket communication is bi-directional, it would allow for things like almost instantaneous killing of running tasks rather than waiting for heartbeats.
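To make the two models concrete, here is a minimal asyncio sketch (an editor's illustration, not code from the AIP) of the heartbeat-based cancellation Ash describes: the server can only act on a task the next time that task heartbeats. The two queues stand in for a connection; a websocket would make the channel genuinely bi-directional, so the server could push the cancel immediately instead of piggybacking it on a heartbeat reply.

```python
import asyncio

class Channel:
    """Duplex channel standing in for a websocket: one queue per direction."""
    def __init__(self):
        self.to_server = asyncio.Queue()
        self.to_worker = asyncio.Queue()

async def server(ch, cancel_after):
    """Accept heartbeats; after `cancel_after` beats, tell the worker to stop."""
    beats = 0
    while True:
        msg = await ch.to_server.get()
        if msg["type"] == "heartbeat":
            beats += 1
            # The cancel can only ride on a heartbeat reply in this model.
            reply = "cancel" if beats >= cancel_after else "ok"
            await ch.to_worker.put({"type": reply})
        elif msg["type"] == "terminated":
            return beats

async def worker(ch, interval=0.01):
    """Run a 'task', heartbeating until the server says cancel."""
    while True:
        await ch.to_server.put({"type": "heartbeat"})
        reply = await ch.to_worker.get()
        if reply["type"] == "cancel":
            await ch.to_server.put({"type": "terminated"})
            return "cancelled"
        await asyncio.sleep(interval)

async def main():
    ch = Channel()
    server_task = asyncio.create_task(server(ch, cancel_after=3))
    outcome = await worker(ch)
    beats = await server_task
    return outcome, beats

outcome, beats = asyncio.run(main())
print(outcome, beats)  # cancelled 3
```

The worst-case cancellation latency here is one heartbeat interval, which is the trade-off the thread is weighing against the operational cost of a broadcast/message-bus mechanism.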
> > > I'd say now is a good time to think about it - and maybe some of us have more experience with async APIs / websockets and can share it. But since we are moving to an HTTP interface for tasks, and async communication via websockets has been around for quite a while, has some undeniable advantages, and is supported by a number of frameworks (including FastAPI), this is possibly the best time to consider it.
> > >
> > > J.
> > >
> > > On Wed, Jul 3, 2024 at 3:39 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I’ve made some small changes to this AIP and I’m now happy with the state of it.
> > >>
> > >> First, a general point: I’ve tried not to over-specify the details on this one — for instance, how viewing in-progress logs will be handled is TBD, but we know the constraints and the final details can shake out during implementation.
> > >>
> > >> A summary of changes since the previous link:
> > >>
> > >> - Added a section on "Extend Executor interface" to tidy up the executor interface and move us away from the "run this command string" approach. I’ve named this new thing "Activity". (In the past we have thrown around the name "workload", but that is too close to "workflow", which is analogous to a DAG, so I’ve picked a different name.)
> > >> - Added an example of how Celery tasks might get context injected
> > >> - Noted that triggerers won’t be allowed direct DB access anymore either; they run user code, so they are all just workers
> > >> - Added some simple version/feature introspection to the API so that it’s easier to build forward/backwards compatibility into workers if needed.
> > >>
> > >> https://cwiki.apache.org/confluence/pages/diffpagesbyversion.action?pageId=311626182&originalVersion=16&revisedVersion=20
> > >>
> > >> I’d like to start a vote on this soon, but given the 4th of July holiday in the US I suspect we will have a somewhat reduced presence, so I’ll give people until Tuesday 9th July to comment and will start the vote then if there are no major items outstanding.
> > >>
> > >> Thanks,
> > >> Ash
> > >>
> > >>> On 14 Jun 2024, at 19:55, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>
> > >>> First pass done - especially around the security aspects of it. Looks great.
> > >>>
> > >>> On Fri, Jun 14, 2024 at 2:55 PM Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>>
> > >>>> I’ve written up a lot more of the implementation details into an AIP:
> > >>>> https://cwiki.apache.org/confluence/x/xgmTEg
> > >>>>
> > >>>> It’s still marked as Draft/Work In Progress for now, as there are a few details we know we need to cover before the doc is complete.
> > >>>>
> > >>>> (There was also some discussion in the dev call about a different name for this AIP.)
> > >>>>
> > >>>>> On 7 Jun 2024, at 19:25, Ash Berlin-Taylor <a...@apache.org> wrote:
> > >>>>>
> > >>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers.
> > >>>>>
> > >>>>> That is exactly what I was thinking too (both that "no DB" should be the only option in v3, and that we need a bidirectional, purpose-designed interface), and I am working up the details.
> > >>>>>
> > >>>>> One of the key features of this will be giving each task try a "strong identity" that the API server can use to identify and trust the requests, likely some form of signed JWT.
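Ash's "strong identity" idea might look roughly like the following stdlib-only sketch (an editor's illustration; a real implementation would use a vetted JWT library and likely asymmetric keys, and every name below is an assumption, not taken from the AIP): the API server mints an HS256-signed token scoped to one specific task try, and verifies it on every request from the worker.

```python
import base64, hashlib, hmac, json, time

SECRET = b"api-server-signing-key"  # illustrative; held by the API server

def b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def b64url_decode(s: str) -> bytes:
    return base64.urlsafe_b64decode(s + "=" * (-len(s) % 4))

def sign_task_token(dag_id: str, task_id: str, try_number: int, ttl: int = 600) -> str:
    """Mint a JWT identifying one specific task try (the 'strong identity')."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {
        "sub": f"{dag_id}/{task_id}/try-{try_number}",
        "exp": int(time.time()) + ttl,
    }
    signing_input = (
        f"{b64url(json.dumps(header).encode())}.{b64url(json.dumps(claims).encode())}"
    )
    sig = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

def verify_task_token(token: str) -> dict:
    """Check signature and expiry; return the claims only if trusted."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, signing_input.encode(), hashlib.sha256).digest()
    if not hmac.compare_digest(b64url(expected), sig):
        raise ValueError("bad signature")
    claims = json.loads(b64url_decode(signing_input.split(".")[1]))
    if claims["exp"] < time.time():
        raise ValueError("expired")
    return claims

token = sign_task_token("my_dag", "extract", try_number=2)
print(verify_task_token(token)["sub"])  # my_dag/extract/try-2
```

Because the identity names a single try, the API server can reject requests from stale or retried task instances without any DB round trip.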
> > >>>>>
> > >>>>> I just need to finish off some other work before I can move over to focus on Airflow fully.
> > >>>>>
> > >>>>> -a
> > >>>>>
> > >>>>> On 7 June 2024 18:01:56 BST, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>>>> I added some comments here, and I think there is one big thing that should be clarified when we get to "task isolation" - mainly its dependence on AIP-44.
> > >>>>>>
> > >>>>>> The Internal gRPC API (AIP-44) was only designed the way it was to allow the same codebase to be used with/without the DB. It's based on the assumption that a limited set of changes would be needed (that was underestimated) in order to support both DB and gRPC ways of communication between workers/triggerers/DAG file processors at the same time. That was a basic assumption for AIP-44 - that we would want to keep both ways and maximum backwards compatibility (including the "pull" model of the worker getting connections and variables, and updating task state in the Airflow DB). We are still using the "DB" as the way to communicate between those components, and this does not change with AIP-44.
> > >>>>>>
> > >>>>>> But for Airflow 3 the whole context changes. If we go with the assumption that Airflow 3 will only have isolated tasks and no "DB" option, I personally think using AIP-44 for that is a mistake. AIP-44 is merely a wrapper over existing DB calls, designed to be kept updated together with the DB code, and the whole synchronisation of state, heartbeats, variables and connection access still uses the same "DB communication" model, and there is basically no way we can make it more scalable this way.
> > >>>>>> We will still have the same limitations on the DB - a number of DB connections will simply be replaced with a number of gRPC connections. More scalability and performance have never been the goal of AIP-44 - the assumption all along has been that it only brings isolation, and nothing more changes. So I think it does not address some of the fundamental problems stated in this "isolation" document.
> > >>>>>>
> > >>>>>> Essentially AIP-44 merely exposes a small-ish number of methods (bigger than initially anticipated), but it only wraps around the existing DB mechanism. From the performance and scalability point of view, we do not get much more than we currently get when using PgBouncer, which essentially turns a big number of connections coming from workers into a smaller number of pooled connections that PgBouncer manages internally and multiplexes the calls over. The difference is that, unlike the AIP-44 Internal API server, PgBouncer does not limit the operations you can do from the worker/triggerer/DAG file processor - that's the main difference between using PgBouncer and using our own Internal API server.
> > >>>>>>
> > >>>>>> IMHO - if we do not want to support DB access at all from workers, triggerers and DAG file processors, we should replace the current "DB"-bound interface with a new one specifically designed for this bi-directional direct communication Executor <-> Workers, more in line with what Jens described in AIP-69 (and, for example, WebSocket and asynchronous communication comes immediately to my mind if I did not have to use the DB for that communication).
> > >>>>>> This is also why I put AIP-67 on hold - because IF we go in that direction and have a "new" interface between worker, triggerer and DAG file processor, it might be way easier (and safer) to introduce multi-team in Airflow 3 rather than 2 (or we can implement it differently in Airflow 2 and differently in Airflow 3).
> > >>>>>>
> > >>>>>> On Tue, Jun 4, 2024 at 3:58 PM Vikram Koka <vik...@astronomer.io.invalid> wrote:
> > >>>>>>
> > >>>>>>> Fellow Airflowers,
> > >>>>>>>
> > >>>>>>> I am following up on some of the proposed changes in the Airflow 3 proposal <https://docs.google.com/document/d/1MTr53101EISZaYidCUKcR6mRKshXGzW6DZFXGzetG3E/>, where more information was requested by the community, specifically around the injection of Task Execution Secrets. This topic has been discussed at various times under a variety of names, but here is a holistic proposal around the whole task context mechanism.
> > >>>>>>>
> > >>>>>>> This is not yet a full-fledged AIP, but it is intended to facilitate a structured discussion, which will then be followed up with a formal AIP within the next two weeks. I have included most of the text here, but please give detailed feedback in the attached document <https://docs.google.com/document/d/1BG8f4X2YdwNgHTtHoAyxA69SC_X0FFnn17PlzD65ljA/>, so that we can have a contextual discussion around specific points which may need more detail.
> > >>>>>>>
> > >>>>>>> ---
> > >>>>>>> Motivation
> > >>>>>>>
> > >>>>>>> Historically, Airflow’s task execution context has been oriented around local execution within a relatively trusted networking cluster.
> > >>>>>>>
> > >>>>>>> This includes:
> > >>>>>>>
> > >>>>>>> - the interaction between the Executor and the process of launching a task on Airflow Workers,
> > >>>>>>> - the interaction between the Workers and the Airflow meta-database for connection and environment information as part of initial task startup,
> > >>>>>>> - the interaction between the Airflow Workers and the rest of Airflow for heartbeat information, and so on.
> > >>>>>>>
> > >>>>>>> This has been accomplished by colocating all of the Airflow task execution code with the user task code in the same container and process.
> > >>>>>>>
> > >>>>>>> For Airflow users at scale, i.e. those supporting multiple data teams, this has posed many operational challenges:
> > >>>>>>>
> > >>>>>>> - Dependency conflicts for administrators supporting data teams using different versions of providers, libraries, or Python packages
> > >>>>>>> - Security challenges in running customer-defined code (task code within the DAGs) for multiple customers within the same operating environment and service accounts
> > >>>>>>> - Scalability of Airflow, since one of the core Airflow scalability limitations has been the number of concurrent database connections supported by the underlying database instance. To alleviate this problem, we have consistently, as an Airflow community, recommended the use of PgBouncer for connection pooling as part of an Airflow deployment.
> > >>>>>>> - Operational issues caused by unintentional reliance on internal Airflow constructs within the DAG/Task code, which unexpectedly shows up only during Airflow production operations, coincidentally with, but not limited to, upgrades and migrations.
> > >>>>>>> - Operational management based on the above for Airflow platform teams at scale, because different data teams naturally operate at different velocities. Attempting to support these different teams with a common Airflow environment is unnecessarily challenging.
> > >>>>>>>
> > >>>>>>> The internal API to reduce the need for interaction between the Airflow Workers and the metadatabase is a big and necessary step forward. However, it doesn’t fully address the above challenges. The proposal below builds on the internal API proposal and goes significantly further, to not only address the challenges above but also enable the following key use cases:
> > >>>>>>>
> > >>>>>>> 1. Ensure that this interface reduces the interaction between the code running within the Task and the rest of Airflow. This is to address unintended ripple effects from core Airflow changes, which have caused numerous Airflow upgrade issues because Task (i.e. DAG) code relied on core Airflow abstractions. This has been a common problem pointed out by numerous Airflow users, including early adopters.
> > >>>>>>> 2.
> > >>>>>>> Enable quick, performant execution of tasks on local, trusted networks, without requiring the Airflow workers / tasks to connect to the Airflow database to obtain all the information required for task startup.
> > >>>>>>> 3. Enable remote execution of Airflow tasks across network boundaries, by establishing a clean interface for Airflow workers on remote networks to be able to connect back to a central Airflow service to access all information needed for task execution. This is foundational work for remote execution.
> > >>>>>>> 4. Enable a clean, language-agnostic interface for task execution, with support for multiple language bindings, so that Airflow tasks can be written in languages beyond Python.
> > >>>>>>>
> > >>>>>>> Proposal
> > >>>>>>>
> > >>>>>>> The proposal here has multiple parts, as detailed below.
> > >>>>>>>
> > >>>>>>> 1. Formally split out the Task Execution Interface as the Airflow Task SDK (possibly named the Airflow SDK), which would be the only interface to and from Airflow Task user code to the Airflow system components, including the meta-database, Airflow Executor, etc.
> > >>>>>>> 2. Disable all direct database interaction between the Airflow Workers (including Tasks being run on those Airflow Workers) and the Airflow meta-database.
> > >>>>>>> 3.
> > >>>>>>> The Airflow Task SDK will include interfaces for:
> > >>>>>>> - Access to needed Airflow Connections, Variables, and XCom values
> > >>>>>>> - Reporting heartbeats
> > >>>>>>> - Recording logs
> > >>>>>>> - Reporting metrics
> > >>>>>>> 4. The Airflow Task SDK will support a Push mechanism for speedy local execution in trusted environments.
> > >>>>>>> 5. The Airflow Task SDK will also support a Pull mechanism for remote Task execution environments to access information from an Airflow instance over network boundaries.
> > >>>>>>> 6. The Airflow Task SDK will be designed to support multiple language bindings, with the first language binding of course being Python.
> > >>>>>>>
> > >>>>>>> Assumption: The existing AIP for the Internal API covers the interaction between the Airflow workers and the Airflow metadatabase for heartbeat information, persisting XComs, and so on.
> > >>>>>>> --
> > >>>>>>>
> > >>>>>>> Best regards,
> > >>>>>>>
> > >>>>>>> Vikram Koka, Ash Berlin-Taylor, Kaxil Naik, and Constance Martineau

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@airflow.apache.org
For additional commands, e-mail: dev-h...@airflow.apache.org
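As a closing illustration of the Task SDK interfaces enumerated in Vikram's proposal above (connections, variables, XComs, heartbeats, logs, metrics), one hypothetical Python shape could be the following. All names are invented for illustration; nothing here is taken from the AIP or from Airflow's actual code.

```python
from abc import ABC, abstractmethod
from typing import Any

class TaskExecutionInterface(ABC):
    """Hypothetical Task SDK surface: the only channel between user task
    code and the rest of Airflow (no direct meta-database access)."""

    @abstractmethod
    def get_connection(self, conn_id: str) -> dict: ...
    @abstractmethod
    def get_variable(self, key: str) -> str: ...
    @abstractmethod
    def xcom_pull(self, task_id: str, key: str = "return_value") -> Any: ...
    @abstractmethod
    def xcom_push(self, key: str, value: Any) -> None: ...
    @abstractmethod
    def heartbeat(self) -> None: ...
    @abstractmethod
    def log(self, line: str) -> None: ...
    @abstractmethod
    def emit_metric(self, name: str, value: float) -> None: ...

class InMemoryStub(TaskExecutionInterface):
    """Trivial in-process stand-in, e.g. for tests. A 'push' deployment
    would be handed these values up front; a 'pull' one would fetch them
    from the API server over the network on demand."""

    def __init__(self, connections=None, variables=None):
        self._connections = connections or {}
        self._variables = variables or {}
        self._xcoms = {}
        self.heartbeats = 0

    def get_connection(self, conn_id): return self._connections[conn_id]
    def get_variable(self, key): return self._variables[key]
    def xcom_pull(self, task_id, key="return_value"): return self._xcoms[key]  # stub ignores task_id
    def xcom_push(self, key, value): self._xcoms[key] = value
    def heartbeat(self): self.heartbeats += 1
    def log(self, line): print(line)
    def emit_metric(self, name, value): pass

sdk = InMemoryStub(variables={"env": "prod"})
sdk.heartbeat()
print(sdk.get_variable("env"))  # prod
```

Keeping the interface this narrow is what would allow non-Python language bindings later: each binding only has to implement these few calls against the same wire protocol.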