Re: Roadmap ideas for Airflow 2.2 and beyond

Tomasz Urbaszek Tue, 15 Jun 2021 10:27:01 -0700

+1 to Rob's suggestion.

I would also add an "on delay" task callback that would work better than the
current SLA mechanism. For example, I have a task scheduled daily that
should not take more than 5 minutes (but I don't want to set a timeout),
and I would like to get an email if the task lasts longer.
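
A rough sketch of the "on delay" idea in plain Python, to make the intended
semantics concrete. This is not an existing Airflow API: DelayWatch,
expected_duration and on_delay are made-up names, and the sketch assumes
something (e.g. the scheduler loop) periodically calls check() while the task
is running. The point is that the callback fires once when the task overruns,
and the task itself is never killed, unlike a timeout.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Callable, Optional


@dataclass
class DelayWatch:
    """Hypothetical helper: notify once when a running task overruns."""

    task_id: str
    expected_duration: timedelta
    # Called with (task_id, elapsed) the first time the task is found late.
    on_delay: Callable[[str, timedelta], None]
    started_at: Optional[datetime] = None
    notified: bool = field(default=False, init=False)

    def start(self, now: datetime) -> None:
        """Record the task start and reset the notification flag."""
        self.started_at = now
        self.notified = False

    def check(self, now: datetime) -> bool:
        """Return True if the callback fired on this call."""
        if self.started_at is None or self.notified:
            return False
        elapsed = now - self.started_at
        if elapsed > self.expected_duration:
            self.notified = True
            self.on_delay(self.task_id, elapsed)
            return True
        return False


# Usage: a daily task that should finish within 5 minutes.
alerts = []
watch = DelayWatch(
    task_id="daily_report",
    expected_duration=timedelta(minutes=5),
    # A real callback would send an email; here we only record the event.
    on_delay=lambda task_id, elapsed: alerts.append((task_id, elapsed)),
)
t0 = datetime(2021, 6, 15, 0, 0)
watch.start(t0)
watch.check(t0 + timedelta(minutes=3))  # still within the expected duration
watch.check(t0 + timedelta(minutes=6))  # overrun: callback fires
watch.check(t0 + timedelta(minutes=9))  # already notified: no second alert
```

The callback only observes the overrun, so the task keeps running to
completion; that is the difference from setting an execution timeout.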


Tomek

On Tue, 15 Jun 2021 at 16:51, Rob Deeb <[email protected]> wrote:

> One really big feature I’ve been needing recently is a set of DAG
> lifecycle callbacks.
>
> Currently we have a few that exist at task level for on_failure,
> on_success, etc, but this can be extended to a full set at the DAG level
> too based on state, including before and after transaction callbacks. These
> would look something like
>
> dag_will_begin
> dag_did_begin
> dag_will_retry
> dag_did_retry (or retry as a flag in begin)
> dag_will_fail
> dag_did_fail
> dag_will_succeed
> dag_did_succeed
>
> These could be created in some way along the lines of using SQLAlchemy
> events.
>
> Having these available to objects like custom backends is already proving
> to have great value. We want different behavior to occur based on what
> state the DAG is in and this seems like one way to get at that.
>
> :)
>
> - Rob
>
> > On Jun 15, 2021, at 8:45 AM, Jarek <[email protected]> wrote:
> >
> > I think those are the most important "big" things to solve "soon-ish"
> > and I would love to be part of it or even lead some of that:
> >
> > 1)  General "isolation" problem of Airflow components and workloads.
> > This is necessary for multi-tenancy which is a highly requested
> > feature. The main list of things to address:
> > * core scheduler can do anything but the parser part of it should not
> > be able to access DB and parsing DAGs should run in a sandbox.
> > * workers should not be able to talk to the DB directly. They should
> > have a very strict small API to talk back to the scheduler.
> > * each task should be executed in its own sandbox even if the same
> > worker is reused
> > * (concurring with Elad) we should be able to define namespaces for
> > connections/variables/xcoms etc. (part of the worker/scheduler API)
> > * the webserver should never execute any user code (already solved I
> > believe)
> >
> > 2) Add `binary` caching and task affinity mechanisms to optimise
> > workload execution. This could be used for a number of things for
> > "temporary/local/artifact intensive" tasks
> > * per-task virtualenvs (cached between executions)
> > * caching machine learning models to share between tasks run on the same
> > machine
> > * optimising preparation of rarely changed binary data shared between
> > multiple machines
> >
> > 3) Implement support for read DB replicas/maybe even separate
> > schema/separate databases for various parts of the model.
> >
> > That might further lift the (potential) limits of scalability of
> > Airflow, where likely a lot of DB operations for Airflow could be
> > redirected to a read replica and off-load the transactional DB for
> > more scheduling actions. Not sure if this is a real limitation now, so
> > we might not need that.
> >
> > 4) Or maybe even add support for Active-Active DB setup (though that
> > might not be needed after 3) gets implemented, and 3) seems like an
> > easier one).
> >
> > 5) Full async support for Airflow - part of it is coming in Deferred
> > Tasks, but it addresses only part of the problem. Many tasks/operators
> > in Airflow are async in nature (and they basically wait for the
> > external services to complete), so intra-task async support could be a
> > good improvement.
> >
> > 6) Big Hairy Thing - get rid of relational DB. Rather very long term
> > but if we do that, we could do it instead of 3,4. I think about a
> > bolder move of modernising the "Heart" of Airflow and replacing the
> > Relational DB with a more cloud-native approach. I don't really know
> > what could be a better replacement (to also fit well with on-premise
> > cases), but using classic relational DB feels a bit out of place (just
> > a bit and I have no strong feelings about it).
> >
> >
> >
> >> On Tue, Jun 15, 2021 at 2:05 PM Elad Kalif <[email protected]> wrote:
> >>
> >> I find working with sensitive Connections in Airflow quite difficult.
> >> In simple words - once a connection is defined anyone who knows about it
> >> can use it. This is a problem when you work with sensitive data like HR or
> >> finance.
> >> The issue is not about storing the connection details securely but
> >> rather, once defined, who can use it? How do we prevent unauthorized
> >> users from accessing it?
> >> I would love to see the concept of namespaces/areas in Airflow so
> >> Pools/Connections/Variables/Dags and user logins are all associated with a
> >> specific namespace/area. Kinda similar to the namespace concept in K8s I
> >> guess.
> >>
> >> For the moment we solved it by having two separate Airflow instances
> >> (one regular and one for sensitive data) but this is very difficult to
> >> maintain.
> >>
> >>> On Tue, Jun 15, 2021 at 12:54 PM Ash Berlin-Taylor <[email protected]>
> >>> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> As I'm sure many of you are aware, I (along with Aizhamal) am giving
> >>> the opening keynote at this year's Airflow Summit, and I'm covering "what's
> >>> next after 2.0" -- essentially what the roadmap for Airflow is for the next
> >>> 12-18 months.
> >>>
> >>> Since Airflow is a community project first and foremost, I'd like to
> >>> get all your ideas, no matter how off the wall :)
> >>>
> >>> I've got my own ideas, and 2.2 is fairly firm already (AIPs 39 and
> >>> 40), but 2.3 and beyond starts to get less clear, so if you have something
> >>> that you'd like to see Airflow be able to do or do better, now is the time
> >>> to speak up.
> >>>
> >>> You don't have to have a solution, just "I find doing X
> >>> hard/annoying/difficult" is enough.
> >>>
> >>> (And a general reminder: the roadmap is a statement of intent, not a
> >>> promise of timeline or even that a feature will actually be implemented)
> >>>
> >>> To keep this thread manageable, please can we avoid discussions _in
> >>> this thread_ about ideas and keep +1/me too's to a minimum.
> >>>
> >>> Cheers,
> >>> -ash
> >
> >
> >
> > --
> > +48 660 796 129
>
