Re: Roadmap ideas for Airflow 2.2 and beyond

Jarek Potiuk Tue, 15 Jun 2021 05:45:30 -0700

I think these are the most important "big" things to solve "soon-ish",
and I would love to be part of that or even lead some of it:

1) The general "isolation" problem of Airflow components and workloads.
This is necessary for multi-tenancy, which is a highly requested
feature. The main list of things to address:
* the core scheduler can do anything, but its parser part should not
be able to access the DB, and DAG parsing should run in a sandbox.
* workers should not be able to talk to the DB directly. They should
have a very strict, small API to talk back to the scheduler.
* each task should be executed in its own sandbox, even if the same
worker is reused.
* (concurring with Elad) we should be able to define namespaces for
connections/variables/xcoms etc. (part of the worker/scheduler API).
* the webserver should never execute any user code (already solved, I believe).
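
To make the namespace bullet a bit more concrete, here is a minimal
sketch of a namespace-scoped lookup. Everything in it is an assumption
made for illustration: `ConnectionStore`, `NamespaceAccessError` and the
"hr"/"default" namespaces are hypothetical names, not existing Airflow
APIs, and a real implementation would sit behind the worker/scheduler
API rather than in an in-memory dict.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


class NamespaceAccessError(PermissionError):
    """Raised when a caller requests a connection outside its namespace."""


@dataclass
class ConnectionStore:
    """Hypothetical in-memory store, for illustration only (not an Airflow class)."""

    # Maps (namespace, conn_id) -> connection URI.
    _conns: Dict[Tuple[str, str], str] = field(default_factory=dict)

    def add(self, namespace: str, conn_id: str, uri: str) -> None:
        self._conns[(namespace, conn_id)] = uri

    def get(self, caller_namespace: str, conn_id: str) -> str:
        # A caller only ever sees connections registered in its own namespace.
        key = (caller_namespace, conn_id)
        if key not in self._conns:
            raise NamespaceAccessError(
                f"{conn_id!r} is not visible from namespace {caller_namespace!r}"
            )
        return self._conns[key]
```

With this shape, a DAG running in the "default" namespace that asks for
a connection registered under "hr" gets an error instead of the
credentials, even if it knows the connection id.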

2) Add `binary` caching and task affinity mechanisms to optimise
workload execution. This could be used for a number of things for
"temporary/local/artifact intensive" tasks:
* per-task virtualenvs (cached between executions)
* caching machine learning models to share between tasks run on the same machine
* optimising preparation of rarely changed binary data shared between
multiple machines
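
One possible way to decide when a cached per-task virtualenv can be
reused is to derive a cache key from its inputs. The sketch below is
only an illustration under that assumption: `venv_cache_key` is a
hypothetical helper, and it treats the requirement strings plus the
Python version as the only things that define the environment.

```python
import hashlib
from typing import Iterable


def venv_cache_key(requirements: Iterable[str], python_version: str) -> str:
    """Return a short, order-independent key for a virtualenv definition.

    Hypothetical helper: two tasks with the same requirements and Python
    version get the same key and could share one cached virtualenv.
    """
    # Strip whitespace, drop empty lines, and sort so ordering does not matter.
    normalized = sorted(r.strip() for r in requirements if r.strip())
    payload = "\n".join([python_version, *normalized])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

A worker could then look for a directory named after the key and build
the virtualenv only when it is missing; task affinity would try to send
tasks with the same key to the same machine.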

3) Implement support for read DB replicas/maybe even separate
schema/separate databases for various parts of the model.

That might further lift the (potential) scalability limits of
Airflow: likely a lot of Airflow's DB operations could be redirected
to a read replica, off-loading the transactional DB so that it can
handle more scheduling actions. I am not sure if this is a real
limitation now, so we might not need that.
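
As a rough sketch of what read/write routing could look like (an
assumption about one possible design, not how Airflow works today):
`DbRouter` is a hypothetical class, and the strings stand in for real
database engines or connection URLs.

```python
import itertools
from typing import Sequence


class DbRouter:
    """Hypothetical router: read-only work goes to replicas, the rest to the primary."""

    def __init__(self, primary: str, replicas: Sequence[str]) -> None:
        self.primary = primary
        # Round-robin over the replicas; None means "no replicas configured".
        self._replicas = itertools.cycle(replicas) if replicas else None

    def pick(self, read_only: bool) -> str:
        # Anything that writes (e.g. scheduling decisions) must use the primary.
        if read_only and self._replicas is not None:
            return next(self._replicas)
        return self.primary
```

Read-heavy callers such as UI queries would ask for `pick(read_only=True)`,
while the scheduler would keep using the primary. Replication lag is
ignored here and would have to be handled in a real design.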

4) Or maybe even add support for an Active-Active DB setup (though that
might not be needed after 3) gets implemented, and 3) seems like the
easier one).

5) Full async support for Airflow - part of it is coming in Deferred
Tasks, but it addresses only part of the problem. Many tasks/operators
in Airflow are async in nature (and they basically wait for the
external services to complete), so intra-task async support could be a
good improvement.
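
To illustrate what "intra-task async" could mean, here is a minimal
sketch using plain `asyncio`: a single task waits on several external
jobs concurrently instead of blocking on them one by one. The names
`wait_for_job` and `wait_for_all` are hypothetical, and the
`asyncio.sleep` call merely stands in for polling a real external
service.

```python
import asyncio
from typing import Dict, List


async def wait_for_job(job_id: str, polls_needed: int, interval: float = 0.01) -> str:
    """Pretend to poll an external service until the job completes."""
    for _ in range(polls_needed):
        # While this job sleeps, the event loop is free to poll the others.
        await asyncio.sleep(interval)
    return f"{job_id}:done"


async def wait_for_all(jobs: Dict[str, int]) -> List[str]:
    """Wait for all jobs concurrently; results keep the input order."""
    return await asyncio.gather(
        *(wait_for_job(job_id, polls) for job_id, polls in jobs.items())
    )
```

Calling `asyncio.run(wait_for_all({"a": 1, "b": 3}))` takes roughly as
long as the slowest job rather than the sum of all of them.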

6) Big Hairy Thing - get rid of the relational DB. This is rather very
long term, but if we do it, we could do it instead of 3) and 4). I am
thinking about a bolder move of modernising the "Heart" of Airflow and
replacing the relational DB with a more cloud-native approach. I don't
really know what could be a better replacement (one that also fits well
with on-premise cases), but using a classic relational DB feels a bit
out of place (just a bit, and I have no strong feelings about it).

On Tue, Jun 15, 2021 at 2:05 PM Elad Kalif <[email protected]> wrote:
>
> I find working with sensitive Connections in Airflow quite difficult. In 
> simple words - once a connection is defined anyone who knows about it can use 
> it. This is a problem when you work with sensitive data like HR or finance.
> The issue is not about storing the connection details securely but rather, 
> once defined, who can use it? How to prevent unauthorized users from 
> accessing it?
> I would love to see the concept of namespaces/areas in Airflow so that 
> Pools/Connections/Variables/Dags and user logins are all associated with a 
> specific namespace/area. Kinda similar to the namespace concept in K8s I 
> guess.
>
> For the moment we solved it by having two separate Airflow instances (one 
> regular and one for sensitive data) but this is very difficult to maintain.
>
> On Tue, Jun 15, 2021 at 12:54 PM Ash Berlin-Taylor <[email protected]> wrote:
>>
>> Hi everyone,
>>
>> As I'm sure many of you are aware I (along with Aizhamal) am giving the 
>> opening keynote at this year's Airflow summit, and I'm covering "what's next 
>> after 2.0" -- essentially what is the roadmap for Airflow for the next 12-18 
>> months.
>>
>> Since Airflow is a community project first and foremost I'd like to get all 
>> your ideas, no matter how off the wall :)
>>
>> I've got my own ideas, and 2.2 is fairly firm already (AIPs 39 and 40), but 
>> 2.3 and beyond starts to get less clear, so if you have something that you'd 
>> like to see Airflow be able to do or do better, now is the time to speak up.
>>
>> You don't have to have a solution, just "I find doing X 
>> hard/annoying/difficult" is enough.
>>
>> (And a general reminder: the roadmap is a statement of intent, not a promise 
>> of timeline or even that a feature will actually be implemented)
>>
>> To keep this thread manageable, please can we avoid discussions _in this 
>> thread_ about ideas and keep +1/me too's to a minimum.
>>
>> Cheers,
>> -ash

-- 
+48 660 796 129
