I think those are the most important "big" things to solve "soon-ish" and I would love to be part of it or even lead some of that:
1) General "isolation" problem of Airflow components and workloads. This is
   necessary for multi-tenancy, which is a highly requested feature. The main
   list of things to address:
   * the core scheduler can do anything, but the parser part of it should not
     be able to access the DB, and parsing DAGs should run in a sandbox
   * workers should not be able to talk to the DB directly. They should have
     a very strict, small API to talk back to the scheduler
   * each task should be executed in its own sandbox, even if the same worker
     is reused
   * (concurring with Elad) we should be able to define namespaces for
     connections/variables/xcoms etc. (part of the worker/scheduler API)
   * the webserver should never execute any user code (already solved, I
     believe)
2) Add `binary` caching and task affinity mechanisms to optimise workload
   execution. This could be used for a number of things for
   "temporary/local/artifact intensive" tasks:
   * per-task virtualenvs (cached between executions)
   * caching machine learning models to share between tasks run on the same
     machine
   * optimising preparation of rarely changed binary data shared between
     multiple machines
3) Implement support for read DB replicas/maybe even separate schema/separate
   databases for various parts of the model. That might further lift the
   (potential) limits of scalability of Airflow, where likely a lot of DB
   operations for Airflow could be redirected to a read replica, offloading
   the transactional DB so it can handle more scheduling actions. Not sure if
   this is a real limitation now, so we might not need that.
4) Or maybe even add support for an Active-Active DB setup (though that might
   not be needed after 3) gets implemented, and 3) seems like an easier one).
5) Full async support for Airflow - part of it is coming in Deferred Tasks,
   but it addresses only part of the problem. Many tasks/operators in Airflow
   are async in nature (they basically wait for external services to
   complete), so intra-task async support could be a good improvement.
6) Big Hairy Thing - get rid of the relational DB. This is rather long term,
   but if we do it, we could do it instead of 3) and 4). I am thinking about
   a bolder move: modernising the "Heart" of Airflow and replacing the
   relational DB with a more cloud-native approach. I don't really know what
   could be a better replacement (to also fit well with on-premise cases),
   but using a classic relational DB feels a bit out of place (just a bit,
   and I have no strong feelings about it).

On Tue, Jun 15, 2021 at 2:05 PM Elad Kalif <[email protected]> wrote:
>
> I find working with sensitive Connections in Airflow quite difficult. In
> simple words - once a connection is defined anyone who knows about it can use
> it. This is a problem when you work with sensitive data like HR or finance.
> The issue is not about storing the connection details securely but rather
> once defined - who can use it? how to prevent from unauthorized users to
> access it?
> I would love to see the concept of namespace/areas in Airflow so
> Pools/Connections/Variables/Dags and user login are all associated to a
> specific namespace/areas. Kinda similar to the namespace concept in K8s I
> guess.
>
> For the moment we solved it by having two separate Airflow instances (one
> regular and one for sensitive data) but this is very difficult to maintain.
>
> On Tue, Jun 15, 2021 at 12:54 PM Ash Berlin-Taylor <[email protected]> wrote:
>>
>> Hi everyone,
>>
>> As I'm sure many of you are aware I (along with Aizhamal) am giving the
>> opening keynote at this year's Airflow summit, and I'm covering "what's next
>> after 2.0" -- essentially what is the roadmap for Airflow for the next 12-18
>> months.
>>
>> Since Airflow is a community project first and foremost I'd like to get all
>> your ideas, no matter how off the wall :)
>>
>> I've got my own ideas, and 2.2 is fairly firm already (AIPs 39 and 40), but
>> 2.3 and beyond starts to get less clear, so if you have something that you'd
>> like to see Airflow be able to do or do better, now is the time to speak up.
>>
>> You don't have to have a solution, just "I find doing X
>> hard/annoying/difficult" is enough.
>>
>> (And a general reminder: the roadmap is a statement of intent, not a promise
>> of timeline or even that a feature will actually be implemented)
>>
>> To keep this thread manageable, please can we avoid discussions _in this
>> thread_ about ideas and keep +1/me too's to a minimum.
>>
>> Cheers,
>> -ash

--
+48 660 796 129
