I second Jarek's point on making Airflow's components and workloads more isolated.
Another big pain point for us is cross-DAG scheduling and dependency management.

Thanks,
QP Hou

On Tue, Jun 15, 2021 at 5:45 AM Jarek Potiuk <[email protected]> wrote:
>
> I think those are the most important "big" things to solve "soon-ish"
> and I would love to be part of it or even lead some of that:
>
> 1) General "isolation" problem of Airflow components and workloads.
> This is necessary for multi-tenancy which is a highly requested
> feature. The main list of things to address:
> * core scheduler can do anything but the parser part of it should not
>   be able to access DB and parsing DAGs should run in a sandbox.
> * workers should not be able to talk to the DB directly. They should
>   have a very strict small API to talk back to the scheduler.
> * each task should be executed in its own sandbox even if the same
>   worker is reused
> * (concurring with Elad) we should be able to define namespaces for
>   connections/variables/xcoms etc. (part of the worker/scheduler API)
> * the webserver should never execute any user code (already solved I believe)
>
> 2) Add `binary` caching and task affinity mechanisms to optimise
> workload execution. This could be used for a number of things for
> "temporary/local/artifact intensive" tasks
> * per-task virtualenvs (cached between executions)
> * caching machine learning models to share between tasks run on the same
>   machine
> * optimising preparation of rarely changed binary data shared between
>   multiple machines
>
> 3) Implement support for read DB replicas/maybe even separate
> schema/separate databases for various parts of the model.
>
> That might further lift the (potential) limits of scalability of
> Airflow, where likely a lot of DB operations for Airflow could be
> redirected to a read replica and off-load the transactional DB for
> more scheduling actions. Not sure if this is a real limitation now, so
> we might not need that.
>
> 4) Or maybe even add support for Active-Active DB setup (though that
> might not be needed after 3) gets implemented, and 3) seems like the
> easier one).
>
> 5) Full async support for Airflow - part of it is coming in Deferred
> Tasks, but it addresses only part of the problem. Many tasks/operators
> in Airflow are async in nature (and they basically wait for the
> external services to complete), so intra-task async support could be a
> good improvement.
>
> 6) Big Hairy Thing - get rid of relational DB. Rather very long term
> but if we do that, we could do it instead of 3,4. I think about a
> bolder move of modernising the "Heart" of Airflow and replacing the
> Relational DB with a more cloud-native approach. I don't really know
> what could be a better replacement (to also fit well with on-premise
> cases), but using classic relational DB feels a bit out of place (just
> a bit and I have no strong feelings about it).
>
>
> On Tue, Jun 15, 2021 at 2:05 PM Elad Kalif <[email protected]> wrote:
> >
> > I find working with sensitive Connections in Airflow quite difficult. In
> > simple words - once a connection is defined anyone who knows about it can
> > use it. This is a problem when you work with sensitive data like HR or
> > finance.
> > The issue is not about storing the connection details securely but rather
> > once defined - who can use it? How to prevent unauthorized users from
> > accessing it?
> > I would love to see the concept of namespace/areas in Airflow so
> > Pools/Connections/Variables/Dags and user login are all associated with a
> > specific namespace/areas. Kinda similar to the namespace concept in K8s I
> > guess.
> >
> > For the moment we solved it by having two separate Airflow instances (one
> > regular and one for sensitive data) but this is very difficult to maintain.
> >
> > On Tue, Jun 15, 2021 at 12:54 PM Ash Berlin-Taylor <[email protected]> wrote:
> >>
> >> Hi everyone,
> >>
> >> As I'm sure many of you are aware I (along with Aizhamal) am giving the
> >> opening keynote at this year's Airflow summit, and I'm covering "what's
> >> next after 2.0" -- essentially what is the roadmap for Airflow for the
> >> next 12-18 months.
> >>
> >> Since Airflow is a community project first and foremost I'd like to get
> >> all your ideas, no matter how off the wall :)
> >>
> >> I've got my own ideas, and 2.2 is fairly firm already (AIPs 39 and 40),
> >> but 2.3 and beyond start to get less clear, so if you have something that
> >> you'd like to see Airflow be able to do or do better, now is the time to
> >> speak up.
> >>
> >> You don't have to have a solution, just "I find doing X
> >> hard/annoying/difficult" is enough.
> >>
> >> (And a general reminder: the roadmap is a statement of intent, not a
> >> promise of timeline or even that a feature will actually be implemented)
> >>
> >> To keep this thread manageable, please can we avoid discussions _in this
> >> thread_ about ideas and keep +1/me too's to a minimum.
> >>
> >> Cheers,
> >> -ash
> >
>
> --
> +48 660 796 129
