Yeah. This is a very good and important point, and one that I have been thinking about a lot recently as we are gearing up to finalize the AIP-44 POC work (still on me - I dropped the ball a bit on it due to the Summit and around-2.3.0 activities, but I am going to resume it very soon).
I agree with Ping that 'system admins' should have access to the DB directly. While this is an internal detail of Airflow, the DB is not a black box - we already expose a number of 'properties' to people who really know what they are doing (migrations, including automated generation of those), and for bigger installations, especially when you need to optimize for your specific case, improve an index, or simply groom your data, there will always be cases for that. But this should be done on the side, and it should have no more support from the Airflow software and community than documenting the schema and keeping track of the changes (and making these admins aware that backwards compatibility is not really guaranteed and that they have to be prepared for changes).

On the other hand, for users and 'tenant' admins (when we speak about future multi-tenancy), we should eventually stop exposing the DB directly and provide a stable and reliable bulk API. But this is a process - I'd say a marathon rather than a sprint. I think we definitely want to extend the API with bulk operations where we miss them.

As part of AIP-44 we might want to think about some mechanism that will make gradual migration possible for people who use such DB queries but would like to start using DB isolation mode. This is, I think, not something that we will solve upfront, as there might be many ways users would like to access the DB. I think of it more as a process - i.e. finding a way where, when DB isolation is on, the 'isolation' benefit is achieved, but on the other hand we do not require our users to take a 'big bang' approach and migrate all of their workload in one go. I thought that we need some mechanism that allows for exceptions once DB isolation mode is enabled - maybe even an allowlist of queries that you can enable, some way for admins to see which queries were blocked, and an easy way to enable them - maybe even in the Airflow UI.
I think in a number of cases no one in the whole organisation running Airflow will actually know what queries the different stakeholders and users in their organisation use. So a mechanism that detects what queries are run, builds an inventory of those, and allows making exceptions for them would be helpful to manage the transition. I thought about a mechanism similar in a way to some local firewall approaches, where you get notified that someone ran an unknown query and you can add an exception for it (likely there could be a graduation of 'allow but log', 'log and disable with exceptions', and eventually 'lock and disable all, no exceptions'). This would be easy for any code using Airflow's SQLAlchemy classes, and I think that might be good enough actually (direct DB queries would fall into the 'super admins' mode).

Then we could find a way to gather (possibly even semi-automatically) information from our users on what kind of bulk API they need, and either let them contribute it or have it added by the team. I think this should be part of designing the 'transition' process - i.e. we have to think about the 'transition scenarios' we give to users of Airflow who would like to eventually switch on DB isolation.

Part of it (and this is the difficult one I have no clear idea about yet) is how to make such exceptions annoying enough, and basically engineer the behaviour of the users who choose the transition path so that they are incentivised to report such missing APIs and switch to them when ready. It could be some performance penalty, or annoying notifications in the UI, or a planned deprecation period, or something similar, but we need to have such a mechanism, otherwise most users will be stuck in the 'exceptions' stage forever.
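To make the firewall-like idea concrete, here is a minimal sketch of what such a "query gate" could look like. Everything here (the `QueryGate` class, `Mode` names, the fingerprinting scheme) is hypothetical illustration, not an existing Airflow API - the real mechanism would hook into the SQLAlchemy session, but the graduation of modes and the inventory of seen queries would work along these lines:

```python
# Hypothetical sketch of a "query firewall" for DB isolation mode.
# QueryGate, Mode and fingerprint() are illustrative names, not Airflow code.
import hashlib
import re
from enum import Enum


class Mode(Enum):
    ALLOW_AND_LOG = 1          # 'allow but log': everything runs, unknowns recorded
    BLOCK_WITH_EXCEPTIONS = 2  # 'log and disable with exceptions': allowlist only
    BLOCK_ALL = 3              # 'lock and disable all': full isolation


def fingerprint(sql: str) -> str:
    """Normalize literals away so the same query *shape* maps to one entry."""
    normalized = sql.strip().lower()
    normalized = re.sub(r"'[^']*'", "?", normalized)  # string literals -> ?
    normalized = re.sub(r"\b\d+\b", "?", normalized)  # numeric literals -> ?
    return hashlib.sha1(normalized.encode()).hexdigest()[:12]


class QueryGate:
    def __init__(self, mode):
        self.mode = mode
        self.allowlist = set()   # fingerprints an admin has approved
        self.inventory = {}      # fingerprint -> times seen (for the UI/report)

    def check(self, sql):
        """Return True if the query may run; always record it in the inventory."""
        fp = fingerprint(sql)
        self.inventory[fp] = self.inventory.get(fp, 0) + 1
        if self.mode is Mode.ALLOW_AND_LOG:
            return True
        if self.mode is Mode.BLOCK_WITH_EXCEPTIONS:
            return fp in self.allowlist
        return False  # Mode.BLOCK_ALL


# Example: an unknown query is blocked until an admin adds an exception.
gate = QueryGate(Mode.BLOCK_WITH_EXCEPTIONS)
query = "SELECT * FROM task_instance WHERE state = 'queued'"
blocked_first = gate.check(query)        # False: not allow-listed yet
gate.allowlist.add(fingerprint(query))
allowed_after = gate.check(query)        # True: exception granted
```

The key design point is that the inventory keeps growing in every mode, so admins can see (e.g. in the UI) which query shapes were blocked and grant exceptions per shape rather than per literal query.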
I will share some thoughts on this in our multi-tenancy talk with Mateusz at the Airflow Summit (next week - Wednesday, last talk in the Bangalore session). Everyone is welcome to attend and ask questions - I am also happy to talk about it in person at the Warsaw and London events.

J.

On Fri, May 20, 2022, 05:55 Ping Zhang <[email protected]> wrote:

> Hi Rafal,
>
> In our production, we have a few bulk queries:
>
> 1. We have a cron job to query the replica of the airflow db for the number
> of tis in different states, the number of tasks in queued and running, and
> the number of tasks whose state changed to 'scheduled', 'running', or
> 'queued' in the last minute window (we have a task_instance trigger table
> to track all state changes of tis). It then emits metrics that we have
> alerts on.
>
> 2. We have another job to periodically trim historical data in the
> task_instance, log, dag_run, etc. tables if their state is failed/success.
> This prevents those tables from growing too big and having a performance
> impact.
>
> (When we do deletion, we carefully break down the query into smaller
> chunks to avoid locking the table for too long, which might cause issues.)
>
> 3. We have a script to easily reset failed tis in a given period of time,
> so that when there is an infra failure, we can reset the tasks that failed
> due to it.
>
> I hope this can help. As for the Airflow API part, I think it depends on
> the use cases. If the use cases are for infra teams who manage the airflow
> clusters, I am leaning towards not having an API. But if it is for other
> users, it is better to have an API around it, as we don't want them to
> have access to the prod db.
>
> Thanks,
>
> Ping
>
>
> On Thu, May 19, 2022 at 10:31 AM Rafal Biegacz
> <[email protected]> wrote:
>
>> Hi Airflow Community,
>>
>> Some heavy users of Airflow rely on the possibility to run bulk SQL
>> queries against the Airflow DB.
>>
>> My gut feeling is that one of the reasons they do that is to implement
>> some additional visualizations and monitoring for Airflow tasks and DAGs
>> and to sense the state of task scheduling.
>>
>> For example, I see that users do the following:
>>
>> - retrieving all tasks that are in the queued state
>> - deleting all the tasks that have been in the queued state for too long
>> - querying the DB to retrieve info about all DAGs that failed in a
>> specified period of time
>> - etc.
>>
>> Just wondering if we shouldn't implement support for some such queries
>> via the Airflow API, to eliminate the need to query the Airflow database
>> directly. This request might be even more important in the context of
>> AIP-44 Airflow Internal API
>> <https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-44+Airflow+Internal+API>
>> which tries to limit/control direct access to the Airflow Database.
>>
>> It would be great if you could share more info about such "bulk" queries
>> that you do/saw in practice, and it would be great to hear what you think
>> about implementing support for some of them in the Airflow API (stable or
>> internal).
>>
>> Regards, Rafal.
>>
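As a side note on Ping's point 2 (the chunked deletion to avoid long locks) - this is a pattern worth capturing if we ever ship a supported trimming API. The sketch below is my own illustration of the idea, not Ping's actual script; it uses sqlite so it is self-contained, and the table/column names just mirror Airflow's schema while the batch size and loop are assumptions:

```python
# Illustrative sketch of the chunked-deletion pattern: delete finished rows
# in small batches so no single DELETE holds table locks for long.
# Not Airflow code - a standalone demo against sqlite.
import sqlite3


def trim_finished_rows(conn, batch_size=1000):
    """Delete success/failed task_instance rows in batches; return count deleted."""
    total = 0
    while True:
        cur = conn.execute(
            """
            DELETE FROM task_instance
            WHERE rowid IN (
                SELECT rowid FROM task_instance
                WHERE state IN ('success', 'failed')
                LIMIT ?
            )
            """,
            (batch_size,),
        )
        conn.commit()  # commit between batches so locks are released
        total += cur.rowcount
        if cur.rowcount < batch_size:
            break  # last, partial batch - nothing more to trim
    return total
```

A supported bulk API could wrap exactly this kind of loop server-side, which would also let Airflow enforce sane batch sizes instead of every installation rediscovering the locking issue.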
