> Yes, the API could do this with 5 times more code, including handling the per-response limits where you need to loop over all pages until you have a full list (e.g. the API is limited to 100 results). Not impossible, but a lot of re-implementation.
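For reference, a minimal sketch of that pagination loop against the stable REST API (the base URL, API version and auth below are assumptions and will differ per deployment):

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption: adjust host/version for your deployment
    AUTH = ("admin", "admin")                   # assumption: replace with your API server's auth

    def list_all_dag_runs(dag_id: str, page_size: int = 100) -> list[dict]:
        """Collect every DAG run of a DAG by following limit/offset pagination."""
        runs, offset = [], 0
        while True:
            resp = requests.get(
                f"{BASE_URL}/dags/{dag_id}/dagRuns",
                params={"limit": page_size, "offset": offset},
                auth=AUTH,
            )
            resp.raise_for_status()
            payload = resp.json()
            runs.extend(payload["dag_runs"])
            offset += page_size
            if offset >= payload["total_entries"]:
                return runs

The generated apache-airflow-client wraps the same endpoints; the raw endpoint just makes the limit/offset loop explicit.
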
Just wondering, why not vanilla task mapping?

> Might be something that could be a potential contribution to "airflow db clean"

Maybe, yes.

Thanks & Regards,
Amogh Desai

On Thu, Nov 6, 2025 at 12:53 PM Amogh Desai <[email protected]> wrote:

> > I think our efforts should be way more focused on adding some missing API calls in Task SDK that our users miss, rather than in allowing them to use "old ways". Every time someone says "I cannot migrate because I did this", our first thought should be:
> >
> > * is it a valid way?
> > * is it acceptable to have an API call for it in SDK?
> > * should we do it?
>
> That is currently a grey zone we need to define better, I think. Certain use cases might be general enough that we need an execution API endpoint for that, and we can certainly do that. But there will also be cases when the use case is niche and we will NOT want to have execution API endpoints for that, for various reasons. The harder problem to solve is the latter.
>
> But you make a fair point here.
>
> Thanks & Regards,
> Amogh Desai
>
> On Thu, Nov 6, 2025 at 2:33 AM Jens Scheffler <[email protected]> wrote:
>
>> > Thanks for your comments too, Jens.
>> >
>> >> * Aggregate status of tasks in the upstream of same Dag (pass, fail, listing)
>> >
>> > Does the DAG run page not show that?
>>
>> Partly yes, but in our environment it is a bit more complex than "pass/fail". It is a bit more involved: we want to know more details of the failed tasks and aggregate those details. So, high level: get the XCom from the failed tasks and then aggregate the details. Imagine all tasks have an owner and we want to send a notification to each owner, but if 10 tasks from one owner fail we want to send 1 notification with the 10 failures in the text. And, yes, it can be done via the API.
>>
>> >> * Custom mass-triggering of other dags and collection of results from triggered dags as scale-out option for dynamic task mapping
>> >
>> > Can't an API do that?
>>
>> Yes, the API could do this with 5 times more code, including handling the per-response limits where you need to loop over all pages until you have a full list (e.g. the API is limited to 100 results). Not impossible, but a lot of re-implementation.
>>
>> >> * And the famous: Partial database clean on a per Dag level with different retention
>> >
>> > Can you elaborate this one a bit :D
>>
>> Yes. We have some Dag that is called 50k-100k times per day and others that are called 12 times a day. And a lot of others in-between, like 25k runs per month. The Dag with 100k runs per day we want to archive ASAP, probably after 3 days for all non-failed runs, to reduce DB overhead. The failed ones we keep for 14 days for potential re-processing if there was an outage.
>>
>> Most other Dag Runs we keep for a month. And for some we cap it: we archive if there are more than 25k runs.
>>
>> Might be something that could be a potential contribution to "airflow db clean".
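Until something like that exists in "airflow db clean", the API-only version of per-DAG retention looks roughly like the sketch below: it only deletes (no archiving), and it reuses the list_all_dag_runs helper and the BASE_URL/AUTH assumptions from the earlier sketch; dag_run_id, state and end_date are fields of the stable REST API's DAG run response.

    from datetime import datetime, timedelta, timezone

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption, as in the earlier sketch
    AUTH = ("admin", "admin")                   # assumption, as in the earlier sketch

    def delete_old_dag_runs(dag_id: str, keep_days: int = 3, keep_failed_days: int = 14) -> None:
        """Per-DAG retention over the REST API: delete non-failed runs after keep_days,
        failed runs after keep_failed_days (deletion only - no archiving)."""
        now = datetime.now(timezone.utc)
        for run in list_all_dag_runs(dag_id):  # pagination helper from the earlier sketch
            cutoff_days = keep_failed_days if run["state"] == "failed" else keep_days
            end_date = run.get("end_date")
            if not end_date:
                continue  # still running or never finished - leave it alone
            ended = datetime.fromisoformat(end_date.replace("Z", "+00:00"))
            if ended < now - timedelta(days=cutoff_days):
                resp = requests.delete(
                    f"{BASE_URL}/dags/{dag_id}/dagRuns/{run['dag_run_id']}",
                    auth=AUTH,
                )
                resp.raise_for_status()
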
>> >
>> > Thanks & Regards,
>> > Amogh Desai
>> >
>> > On Wed, Nov 5, 2025 at 3:12 AM Jens Scheffler <[email protected]> wrote:
>> >
>> >> Thanks Amogh for adding docs for migration hints.
>> >>
>> >> We actually suffer from a lot of integrations that were built in the past, which now makes it a hard and serious effort to migrate to version 3. So most probably we ourselves need to take option 2, knowing (like in the past) that we cannot ask for support. But at least this un-blocks us from having to stay on 2.x.
>> >>
>> >> I'd love to take route 1 as well, but then a lot of code needs to be re-written. This will take time, and in the mid term we will migrate to (1).
>> >>
>> >> As in the dev call, I'd love it if in Airflow 3.2 we could have option 1 supported out-of-the-box - knowing that some security discussion is implied, so it may need to be turned on explicitly and not be enabled by default.
>> >>
>> >> The use cases we have which require some kind of DB access, and where the Task SDK is not helping:
>> >>
>> >> * Adding task and dag run notes to tasks as better readable status while and after execution
>> >> * Aggregate status of tasks in the upstream of same Dag (pass, fail, listing)
>> >> * Custom mass-triggering of other dags and collection of results from triggered dags as scale-out option for dynamic task mapping
>> >> * Adjusting Pools based on available workers
>> >> * Checking results of pass/fail per edge worker and, depending on stability, adjusting Queues on Edge workers based on status and errors of workers
>> >> * Adjust Pools based on time of day
>> >> * And the famous: Partial database clean on a per Dag level with different retention
>> >>
>> >> I would be okay removing option 3, and a clear warning on option 2 is also okay.
>> >>
>> >> Jens
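For the two Pool items in that list, resizing is already covered by the stable REST API's pools endpoint; a minimal sketch (the update_mask parameter comes from that API, while BASE_URL/AUTH remain deployment-specific assumptions as before):

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption, as in the earlier sketches
    AUTH = ("admin", "admin")                   # assumption, as in the earlier sketches

    def set_pool_slots(pool_name: str, slots: int) -> None:
        """Resize an existing pool, e.g. from a periodic 'ops' DAG that checks
        available workers or the time of day."""
        resp = requests.patch(
            f"{BASE_URL}/pools/{pool_name}",
            json={"name": pool_name, "slots": slots},
            params={"update_mask": "slots"},
            auth=AUTH,
        )
        resp.raise_for_status()
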
>> >>
>> >> On 11/4/25 13:06, Jarek Potiuk wrote:
>> >>> My take (and details can be found in the discussion):
>> >>>
>> >>> 2. Don't make the impression it is something that we will support - and explain to the users that it **WILL** break in the future and it's on **THEM** to fix when it breaks.
>> >>>
>> >>> The 2 is **kinda** possible but we should strongly discourage this and say "this will break any time and it's you who have to adapt to any future changes in schema" - we had a lot of similar cases in the past where our users felt entitled when **something** they considered a "valid way of using things" got broken by our changes. If we say "recommended" they will take it as "and all the usage there is expected to work when Airflow gets a new version, so I should be fully entitled to open a valid issue when things change". I think "recommended" in this case is far too strong from our side.
>> >>>
>> >>> 3. Absolutely remove.
>> >>>
>> >>> Sounds like we are going back to Airflow 2 behaviour. And we've made all the effort to break out of that. Various things will start breaking in Airflow 3.2 and beyond. Once we complete the task isolation work, Airflow workers will NOT have the sqlalchemy package installed by default - it simply will not be a task-sdk dependency. The fact that you **can** use sqlalchemy now is mostly a by-product of the fact that we have not completed the split yet - but it was not even **SUPPOSED** to work.
>> >>>
>> >>> J.
>> >>>
>> >>> On Tue, Nov 4, 2025 at 10:03 AM Amogh Desai <[email protected]> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> I'm working on expanding the Airflow 3 upgrade documentation to address a frequently asked question from users migrating from Airflow 2.x: "How do I access the metadata database from my tasks now that direct database access is blocked?"
>> >>>>
>> >>>> Currently, Step 5 of the upgrade guide [1] only mentions that direct DB access is blocked and points to a GitHub issue. However, users need concrete guidance on migration options.
>> >>>>
>> >>>> I've drafted documentation via [2] describing three approaches, but before finalising it, I'd like to get community consensus on how we should present these options, especially given the architectural principles we've established with Airflow 3.
>> >>>>
>> >>>> ## Proposed Approaches
>> >>>>
>> >>>> Approach 1: Airflow Python Client (REST API)
>> >>>> - Uses `apache-airflow-client` [3] to interact via the REST API
>> >>>> - Pros: No DB drivers needed, aligned with Airflow 3 architecture, API-first
>> >>>> - Cons: Requires package installation, API server dependency, auth token management, limited operations possible
>> >>>>
>> >>>> Approach 2: Database Hooks (PostgresHook/MySqlHook)
>> >>>> - Create a connection to the metadata DB and use DB hooks to execute SQL directly
>> >>>> - Pros: Uses Airflow connection management, simple SQL interface
>> >>>> - Cons: Requires DB drivers, direct network access, bypasses the Airflow API server and connects to the DB directly
>> >>>>
>> >>>> Approach 3: Direct SQLAlchemy Access (last resort)
>> >>>> - Use an environment variable with the DB connection string and create a SQLAlchemy session directly
>> >>>> - Pros: Maximum flexibility
>> >>>> - Cons: Bypasses all Airflow protections, schema coupling, manual connection management, worst possible option
>> >>>>
>> >>>> I was expecting some pushback regarding these approaches, and there were (rightly) some important concerns raised by Jarek about Approaches 2 and 3:
>> >>>>
>> >>>> 1. Breaks Task Isolation - Contradicts Airflow 3's core promise
>> >>>> 2. DB as Public Interface - Schema changes would require release notes and break user code
>> >>>> 3. Performance Impact - Approach 2 creates direct DB access and can bring back Airflow 2's connection-per-task overhead
>> >>>> 4. Security Model Violation - Contradicts documented isolation principles
>> >>>>
>> >>>> Considering these comments, this is what I want to document now:
>> >>>>
>> >>>> 1. Approach 1 - Keep as the primary/recommended solution (aligns with Airflow 3 architecture)
>> >>>> 2. Approach 2 - Present as a "known workaround" (not a recommendation) with explicit warnings about breaking isolation, the schema not being a public API, performance implications, and no support guarantees
>> >>>> 3. Approach 3 - Remove entirely, or keep with the strongest possible warnings (would love to hear what others think for this one particularly)
>> >>>>
>> >>>> Once we arrive at some discussion points on this one, I would like to call for a lazy consensus for posterity and visibility of the community.
>> >>>>
>> >>>> Looking forward to your feedback!
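To make the Approach 2 discussion concrete, this is roughly the pattern the warnings apply to - a minimal sketch, assuming a user-created Connection named "airflow_metadata_db" pointing at the metadata database (the connection id is hypothetical; the dag_run table and its columns are internal schema, not a public interface):

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def dag_run_counts_by_state(dag_id: str) -> list[tuple]:
        """Approach 2: raw SQL against the metadata DB via a provider hook.
        Works today, but couples user code to internal tables that can change
        in any Airflow release without notice."""
        hook = PostgresHook(postgres_conn_id="airflow_metadata_db")  # hypothetical connection id
        return hook.get_records(
            "SELECT state, COUNT(*) FROM dag_run WHERE dag_id = %s GROUP BY state",
            parameters=(dag_id,),
        )
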
>> >>>>
>> >>>> [1] https://github.com/apache/airflow/blob/main/airflow-core/docs/installation/upgrading_to_airflow3.rst#step-5-review-custom-operators-for-direct-db-access
>> >>>> [2] https://github.com/apache/airflow/pull/57479
>> >>>> [3] https://github.com/apache/airflow-client-python
