> Yes, the API could do this with 5 times more code, including handling the per-response limits where you need to loop over all pages until you have a full list (e.g. the API is limited to 100 results). Not impossible, but a lot of re-implementation.
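For reference, a minimal sketch of that pagination loop against the stable REST API (the base URL, API version and auth below are assumptions and will differ per deployment):

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption: adjust host/version for your deployment
    AUTH = ("admin", "admin")                   # assumption: replace with your API server's auth

    def list_all_dag_runs(dag_id: str, page_size: int = 100) -> list[dict]:
        """Collect every DAG run of a DAG by following limit/offset pagination."""
        runs, offset = [], 0
        while True:
            resp = requests.get(
                f"{BASE_URL}/dags/{dag_id}/dagRuns",
                params={"limit": page_size, "offset": offset},
                auth=AUTH,
            )
            resp.raise_for_status()
            payload = resp.json()
            runs.extend(payload["dag_runs"])
            offset += page_size
            if offset >= payload["total_entries"]:
                return runs

The generated apache-airflow-client wraps the same endpoints; the raw endpoint just makes the limit/offset loop explicit.
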
Just wondering, why not vanilla task mapping?

> Might be something that could be a potential contribution to "airflow db clean"

Maybe, yes.

Thanks & Regards,
Amogh Desai

On Thu, Nov 6, 2025 at 12:53 PM Amogh Desai <[email protected]> wrote:

> > I think our efforts should be way more focused on adding some missing API calls in Task SDK that our users miss, rather than in allowing them to use "old ways". Every time someone says "I cannot migrate because I did this", our first thought should be:
> >
> > * is it a valid way?
> > * is it acceptable to have an API call for it in SDK?
> > * should we do it?
>
> That is currently a grey zone we need to define better, I think. Certain use cases might be general enough that we need an execution API endpoint for that, and we can certainly do that. But there will also be cases when the use case is niche and we will NOT want to have execution API endpoints for that, for various reasons. The harder problem to solve is the latter.
>
> But you make a fair point here.
>
> Thanks & Regards,
> Amogh Desai
>
> On Thu, Nov 6, 2025 at 2:33 AM Jens Scheffler <[email protected]> wrote:
>
>> > Thanks for your comments too, Jens.
>> >
>> >> * Aggregate status of tasks in the upstream of same Dag (pass, fail, listing)
>> >
>> > Does the DAG run page not show that?
>>
>> Partly yes, but in our environment it is a bit more complex than "pass/fail". It is a bit more involved: we want to know more details of the failed tasks and aggregate those details. So, high level: get the XCom from the failed tasks and then aggregate the details. Imagine all tasks have an owner and we want to send a notification to each owner, but if 10 tasks from one owner fail we want to send 1 notification with the 10 failures in the text. And, yes, it can be done via the API.
>>
>> >> * Custom mass-triggering of other dags and collection of results from triggered dags as scale-out option for dynamic task mapping
>> >
>> > Can't an API do that?
>>
>> Yes, the API could do this with 5 times more code, including handling the per-response limits where you need to loop over all pages until you have a full list (e.g. the API is limited to 100 results). Not impossible, but a lot of re-implementation.
>>
>> >> * And the famous: Partial database clean on a per Dag level with different retention
>> >
>> > Can you elaborate this one a bit :D
>>
>> Yes. We have some Dag that is called 50k-100k times per day and others that are called 12 times a day. And a lot of others in-between, like 25k runs per month. The Dag with 100k runs per day we want to archive ASAP, probably after 3 days for all non-failed runs, to reduce DB overhead. The failed ones we keep for 14 days for potential re-processing if there was an outage.
>>
>> Most other Dag Runs we keep for a month. And for some we cap it: we archive if there are more than 25k runs.
>>
>> Might be something that could be a potential contribution to "airflow db clean".
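Until something like that exists in "airflow db clean", the API-only version of per-DAG retention looks roughly like the sketch below: it only deletes (no archiving), and it reuses the list_all_dag_runs helper and the BASE_URL/AUTH assumptions from the earlier sketch; dag_run_id, state and end_date are fields of the stable REST API's DAG run response.

    from datetime import datetime, timedelta, timezone

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption, as in the earlier sketch
    AUTH = ("admin", "admin")                   # assumption, as in the earlier sketch

    def delete_old_dag_runs(dag_id: str, keep_days: int = 3, keep_failed_days: int = 14) -> None:
        """Per-DAG retention over the REST API: delete non-failed runs after keep_days,
        failed runs after keep_failed_days (deletion only - no archiving)."""
        now = datetime.now(timezone.utc)
        for run in list_all_dag_runs(dag_id):  # pagination helper from the earlier sketch
            cutoff_days = keep_failed_days if run["state"] == "failed" else keep_days
            end_date = run.get("end_date")
            if not end_date:
                continue  # still running or never finished - leave it alone
            ended = datetime.fromisoformat(end_date.replace("Z", "+00:00"))
            if ended < now - timedelta(days=cutoff_days):
                resp = requests.delete(
                    f"{BASE_URL}/dags/{dag_id}/dagRuns/{run['dag_run_id']}",
                    auth=AUTH,
                )
                resp.raise_for_status()
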
>> >
>> > Thanks & Regards,
>> > Amogh Desai
>> >
>> > On Wed, Nov 5, 2025 at 3:12 AM Jens Scheffler <[email protected]> wrote:
>> >
>> >> Thanks Amogh for adding docs for migration hints.
>> >>
>> >> We actually suffer from a lot of integrations that were built in the past, which now makes it a hard and serious effort to migrate to version 3. So most probably we ourselves need to take option 2, knowing (like in the past) that we cannot ask for support. But at least this un-blocks us from having to stay on 2.x.
>> >>
>> >> I'd love to take route 1 as well, but then a lot of code needs to be re-written. This will take time, and in the mid term we will migrate to (1).
>> >>
>> >> As in the dev call, I'd love it if in Airflow 3.2 we could have option 1 supported out-of-the-box - knowing that some security discussion is implied, so it may need to be turned on explicitly and not be enabled by default.
>> >>
>> >> The use cases we have which require some kind of DB access, and where the Task SDK is not helping:
>> >>
>> >> * Adding task and dag run notes to tasks as better readable status while and after execution
>> >> * Aggregate status of tasks in the upstream of same Dag (pass, fail, listing)
>> >> * Custom mass-triggering of other dags and collection of results from triggered dags as scale-out option for dynamic task mapping
>> >> * Adjusting Pools based on available workers
>> >> * Checking results of pass/fail per edge worker and, depending on stability, adjusting Queues on Edge workers based on status and errors of workers
>> >> * Adjust Pools based on time of day
>> >> * And the famous: Partial database clean on a per Dag level with different retention
>> >>
>> >> I would be okay removing option 3, and a clear warning on option 2 is also okay.
>> >>
>> >> Jens
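For the two Pool items in that list, resizing is already covered by the stable REST API's pools endpoint; a minimal sketch (the update_mask parameter comes from that API, while BASE_URL/AUTH remain deployment-specific assumptions as before):

    import requests

    BASE_URL = "http://localhost:8080/api/v1"  # assumption, as in the earlier sketches
    AUTH = ("admin", "admin")                   # assumption, as in the earlier sketches

    def set_pool_slots(pool_name: str, slots: int) -> None:
        """Resize an existing pool, e.g. from a periodic 'ops' DAG that checks
        available workers or the time of day."""
        resp = requests.patch(
            f"{BASE_URL}/pools/{pool_name}",
            json={"name": pool_name, "slots": slots},
            params={"update_mask": "slots"},
            auth=AUTH,
        )
        resp.raise_for_status()
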
>> >>
>> >> On 11/4/25 13:06, Jarek Potiuk wrote:
>> >>> My take (and details can be found in the discussion):
>> >>>
>> >>> 2. Don't make the impression it is something that we will support - and explain to the users that it **WILL** break in the future and it's on **THEM** to fix when it breaks.
>> >>>
>> >>> The 2 is **kinda** possible but we should strongly discourage this and say "this will break any time and it's you who have to adapt to any future changes in schema" - we had a lot of similar cases in the past where our users felt entitled when **something** they considered a "valid way of using things" got broken by our changes. If we say "recommended" they will take it as "and all the usage there is expected to work when Airflow gets a new version, so I should be fully entitled to open a valid issue when things change". I think "recommended" in this case is far too strong from our side.
>> >>>
>> >>> 3. Absolutely remove.
>> >>>
>> >>> Sounds like we are going back to Airflow 2 behaviour. And we've made all the effort to break out of that. Various things will start breaking in Airflow 3.2 and beyond. Once we complete the task isolation work, Airflow workers will NOT have the sqlalchemy package installed by default - it simply will not be a task-sdk dependency. The fact that you **can** use sqlalchemy now is mostly a by-product of the fact that we have not completed the split yet - but it was not even **SUPPOSED** to work.
>> >>>
>> >>> J.
>> >>>
>> >>> On Tue, Nov 4, 2025 at 10:03 AM Amogh Desai <[email protected]> wrote:
>> >>>> Hi All,
>> >>>>
>> >>>> I'm working on expanding the Airflow 3 upgrade documentation to address a frequently asked question from users migrating from Airflow 2.x: "How do I access the metadata database from my tasks now that direct database access is blocked?"
>> >>>>
>> >>>> Currently, Step 5 of the upgrade guide [1] only mentions that direct DB access is blocked and points to a GitHub issue. However, users need concrete guidance on migration options.
>> >>>>
>> >>>> I've drafted documentation via [2] describing three approaches, but before finalising it, I'd like to get community consensus on how we should present these options, especially given the architectural principles we've established with Airflow 3.
>> >>>>
>> >>>> ## Proposed Approaches
>> >>>>
>> >>>> Approach 1: Airflow Python Client (REST API)
>> >>>> - Uses `apache-airflow-client` [3] to interact via the REST API
>> >>>> - Pros: No DB drivers needed, aligned with Airflow 3 architecture, API-first
>> >>>> - Cons: Requires package installation, API server dependency, auth token management, limited operations possible
>> >>>>
>> >>>> Approach 2: Database Hooks (PostgresHook/MySqlHook)
>> >>>> - Create a connection to the metadata DB and use DB hooks to execute SQL directly
>> >>>> - Pros: Uses Airflow connection management, simple SQL interface
>> >>>> - Cons: Requires DB drivers, direct network access, bypasses the Airflow API server and connects to the DB directly
>> >>>>
>> >>>> Approach 3: Direct SQLAlchemy Access (last resort)
>> >>>> - Use an environment variable with the DB connection string and create a SQLAlchemy session directly
>> >>>> - Pros: Maximum flexibility
>> >>>> - Cons: Bypasses all Airflow protections, schema coupling, manual connection management, worst possible option
>> >>>>
>> >>>> I was expecting some pushback regarding these approaches, and there were (rightly) some important concerns raised by Jarek about Approaches 2 and 3:
>> >>>>
>> >>>> 1. Breaks Task Isolation - Contradicts Airflow 3's core promise
>> >>>> 2. DB as Public Interface - Schema changes would require release notes and break user code
>> >>>> 3. Performance Impact - Approach 2 creates direct DB access and can bring back Airflow 2's connection-per-task overhead
>> >>>> 4. Security Model Violation - Contradicts documented isolation principles
>> >>>>
>> >>>> Considering these comments, this is what I want to document now:
>> >>>>
>> >>>> 1. Approach 1 - Keep as the primary/recommended solution (aligns with Airflow 3 architecture)
>> >>>> 2. Approach 2 - Present as a "known workaround" (not a recommendation) with explicit warnings about breaking isolation, the schema not being a public API, performance implications, and no support guarantees
>> >>>> 3. Approach 3 - Remove entirely, or keep with the strongest possible warnings (would love to hear what others think for this one particularly)
>> >>>>
>> >>>> Once we arrive at some discussion points on this one, I would like to call for a lazy consensus for posterity and visibility of the community.
>> >>>>
>> >>>> Looking forward to your feedback!
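To make the Approach 2 discussion concrete, this is roughly the pattern the warnings apply to - a minimal sketch, assuming a user-created Connection named "airflow_metadata_db" pointing at the metadata database (the connection id is hypothetical; the dag_run table and its columns are internal schema, not a public interface):

    from airflow.providers.postgres.hooks.postgres import PostgresHook

    def dag_run_counts_by_state(dag_id: str) -> list[tuple]:
        """Approach 2: raw SQL against the metadata DB via a provider hook.
        Works today, but couples user code to internal tables that can change
        in any Airflow release without notice."""
        hook = PostgresHook(postgres_conn_id="airflow_metadata_db")  # hypothetical connection id
        return hook.get_records(
            "SELECT state, COUNT(*) FROM dag_run WHERE dag_id = %s GROUP BY state",
            parameters=(dag_id,),
        )
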
>> >>>>
>> >>>> [1] https://github.com/apache/airflow/blob/main/airflow-core/docs/installation/upgrading_to_airflow3.rst#step-5-review-custom-operators-for-direct-db-access
>> >>>> [2] https://github.com/apache/airflow/pull/57479
>> >>>> [3] https://github.com/apache/airflow-client-python
