Thanks Jens and Jarek, I agree on both points raised in the comments.

I am happy to defer the embedding of HITL to a separate AIP.

To Jens:
 Yes, it's planned in phases; our plan starts with provider-only changes.

Regards
Pavan

On Sun, Dec 28, 2025 at 2:03 PM Jarek Potiuk <[email protected]> wrote:
>
> I also looked at it and I love it as well. I think of it as a missing
> abstraction between current Airflow users and current LLM app developers. I
> also proposed something a little bolder there, which I think shows the
> true potential of that approach.
> I added a comment in the doc, but I will copy it here for better visibility.
>
> ---
>
> After thinking quite a bit about the proposal, I actually love it and I
> think it should be the next frontier of making Airflow abstractions more
> approachable and usable by those who want to implement various patterns of
> interacting with LLMs.
>
> And I have a slightly different opinion than Jens regarding HITL. I see those
> common LLM operators as slightly "higher"-level operators that might
> implement a set of common LLM-related patterns that are currently either
> difficult or impossible to express by putting together a Dag out of
> individual tasks. In this sense, the capability of making a HITL call-out for
> approval or selection from within such an operator - without completing the
> operator, and even running those "call-outs" more than once, potentially an
> unbounded number of times during a single operator's execution - fits
> naturally at this level.
>
> Actually it's a great way for us to implement some "cyclicness" - without
> breaking the "acyclic" property of our Dags (for now at least). Making Dags
> "cyclic" would be quite a dramatic change, and possibly we do not even have
> to do it, because the "cyclic" part can likely be encompassed within the
> specialized LLM operators. I can imagine an operator that performs an LLM
> query and refines it via additional interactions with LLMs "internally" -
> during a single operator's execution. And some of those iterations might
> result in a HITL "call-out" - even multiple times during one execution.
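>
> To make the idea concrete, here is a minimal sketch of what such an
> operator's internal loop could look like. Everything below - the LLM helpers,
> the Feedback shape, and the request_human_approval call-out - is a
> hypothetical placeholder, not an existing or proposed API:
>
> ```python
> from dataclasses import dataclass
>
> from airflow.models.baseoperator import BaseOperator
>
>
> @dataclass
> class Feedback:
>     """Hypothetical stand-in for a HITL response."""
>     approved: bool
>     comments: str = ""
>
>
> def generate_draft(prompt):
>     """Hypothetical LLM call producing a first answer."""
>     raise NotImplementedError
>
>
> def refine_draft(draft, comments):
>     """Hypothetical LLM call refining the answer based on feedback."""
>     raise NotImplementedError
>
>
> def request_human_approval(draft, context) -> Feedback:
>     """Hypothetical HITL call-out: pause, ask a human, resume."""
>     raise NotImplementedError
>
>
> class LLMRefineOperator(BaseOperator):
>     """Iterates LLM refinement - and HITL call-outs - inside one execution."""
>
>     def __init__(self, prompt, max_iterations=5, require_approval=True, **kwargs):
>         super().__init__(**kwargs)
>         self.prompt = prompt
>         self.max_iterations = max_iterations
>         self.require_approval = require_approval
>
>     def execute(self, context):
>         draft = generate_draft(self.prompt)
>         if not self.require_approval:
>             return draft
>         for _ in range(self.max_iterations):
>             # The "cycle" lives entirely inside the operator, so the Dag
>             # itself stays acyclic.
>             feedback = request_human_approval(draft, context)
>             if feedback.approved:
>                 break
>             draft = refine_draft(draft, feedback.comments)
>         return draft
> ```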
>
> Also, one more proposal I have here is to use an API similar to HITL (or
> maybe repurpose HITL for that) - to report PROGRESS of such a task. It is a
> typical property of a good LLM task that it provides some feedback to the
> user - it might be HITL when it asks for something, but it might also be
> HOOTL (Human Outside Of The Loop) - where the task simply reports its
> progress and allows the user to perform asynchronous actions based on that
> progress → for example abort the execution (to stop the Dag), mark it as
> "skipped" (to trigger the skip-processing path), or mark it as "success" to
> simulate things being completed when they are not. While we already have
> those three "async" operations, we do not currently have "progress" targeted
> at the kind of actor who is also the HITL "actor" - someone who is not
> interested in detailed logs, but rather wants to monitor progress and assess
> the quality of the output - even if it is just a partial output in the
> iterative process.
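>
> Purely as an illustration of the shape such a progress call-out could take
> (report_progress and run_iterative_llm_job are hypothetical names; nothing
> like this exists today):
>
> ```python
> def report_progress(task_instance, fraction, summary):
>     """Hypothetical call-out: surface partial output and completion fraction
>     to the same place HITL prompts go, instead of burying it in task logs."""
>     raise NotImplementedError
>
>
> def run_iterative_llm_job(steps, context):
>     """Sketch of a long-running LLM task reporting progress each iteration."""
>     results = []
>     for i, step in enumerate(steps, start=1):
>         results.append(step())  # each step is e.g. one LLM refinement round
>         report_progress(
>             task_instance=context["ti"],
>             fraction=i / len(steps),
>             summary=str(results[-1])[:200],  # partial output for a human reviewer
>         )
>     return results
> ```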
>
> I think that it will be easier and much more "surgical" (and applied in the
> right place) to embed this "iterative" feedback / progress than to change
> the "acyclic" property of our Dags.
>
> Also - this kind of Progress interface could also be used to publish the
> progress of "async" tasks as the next step of [WIP] AIP-98: Add async support
> for PythonOperator in Airflow 3:
> https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-98%3A+Add+async+support+for+PythonOperator+in+Airflow+3
> that we discussed with David.
>
> J.
>
>
>
> On Sun, Dec 28, 2025 at 2:16 PM Jens Scheffler <[email protected]> wrote:
>
> > I like the AIP very much, and in my view it can be implemented completely
> > in a provider package... with some comments (which I assume are
> > non-blocking). I would propose to really start in increments and then
> > adjust by learning along the way.
> >
> > On 12/27/25 22:00, Pavankumar Gopidesu wrote:
> > > Thanks Giorgio Zoppi for reviewing the AIP. Yes, it's already planned as
> > > part of this AIP; see the example in [1], where you can enable or disable
> > > the HITL step. So it's an integrated part of the operator, with the help
> > > of the HITL operator.
> > >
> > > ```python
> > > from datetime import timedelta
> > >
> > > LLMDataQualityOperator(
> > >     task_id="customer_quality_analysis",
> > >     data_sources=[customer_s3],
> > >     prompt="Generate data quality validation queries",
> > >     require_approval=True,  # Built-in HITL
> > >     approval_timeout=timedelta(hours=2),
> > > )
> > > ```
> > >
> > > [1]:
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> > >
> > > Regards,
> > > Pavan
> > >
> > > On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <[email protected]>
> > wrote:
> > >> Hello,
> > >> Just my 1c from skimming the AIP:
> > >> you might want to explore how to avoid human approval for generated
> > >> queries by using an LLM as a judge to evaluate their quality. The nice
> > >> thing about data pipelines is automation.
> > >>
> > >>
> > >>
> > >>
> > >> On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <
> > [email protected]>
> > >> wrote:
> > >>
> > >>> Hello everyone,
> > >>>
> > >>> The thread has been quiet for some time, and I would like to restart
> > >>> the discussion with the AIP.
> > >>>
> > >>> First, a sincere thank you to Kaxil for presenting the idea at Airflow
> > >>> Summit 2025. The session was very well received, and many attendees
> > >>> expressed strong interest in the proposal. Unfortunately, I was unable
> > >>> to attend the summit due to visa issues, but I am hopeful I will be
> > >>> able to join next year.
> > >>>
> > >>> The demo included well-structured prototypes. For those who were
> > >>> unable to attend the session, please refer to the recorded talk here
> > >>> [1].
> > >>>
> > >>> I have also drafted the complete AIP proposal, which is available here
> > >>> [2]. I would greatly appreciate your reviews and look forward to
> > >>> feedback and further discussion.
> > >>>
> > >>> Finally, to those celebrating Christmas, I wish you a very happy
> > >>> Christmas and a wonderful holiday season.
> > >>>
> > >>> Regards
> > >>> Pavan
> > >>>
> > >>> [1] https://www.youtube.com/watch?v=XSAzSDVUi2o
> > >>> [2]
> > >>>
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> > >>>
> > >>> On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <[email protected]>
> > wrote:
> > >>>> Thanks Pavan and Kaxil, seems like an interesting idea and a pretty
> > >>>> reasonable problem to solve.
> > >>>>
> > >>>> I also like the idea of starting with
> > >>> `apache-airflow-providers-common-ai`
> > >>>> and expanding as / when needed.
> > >>>>
> > >>>> Looking forward to when the recording is out; I missed attending this
> > >>>> session at the Airflow Summit.
> > >>>>
> > >>>> Thanks & Regards,
> > >>>> Amogh Desai
> > >>>>
> > >>>>
> > >>>> On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <[email protected]>
> > wrote:
> > >>>>
> > >>>>> Yea I think it should be apache-airflow-providers-common-ai
> > >>>>>
> > >>>>> On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <
> > >>> [email protected]>
> > >>>>> wrote:
> > >>>>>
> > >>>>>> Yes, it's a new provider, starting out as completely experimental;
> > >>>>>> we don't want to break functionality in existing providers :)
> > >>>>>>
> > >>>>>> Mostly they are SQL-based operators, so I named it sql-ai, but I
> > >>>>>> agree we can make it generic without specifying SQL in it :)
> > >>>>>>
> > >>>>>> Pavan
> > >>>>>>
> > >>>>>> On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
> > >>>>>> <[email protected]> wrote:
> > >>>>>>> Would this really necessitate a new provider? Should this just be
> > >>> baked
> > >>>>>>> into the common SQL provider?
> > >>>>>>>
> > >>>>>>> Alternatively, instead of a narrow `sql-ai` provider, why not have a
> > >>>>>>> generic common-ai provider with a SQL package, which would allow us
> > >>>>>>> to build AI-based subpackages into the provider beyond just SQL?
> > >>>>>>>
> > >>>>>>> On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <
> > >>>>>> [email protected]>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> @Giorgio Yes indeed, that's also a good thought to integrate. I
> > >>>>>>>> will keep it in mind when I draft the AIP and will write about this
> > >>>>>>>> a bit more :)
> > >>>>>>>> Yes please join. We have great demos packed on this topic :)
> > >>>>>>>>
> > >>>>>>>> @kaxil, yes, that's a great blog post from Wren AI, leveraging
> > >>>>>>>> Apache DataFusion as a query engine to connect to different data
> > >>>>>>>> sources.
> > >>>>>>>> Pavan
> > >>>>>>>>
> > >>>>>>>> On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <
> > >>>>> [email protected]
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Hey Pavan,
> > >>>>>>>>> Some notes:
> > >>>>>>>>> 1. LLMs can also be very useful for detecting the root causes of
> > >>>>>>>>> errors while developing and designing a pipeline. Let me explain:
> > >>>>>>>>> in the past we had several Spark processes; when everything is
> > >>>>>>>>> green all is fine, but when one fails, it would be nice to have an
> > >>>>>>>>> integrated tool you can ask why.
> > >>>>>>>>> 2. Ideally such an operator could be a ModelContextProtocolOperator,
> > >>>>>>>>> and you would not need anything more than passing an LLM as a
> > >>>>>>>>> parameter to that operator, then calling tools, executing queries,
> > >>>>>>>>> and so on. This would be more powerful, because you create an
> > >>>>>>>>> abstraction over devices, databases, servers and so on, so each
> > >>>>>>>>> data source can be injected into the pipeline.
> > >>>>>>>>> 3. Good job! Looking forward to seeing the presentation.
> > >>>>>>>>> Best Regards,
> > >>>>>>>>> Giorgio
> > >>>>>>>>>
> > >>>>>>>>> On Tue, Sep 30, 2025 at 2:51 PM Pavankumar Gopidesu <
> > >>>>>>>>> [email protected]> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> Hi everyone,
> > >>>>>>>>>>
> > >>>>>>>>>> We're exploring adding LLM-powered SQL operators to Airflow
> > >>> and
> > >>>>>> would
> > >>>>>>>>> love
> > >>>>>>>>>> community input before writing an AIP.
> > >>>>>>>>>>
> > >>>>>>>>>> The idea: Let users write natural language prompts like "find
> > >>>>>> customers
> > >>>>>>>>>> with missing emails" and have Airflow generate safe SQL
> > >>> queries
> > >>>>>> with
> > >>>>>>>> full
> > >>>>>>>>>> context about your database schema, connections, and data
> > >>>>>> sensitivity.
> > >>>>>>>>>> Why this matters:
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Most of us spend too much time on schema drift detection and
> > >>>>> manual
> > >>>>>>>> data
> > >>>>>>>>>> quality checks. Meanwhile, AI agents are getting powerful but
> > >>>>> lack
> > >>>>>>>>>> production-ready data integrations. Airflow could bridge this
> > >>>>> gap.
> > >>>>>>>>>> Here's what we're dealing with at Tavant:
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> Our team works with multiple data domain teams producing
> > >>> data in
> > >>>>>>>>> different
> > >>>>>>>>>> formats and storage across S3, PostgreSQL, Iceberg, and
> > >>> Aurora.
> > >>>>>> When
> > >>>>>>>> data
> > >>>>>>>>>> assets become available for consumption, we need:
> > >>>>>>>>>>
> > >>>>>>>>>> - Detection of breaking schema changes between systems
> > >>>>>>>>>>
> > >>>>>>>>>> - Data quality assessments between snapshots
> > >>>>>>>>>>
> > >>>>>>>>>> - Validation that assets meet mandatory metadata requirements
> > >>>>>>>>>>
> > >>>>>>>>>> - Lookup validation against existing data (comparing file
> > >>> feeds
> > >>>>>> with
> > >>>>>>>>>> different formats to existing data in Iceberg/Aurora)
> > >>>>>>>>>>
> > >>>>>>>>>> This is exactly the type of work that LLMs could automate while
> > >>>>>>>>>> maintaining governance.
> > >>>>>>>>>>
> > >>>>>>>>>> What we're thinking:
> > >>>>>>>>>>
> > >>>>>>>>>> ```python
> > >>>>>>>>>> # Instead of writing complex SQL by hand...
> > >>>>>>>>>> quality_check = LLMSQLQueryOperator(
> > >>>>>>>>>>     task_id="find_data_issues",
> > >>>>>>>>>>     prompt="Find customers with invalid email formats and missing phone numbers",
> > >>>>>>>>>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > >>>>>>>>>>     # Built-in safety: won't generate DROP/DELETE statements
> > >>>>>>>>>> )
> > >>>>>>>>>> ```
> > >>>>>>>>>>
> > >>>>>>>>>> The operator would:
> > >>>>>>>>>>
> > >>>>>>>>>> - Auto-inject database schema, sample data, and connection details
> > >>>>>>>>>> - Generate safe SQL (blocks dangerous operations; see the sketch below)
> > >>>>>>>>>> - Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
> > >>>>>>>>>> - Support schema drift detection between systems
> > >>>>>>>>>> - Handle multi-cloud data via Apache DataFusion [1] (did some experiments
> > >>>>>>>>>>   with 50M+ records; results in 10-15 seconds for common aggregations)
> > >>>>>>>>>>
> > >>>>>>>>>> For more info on benchmarks, see [2].
> > >>>>>>>>>>
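> > >>>>>>>>>> As a rough illustration of the "safe SQL" bullet above, a guard
> > >>>>>>>>>> could start as simple as an allow-list check on the generated
> > >>>>>>>>>> statement. This is a naive sketch only (the is_safe_sql helper is
> > >>>>>>>>>> hypothetical, not the AIP's actual design); a real implementation
> > >>>>>>>>>> would need proper SQL parsing and dialect awareness:
> > >>>>>>>>>>
> > >>>>>>>>>> ```python
> > >>>>>>>>>> import re
> > >>>>>>>>>>
> > >>>>>>>>>> FORBIDDEN = ("DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT")
> > >>>>>>>>>>
> > >>>>>>>>>> def is_safe_sql(generated_sql: str) -> bool:
> > >>>>>>>>>>     """Naive allow-list: only read-only statements pass."""
> > >>>>>>>>>>     statement = generated_sql.strip().rstrip(";")
> > >>>>>>>>>>     first_word = statement.split(None, 1)[0].upper() if statement else ""
> > >>>>>>>>>>     if first_word not in ("SELECT", "WITH"):
> > >>>>>>>>>>         return False
> > >>>>>>>>>>     # Reject any write/DDL keyword appearing as a whole word anywhere.
> > >>>>>>>>>>     return not any(
> > >>>>>>>>>>         re.search(rf"\b{kw}\b", statement, flags=re.IGNORECASE)
> > >>>>>>>>>>         for kw in FORBIDDEN
> > >>>>>>>>>>     )
> > >>>>>>>>>> ```
> > >>>>>>>>>>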
> > >>>>>>>>>> Key benefit: Assets become smarter with structured metadata
> > >>>>>> (schema,
> > >>>>>>>>>> sensitivity, format) instead of just throwing everything in
> > >>>>>> `extra`.
> > >>>>>>>>>> Implementation plan:
> > >>>>>>>>>>
> > >>>>>>>>>> Start with a separate provider
> > >>>>> (`apache-airflow-providers-sql-ai`)
> > >>>>>> so
> > >>>>>>>> we
> > >>>>>>>>>> can iterate without touching the Airflow core. No breaking
> > >>>>> changes,
> > >>>>>>>> works
> > >>>>>>>>>> with existing connections and hooks.
> > >>>>>>>>>>
> > >>>>>>>>>> I am presenting this at Airflow Summit 2025 in Seattle with
> > >>>>> Kaxil -
> > >>>>>>>> come
> > >>>>>>>>>> see the live demo!
> > >>>>>>>>>>
> > >>>>>>>>>> Next steps:
> > >>>>>>>>>>
> > >>>>>>>>>> If this resonates after the Summit, we'll write a proper AIP
> > >>> with
> > >>>>>>>>> technical
> > >>>>>>>>>> details and further build a working prototype.
> > >>>>>>>>>>
> > >>>>>>>>>> Thoughts? Concerns? Better ideas?
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> [1]: https://datafusion.apache.org/
> > >>>>>>>>>>
> > >>>>>>>>>> [2]:
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>
> > https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Pavan
> > >>>>>>>>>>
> > >>>>>>>>>> P.S. - Happy to share more technical details with anyone
> > >>>>>> interested.
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Life is a chess game - Anonymous.
> > >>>>>>>>>

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
