Re: AI-Native Airflow - LLM-Driven Intelligence for Production Data Workflows

Pavankumar Gopidesu Tue, 13 Jan 2026 14:53:47 -0800

Thanks Alex,

I agree that evals will be a core part of the operator implementations. I
haven’t yet fully thought through the structure or how best to expose and
serve evals across operators, so your perspective is very timely. The idea
of a BaseEvals operator is interesting as well.


Thank you for offering your support; we’ll definitely take you up on that.
I’ll reach out when we move into implementation so we can collaborate on
this.

Regards.
Pavan




On Tue, Jan 13, 2026 at 11:43 AM Alex <[email protected]> wrote:

> Thanks Pavan, this thread and the AIP are awesome!
>
> I've started using and advocating an eval-first approach (including a
> lightning talk at the Airflow Summit [1]), not just for traditional
> software developers but also for new builders from other domains (so I
> can't just say "it's like TDD with integration tests for AI apps"). I'd be
> happy to help build evals for, test, design, or brainstorm components in
> this space.
>
> I firmly believe evals are a key area, and I'm starting to contact the MCP
> server pioneers I met at the last summit so we can experiment with building
> a testbed [2] to evaluate operators/agents/MCPs/skills.
>
> Including a BaseEvals operator (which I believe differs from the goal of
> LLMDataQualityOperator) in the proposal might be worth it (unless the evals
> scope deserves its own place).
>
> Any specific area where you'd like support?
>
> Thanks,
> Alex
>
> - [1] https://alexhans.github.io/posts/talk.toward-a-shared-vision-of-llm-evals-in-airflow-ecosystem.html
> - [2] https://github.com/Alexhans/evals-playground
>
> On Thu, Jan 8, 2026 at 9:43 PM Pavankumar Gopidesu <
> [email protected]>
> wrote:
>
> > Thanks, Niko, for reviewing.
> >
> > For now I am moving the cyclicness implementation to future scope,
> > perhaps a new AIP where we can bring it in and rethink it.
> >
> > Regards,
> > Pavan
> >
> > On Wed, Jan 7, 2026 at 9:59 PM Oliveira, Niko <[email protected]>
> wrote:
> > >
> > > > I read through the AIP and I like the idea a lot! I see both sides of
> > > > where to put the HITL portion. But I think that's something we can
> > > > adjust one way or another (in an additive way), so if we find out later
> > > > that it's not the right fit, we can pivot.
> > >
> > > ________________________________
> > > From: Pavankumar Gopidesu <[email protected]>
> > > Sent: Monday, January 5, 2026 9:08:00 AM
> > > To: [email protected]
> > > Subject: RE: [EXT] AI-Native Airflow - LLM-Driven Intelligence for
> > Production Data Workflows
> > >
> > >
> > > > Yes Zoppi, as Kaxil mentioned, we will be using PydanticAI, which
> > > > provides nice interfaces for integrating validations, etc.
> > >
> > > Pavan
> > >
> > > On Wed, Dec 31, 2025 at 2:28 AM Kaxil Naik <[email protected]>
> wrote:
> > > >
> > > > > Evals will be part of it, as this will be built on top of PydanticAI,
> > > > > which supports them.
> > > >
> > > > On Mon, 29 Dec 2025 at 19:03, Giorgio Zoppi <[email protected]
> >
> > wrote:
> > > >
> > > > > > Hey Pavan,
> > > > > > If you are going to introduce this, have you thought about the
> > > > > > evaluation framework?
> > > > > > How do you evaluate the LLM operator?
> > > > >
> > > > > On Mon, Dec 29, 2025, 09:40 Pavankumar Gopidesu <
> > [email protected]>
> > > > > wrote:
> > > > >
> > > > > > > Thanks Jens and Jarek, I agree on both points raised in the comments.
> > > > > > >
> > > > > > > I am happy to defer the embedding of HITL to a separate AIP.
> > > > > > >
> > > > > > > To Jens:
> > > > > > > Yes, it's planned in phases; our plan starts with provider-only
> > > > > > > changes.
> > > > > >
> > > > > > Regards
> > > > > > Pavan
> > > > > >
> > > > > > On Sun, Dec 28, 2025 at 2:03 PM Jarek Potiuk <[email protected]>
> > wrote:
> > > > > > >
> > > > > > > > I also looked at it and I love it as well. I think of it as a
> > > > > > > > missing abstraction between current Airflow users and current
> > > > > > > > LLM app developers. I also proposed something a little bit
> > > > > > > > bolder there, which I think shows the true potential of that
> > > > > > > > approach. I added a comment in the doc, but I will copy it here
> > > > > > > > for better visibility.
> > > > > > >
> > > > > > > ---
> > > > > > >
> > > > > > > > After thinking quite a bit about the proposal, I actually love
> > > > > > > > it, and I think that should be the next frontier of making
> > > > > > > > Airflow abstractions more approachable and usable by those who
> > > > > > > > want to implement various patterns of interacting with LLMs.
> > > > > > >
> > > > > > > > And I have a slightly different opinion than Jens regarding
> > > > > > > > HITL. I see those common LLM operators as slightly "higher"
> > > > > > > > level operators that might implement a set of common
> > > > > > > > LLM-related patterns that are currently either difficult or
> > > > > > > > impossible to express by putting things together via Dags and
> > > > > > > > individual tasks. In this sense, such an operator should be
> > > > > > > > capable of making a HITL call-out for approval or selection
> > > > > > > > from within the operator, without completing it, and even of
> > > > > > > > running those "call-outs" more than once, possibly an unbounded
> > > > > > > > number of times, during a single operator's execution.
> > > > > > >
> > > > > > > > Actually it's a great way for us to implement some
> > > > > > > > "cyclicness" - without breaking the "acyclic" property of our
> > > > > > > > Dags (for now at least). Making Dags "cyclic" is quite a
> > > > > > > > dramatic change, and possibly we do not even have to do it,
> > > > > > > > because the "cyclic" part can likely be encompassed within the
> > > > > > > > specialized LLM operators. I can imagine an operator that
> > > > > > > > performs LLM querying and refines the result via additional
> > > > > > > > interactions with LLMs "internally" - during a single
> > > > > > > > operator's execution. And some of those iterations might
> > > > > > > > result in a HITL "call-out" - even multiple times during one
> > > > > > > > execution.
> > > > > > >
> > > > > > > > Also, one more proposal I have here is to use an API similar
> > > > > > > > to HITL (or maybe repurpose HITL for that) to report PROGRESS
> > > > > > > > of such a task. It is a typical property of a good LLM task
> > > > > > > > that it provides some feedback to the user - it might be HITL
> > > > > > > > when it asks for something, but it might also be HOOTL (Human
> > > > > > > > Outside Of The Loop) - where the task simply reports its
> > > > > > > > progress and allows the user to perform asynchronous actions
> > > > > > > > based on that progress: for example, abort the execution (to
> > > > > > > > stop the Dag), mark it as "skipped" (to trigger the skip
> > > > > > > > processing path), or mark it as "success" to simulate things
> > > > > > > > being completed when they are not. While we already have those
> > > > > > > > three "async" operations, we do not currently have "progress"
> > > > > > > > reporting targeted at the kind of actor who is also a HITL
> > > > > > > > "actor" - someone who is not interested in detailed logs, but
> > > > > > > > rather wants to monitor progress and assess the quality of the
> > > > > > > > output (even if it is just a partial output in the iterative
> > > > > > > > process).
> > > > > > >
> > > > > > > > I think that it will be easier and much more "surgical" (and
> > > > > > > > applied in the right place) to embed this "iterative"
> > > > > > > > feedback / progress than to change the "acyclic" property of
> > > > > > > > our Dags.
> > > > > > >
> > > > > > > > Also - this kind of Progress interface can also be used to
> > > > > > > > publish the progress of "async" tasks as the next step of
> > > > > > > > [WIP] AIP-98: Add async support for PythonOperator in Airflow 3:
> > > > > > > > https://cwiki.apache.org/confluence/display/AIRFLOW/%5BWIP%5D+AIP-98%3A+Add+async+support+for+PythonOperator+in+Airflow+3
> > > > > > > > that we discussed with David.
> > > > > > >
> > > > > > > J.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Sun, Dec 28, 2025 at 2:16 PM Jens Scheffler <
> > [email protected]>
> > > > > > wrote:
> > > > > > >
> > > > > > > > > I like the AIP very much, and in my view it can be made
> > > > > > > > > completely in a Provider package. I left some comments
> > > > > > > > > (non-blocking, I assume) and would propose to really start
> > > > > > > > > in increments and then adjust by learning along the way.
> > > > > > > >
> > > > > > > > On 12/27/25 22:00, Pavankumar Gopidesu wrote:
> > > > > > > > > > Thanks, Giorgio Zoppi, for reviewing the AIP. Yes, it's
> > > > > > > > > > already planned as part of this AIP; see the example in
> > > > > > > > > > [1], where you can disable or enable the HITL step. So it's
> > > > > > > > > > an integrated part of the operator, with the help of the
> > > > > > > > > > HITL operator.
> > > > > > > > >
> > > > > > > > > ```
> > > > > > > > > LLMDataQualityOperator(
> > > > > > > > >
> > > > > > > > >      task_id="customer_quality_analysis",
> > > > > > > > >
> > > > > > > > >      data_sources=[customer_s3],
> > > > > > > > >
> > > > > > > > >      prompt="Generate data quality validation queries",
> > > > > > > > >
> > > > > > > > >      require_approval=True,  # Built-in HITL
> > > > > > > > >
> > > > > > > > >      approval_timeout=timedelta(hours=2)
> > > > > > > > >
> > > > > > > > > )
> > > > > > > > > ```
> > > > > > > > >
> > > > > > > > > [1]:
> > > > > > > >
> > > > > >
> > > > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Pavan
> > > > > > > > >
> > > > > > > > > On Sat, Dec 27, 2025 at 9:16 AM Giorgio Zoppi <
> > > > > > [email protected]>
> > > > > > > > wrote:
> > > > > > > > > >> Hello,
> > > > > > > > > >> Just my 1c from skimming the AIP:
> > > > > > > > > >> You might want to explore how to avoid human approval for
> > > > > > > > > >> generated queries by using an LLM as a judge to evaluate
> > > > > > > > > >> their quality. The nice thing about data pipelines is
> > > > > > > > > >> automation.
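
One way to picture the LLM-as-judge gate suggested above, as a sketch only: `judge` is a hypothetical callable that returns a score between 0 and 1 for a generated query, and the 0.8 threshold is an arbitrary illustrative value, not something from the AIP. Queries scoring below the threshold would still go to a human.

```python
from typing import Callable


def needs_human_approval(
    query: str,
    judge: Callable[[str], float],
    threshold: float = 0.8,
) -> bool:
    """Return True when the judge's score is too low to auto-approve.

    `judge` is a placeholder for an LLM-as-judge call (hypothetical); it is
    expected to return a quality score in the range [0, 1].
    """
    score = judge(query)
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    return score < threshold
```

A gate like this would let most runs stay fully automated while keeping the human approval step as a fallback for low-confidence queries.
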
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >>
> > > > > > > > >> On Wed, Dec 24, 2025, 10:23 Pavankumar Gopidesu <
> > > > > > > > [email protected]>
> > > > > > > > >> wrote:
> > > > > > > > >>
> > > > > > > > >>> Hello everyone,
> > > > > > > > >>>
> > > > > > > > > >>> The thread has been quiet for some time, and I would like
> > > > > > > > > >>> to restart the discussion with the AIP.
> > > > > > > > > >>>
> > > > > > > > > >>> First, a sincere thank you to Kaxil for presenting the
> > > > > > > > > >>> idea at Airflow Summit 2025. The session was very well
> > > > > > > > > >>> received, and many attendees expressed strong interest in
> > > > > > > > > >>> the proposal. Unfortunately, I was unable to attend the
> > > > > > > > > >>> summit due to visa issues, but I am hopeful I will be
> > > > > > > > > >>> able to join next year.
> > > > > > > > > >>>
> > > > > > > > > >>> The demo included well-structured prototypes. For those
> > > > > > > > > >>> who were unable to attend the session, please refer to
> > > > > > > > > >>> the recorded talk here [1].
> > > > > > > > > >>>
> > > > > > > > > >>> I have also drafted the complete AIP proposal, which is
> > > > > > > > > >>> available here [2]. I would greatly appreciate your
> > > > > > > > > >>> reviews and look forward to feedback and further
> > > > > > > > > >>> discussion.
> > > > > > > > > >>>
> > > > > > > > > >>> Finally, to those celebrating Christmas, I wish you a
> > > > > > > > > >>> very happy Christmas and a wonderful holiday season.
> > > > > > > > >>>
> > > > > > > > >>> Regards
> > > > > > > > >>> Pavan
> > > > > > > > >>>
> > > > > > > > >>> [1] https://www.youtube.com/watch?v=XSAzSDVUi2o
> > > > > > > > >>> [2]
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> >
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=406618285
> > > > > > > > >>>
> > > > > > > > >>> On Wed, Oct 15, 2025 at 6:13 AM Amogh Desai <
> > > > > [email protected]
> > > > > > >
> > > > > > > > wrote:
> > > > > > > > > >>>> Thanks Pavan and Kaxil, seems like an interesting idea
> > > > > > > > > >>>> and a pretty reasonable problem to solve.
> > > > > > > > > >>>>
> > > > > > > > > >>>> I also like the idea of starting with
> > > > > > > > > >>>> `apache-airflow-providers-common-ai` and expanding as /
> > > > > > > > > >>>> when needed.
> > > > > > > > > >>>>
> > > > > > > > > >>>> Looking forward to when the recording will be out; I
> > > > > > > > > >>>> missed attending this session at the Airflow Summit.
> > > > > > > > >>>>
> > > > > > > > >>>> Thanks & Regards,
> > > > > > > > >>>> Amogh Desai
> > > > > > > > >>>>
> > > > > > > > >>>>
> > > > > > > > >>>> On Thu, Oct 9, 2025 at 10:49 AM Kaxil Naik <
> > [email protected]
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >>>>
> > > > > > > > > >>>>> Yea I think it should be
> > > > > > > > > >>>>> apache-airflow-providers-common-ai
> > > > > > > > >>>>>
> > > > > > > > >>>>> On Wed, 8 Oct 2025 at 02:04, Pavankumar Gopidesu <
> > > > > > > > >>> [email protected]>
> > > > > > > > >>>>> wrote:
> > > > > > > > >>>>>
> > > > > > > > > >>>>>> Yes, it's a new provider, starting out as completely
> > > > > > > > > >>>>>> experimental; we don't want to break functionality in
> > > > > > > > > >>>>>> existing providers :)
> > > > > > > > > >>>>>>
> > > > > > > > > >>>>>> Mostly it's SQL-based operators, so I named it sql-ai,
> > > > > > > > > >>>>>> but I agree we can make it generic without specifying
> > > > > > > > > >>>>>> SQL in the name :)
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> Pavan
> > > > > > > > >>>>>>
> > > > > > > > >>>>>> On Tue, Oct 7, 2025 at 3:48 PM Ryan Hatter via dev
> > > > > > > > >>>>>> <[email protected]> wrote:
> > > > > > > > > >>>>>>> Would this really necessitate a new provider? Should
> > > > > > > > > >>>>>>> this just be baked into the common SQL provider?
> > > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>> Alternatively, instead of a narrow `sql-ai` provider,
> > > > > > > > > >>>>>>> why not have a generic common ai provider with a SQL
> > > > > > > > > >>>>>>> package, which would allow us to build AI-based
> > > > > > > > > >>>>>>> subpackages into the provider other than just SQL?
> > > > > > > > >>>>>>>
> > > > > > > > >>>>>>> On Mon, Oct 6, 2025 at 4:31 PM Pavankumar Gopidesu <
> > > > > > > > >>>>>> [email protected]>
> > > > > > > > >>>>>>> wrote:
> > > > > > > > >>>>>>>
> > > > > > > > > >>>>>>>> @Giorgio Yes indeed, that's also a good thought to
> > > > > > > > > >>>>>>>> integrate. I will keep it in mind when I draft the
> > > > > > > > > >>>>>>>> AIP and will write about this a bit more :)
> > > > > > > > > >>>>>>>> Yes, please join. We have great demos packed on this
> > > > > > > > > >>>>>>>> topic :)
> > > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>> @kaxil, yes, that's a great blog post from Wren AI
> > > > > > > > > >>>>>>>> on leveraging Apache DataFusion as a query engine to
> > > > > > > > > >>>>>>>> connect to different data sources.
> > > > > > > > >>>>>>>> Pavan
> > > > > > > > >>>>>>>>
> > > > > > > > >>>>>>>> On Tue, Sep 30, 2025 at 7:37 PM Giorgio Zoppi <
> > > > > > > > >>>>> [email protected]
> > > > > > > > >>>>>>>> wrote:
> > > > > > > > >>>>>>>>
> > > > > > > > > >>>>>>>>> Hey Pavan,
> > > > > > > > > >>>>>>>>> Some notes:
> > > > > > > > > >>>>>>>>> 1. LLMs can also be very useful for detecting the
> > > > > > > > > >>>>>>>>> root causes of errors while developing and
> > > > > > > > > >>>>>>>>> designing a pipeline. Let me explain: in the past
> > > > > > > > > >>>>>>>>> we had several Spark processes; when everything is
> > > > > > > > > >>>>>>>>> green it's OK, but when one fails, it would be nice
> > > > > > > > > >>>>>>>>> to have an integrated tool to ask why.
> > > > > > > > > >>>>>>>>> 2. Ideally such an operator could be a
> > > > > > > > > >>>>>>>>> ModelContextProtocolOperator, and you would need
> > > > > > > > > >>>>>>>>> nothing more than to pass an LLM as a parameter to
> > > > > > > > > >>>>>>>>> that operator and just call tools, execute queries,
> > > > > > > > > >>>>>>>>> and so on. This would be more powerful, because you
> > > > > > > > > >>>>>>>>> create an abstraction between devices, databases,
> > > > > > > > > >>>>>>>>> servers, and so on, so each source of data can be
> > > > > > > > > >>>>>>>>> injected into the pipeline.
> > > > > > > > > >>>>>>>>> 3. Good job! Looking forward to seeing the
> > > > > > > > > >>>>>>>>> presentation.
> > > > > > > > > >>>>>>>>> Best Regards,
> > > > > > > > > >>>>>>>>> Giorgio
> > > > > > > > >>>>>>>>>
> > > > > > > > > >>>>>>>>> On Tue, Sep 30, 2025 at 14:51 Pavankumar Gopidesu <
> > > > > > > > > >>>>>>>>> [email protected]> wrote:
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>>> Hi everyone,
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> We're exploring adding LLM-powered SQL operators
> > > > > > > > > >>>>>>>>>> to Airflow and would love community input before
> > > > > > > > > >>>>>>>>>> writing an AIP.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> The idea: Let users write natural language prompts
> > > > > > > > > >>>>>>>>>> like "find customers with missing emails" and have
> > > > > > > > > >>>>>>>>>> Airflow generate safe SQL queries with full
> > > > > > > > > >>>>>>>>>> context about your database schema, connections,
> > > > > > > > > >>>>>>>>>> and data sensitivity.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Why this matters:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Most of us spend too much time on schema drift
> > > > > > > > > >>>>>>>>>> detection and manual data quality checks.
> > > > > > > > > >>>>>>>>>> Meanwhile, AI agents are getting powerful but lack
> > > > > > > > > >>>>>>>>>> production-ready data integrations. Airflow could
> > > > > > > > > >>>>>>>>>> bridge this gap.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Here's what we're dealing with at Tavant:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Our team works with multiple data domain teams
> > > > > > > > > >>>>>>>>>> producing data in different formats and storage
> > > > > > > > > >>>>>>>>>> across S3, PostgreSQL, Iceberg, and Aurora. When
> > > > > > > > > >>>>>>>>>> data assets become available for consumption, we
> > > > > > > > > >>>>>>>>>> need:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> - Detection of breaking schema changes between
> > > > > > > > > >>>>>>>>>>   systems
> > > > > > > > > >>>>>>>>>> - Data quality assessments between snapshots
> > > > > > > > > >>>>>>>>>> - Validation that assets meet mandatory metadata
> > > > > > > > > >>>>>>>>>>   requirements
> > > > > > > > > >>>>>>>>>> - Lookup validation against existing data
> > > > > > > > > >>>>>>>>>>   (comparing file feeds with different formats to
> > > > > > > > > >>>>>>>>>>   existing data in Iceberg/Aurora)
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> This is exactly the type of work that LLMs could
> > > > > > > > > >>>>>>>>>> automate while maintaining governance.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> What we're thinking:
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> ```python
> > > > > > > > > >>>>>>>>>> # Instead of writing complex SQL by hand...
> > > > > > > > > >>>>>>>>>> quality_check = LLMSQLQueryOperator(
> > > > > > > > > >>>>>>>>>>     task_id="find_data_issues",
> > > > > > > > > >>>>>>>>>>     prompt="Find customers with invalid email formats and missing phone numbers",
> > > > > > > > > >>>>>>>>>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > > > > > > > > >>>>>>>>>>     # Built-in safety: won't generate DROP/DELETE statements
> > > > > > > > > >>>>>>>>>> )
> > > > > > > > > >>>>>>>>>> ```
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> The operator would:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> - Auto-inject database schema, sample data, and
> > > > > > > > > >>>>>>>>>>   connection details
> > > > > > > > > >>>>>>>>>> - Generate safe SQL (blocks dangerous operations)
> > > > > > > > > >>>>>>>>>> - Work across PostgreSQL, Snowflake, BigQuery with
> > > > > > > > > >>>>>>>>>>   dialect awareness
> > > > > > > > > >>>>>>>>>> - Support schema drift detection between systems
> > > > > > > > > >>>>>>>>>> - Handle multi-cloud data via Apache DataFusion [1]
> > > > > > > > > >>>>>>>>>>   (did some experiments with 50M+ records, and
> > > > > > > > > >>>>>>>>>>   results come back in 10-15 seconds for common
> > > > > > > > > >>>>>>>>>>   aggregations)
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> For more info on benchmarks, see [2].
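
A minimal sketch of the "blocks dangerous operations" idea, not the proposed provider's implementation: it assumes a simple keyword deny-list applied to a single statement, and `is_read_only_sql`, `BLOCKED_KEYWORDS`, and the helper regexes are hypothetical names introduced for illustration. A production guard would more likely parse the SQL with a dialect-aware parser.

```python
import re

# Hypothetical deny-list for illustration; not taken from the proposal.
BLOCKED_KEYWORDS = ("DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT")

_BLOCKED_RE = re.compile(
    r"\b(" + "|".join(BLOCKED_KEYWORDS) + r")\b", re.IGNORECASE
)
# Matches single-quoted SQL string literals, including '' escapes.
_STRING_LITERAL_RE = re.compile(r"'(?:[^']|'')*'")
_READ_PREFIX_RE = re.compile(r"^(select|with)\b", re.IGNORECASE)


def is_read_only_sql(sql: str) -> bool:
    """Return True if the text looks like a single read-only query."""
    # Blank out string literals so a value such as 'delete me' is not
    # mistaken for a DELETE statement.
    stripped = _STRING_LITERAL_RE.sub("''", sql).strip().rstrip(";")
    if ";" in stripped:
        return False  # more than one statement
    if not _READ_PREFIX_RE.match(stripped):
        return False  # must start with SELECT or WITH
    return _BLOCKED_RE.search(stripped) is None
```

A check like this would run on the generated SQL before execution, so a rejected query never reaches the database connection.
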
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Key benefit: Assets become smarter with structured
> > > > > > > > > >>>>>>>>>> metadata (schema, sensitivity, format) instead of
> > > > > > > > > >>>>>>>>>> just throwing everything in `extra`.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Implementation plan:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Start with a separate provider
> > > > > > > > > >>>>>>>>>> (`apache-airflow-providers-sql-ai`) so we can
> > > > > > > > > >>>>>>>>>> iterate without touching the Airflow core. No
> > > > > > > > > >>>>>>>>>> breaking changes, works with existing connections
> > > > > > > > > >>>>>>>>>> and hooks.
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> I am presenting this at Airflow Summit 2025 in
> > > > > > > > > >>>>>>>>>> Seattle with Kaxil - come see the live demo!
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> Next steps:
> > > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> If this resonates after the Summit, we'll write a
> > > > > > > > > >>>>>>>>>> proper AIP with technical details and further
> > > > > > > > > >>>>>>>>>> build a working prototype.
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Thoughts? Concerns? Better ideas?
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> [1]: https://datafusion.apache.org/
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> [2]:
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>
> > > > > > > >
> > > > > >
> > > > >
> >
> https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > > > > > > > >>>>>>>>>> Thanks,
> > > > > > > > >>>>>>>>>>
> > > > > > > > >>>>>>>>>> Pavan
> > > > > > > > >>>>>>>>>>
> > > > > > > > > >>>>>>>>>> P.S. - Happy to share more technical details with
> > > > > > > > > >>>>>>>>>> anyone interested.
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>>>> --
> > > > > > > > >>>>>>>>> Life is a chess game - Anonymous.
> > > > > > > > >>>>>>>>>
> > > > > > > > >>>>>>
> > > > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > >>>>>> To unsubscribe, e-mail:
> > [email protected]
> > > > > > > > >>>>>> For additional commands, e-mail:
> > [email protected]
> > > > > > > > >>>>>>
> > > > > > > > >>>>>>
> > > > > > > > >>>
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > >>> To unsubscribe, e-mail:
> [email protected]
> > > > > > > > >>> For additional commands, e-mail:
> > [email protected]
> > > > > > > > >>>
> > > > > > > > >>>
> > > > > > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > > For additional commands, e-mail:
> [email protected]
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > ---------------------------------------------------------------------
> > > > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > > > For additional commands, e-mail: [email protected]
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > ---------------------------------------------------------------------
> > > > > > To unsubscribe, e-mail: [email protected]
> > > > > > For additional commands, e-mail: [email protected]
> > > > > >
> > > > > >
> > > > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [email protected]
> > > For additional commands, e-mail: [email protected]
> > >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>
