+1 We have many AWS credits left in our community Airflow account that could be used for that kind of testing.
________________________________
From: Jarek Potiuk <[email protected]>
Sent: Tuesday, January 13, 2026 10:59:46 AM
To: [email protected]
Subject: RE: [EXT] AI-Native Airflow - LLM-Driven Intelligence for Production Data Workflows

FYI: Legally it is fine. Financially - I think we could get a donation for that if we asked :)

On Tue, Jan 13, 2026 at 6:36 PM Shahar Epstein <[email protected]> wrote:
> Great idea Pavan, and I would really love to see it happening!
>
> One thing that I'm quite concerned about: how are we going to test,
> evaluate, and assure the quality of the system prompts of these operators?
> Given that currently we cannot officially use AI in our CI to do all of
> that (legally and financially, AFAIK), I do not feel comfortable delivering
> the system prompts out of the box; I would rather let the user define them
> explicitly. We could recommend prompts in the docs based on the community's
> experience, but in any case I think it should be a required field with a
> clear disclaimer that the user is fully responsible for the system prompt.
>
> Shahar
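For concreteness, a minimal sketch of the required, user-supplied system prompt Shahar describes, written against the `LLMSQLQueryOperator` from Pavan's example quoted below. The operator is only proposed (it does not exist yet), and the `system_prompt` parameter name is assumed here for illustration:

```python
# Hypothetical sketch: LLMSQLQueryOperator is only proposed, and
# "system_prompt" is an assumed parameter name, not an agreed API.
quality_check = LLMSQLQueryOperator(
    task_id="find_data_issues",
    prompt="Find customers with invalid email formats and missing phone numbers",
    data_sources=[customer_asset],
    # Required field with no shipped default: the user owns this prompt,
    # per the disclaimer suggested above.
    system_prompt=(
        "You translate requests into a single read-only ANSI SQL SELECT "
        "statement. Never emit DDL or DML (DROP, DELETE, UPDATE, INSERT, "
        "ALTER). If a request cannot be satisfied read-only, refuse."
    ),
)
```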
> On Tue, Sep 30, 2025, 16:51 Pavankumar Gopidesu <[email protected]> wrote:
>
> > Hi everyone,
> >
> > We're exploring adding LLM-powered SQL operators to Airflow and would
> > love community input before writing an AIP.
> >
> > The idea: Let users write natural language prompts like "find customers
> > with missing emails" and have Airflow generate safe SQL queries with full
> > context about your database schema, connections, and data sensitivity.
> >
> > Why this matters:
> >
> > Most of us spend too much time on schema drift detection and manual data
> > quality checks. Meanwhile, AI agents are getting powerful but lack
> > production-ready data integrations. Airflow could bridge this gap.
> >
> > Here's what we're dealing with at Tavant:
> >
> > Our team works with multiple data domain teams producing data in
> > different formats and storage across S3, PostgreSQL, Iceberg, and Aurora.
> > When data assets become available for consumption, we need:
> >
> > - Detection of breaking schema changes between systems
> > - Data quality assessments between snapshots
> > - Validation that assets meet mandatory metadata requirements
> > - Lookup validation against existing data (comparing file feeds in
> >   different formats against existing data in Iceberg/Aurora)
> >
> > This is exactly the type of work that LLMs could automate while
> > maintaining governance.
> >
> > What we're thinking:
> >
> > ```python
> > # Instead of writing complex SQL by hand...
> > quality_check = LLMSQLQueryOperator(
> >     task_id="find_data_issues",
> >     prompt="Find customers with invalid email formats and missing phone numbers",
> >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> >     # Built-in safety: won't generate DROP/DELETE statements
> > )
> > ```
> >
> > The operator would:
> >
> > - Auto-inject database schema, sample data, and connection details
> > - Generate safe SQL, blocking dangerous operations (a validation sketch
> >   follows the thread)
> > - Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
> > - Support schema drift detection between systems
> > - Handle multi-cloud data via Apache DataFusion [1]; in some experiments
> >   with 50M+ records, common aggregations completed in 10-15 seconds (see
> >   [2] for benchmarks, and the DataFusion sketch after the thread)
> >
> > Key benefit: Assets become smarter with structured metadata (schema,
> > sensitivity, format) instead of just throwing everything in `extra`.
> >
> > Implementation plan:
> >
> > Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
> > can iterate without touching the Airflow core. No breaking changes; it
> > works with existing connections and hooks.
> >
> > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
> > see the live demo!
> >
> > Next steps:
> >
> > If this resonates after the Summit, we'll write a proper AIP with
> > technical details and then build a working prototype.
> >
> > Thoughts? Concerns? Better ideas?
> >
> > [1]: https://datafusion.apache.org/
> > [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> >
> > Thanks,
> >
> > Pavan
> >
> > P.S. - Happy to share more technical details with anyone interested.
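The "blocks dangerous operations" bullet above could be enforced with a validation pass over the generated SQL before anything is executed. A minimal sketch under that assumption (not the proposal's actual design; a production version would parse the SQL per dialect rather than match keywords):

```python
import re

# Conservative allowlist check: accept only a single read-only query.
# Keyword matching can false-positive on string literals; a real
# implementation would use a proper SQL parser instead.
FORBIDDEN = re.compile(
    r"\b(drop|delete|truncate|alter|update|insert|merge|grant|revoke|create)\b",
    re.IGNORECASE,
)

def validate_generated_sql(sql: str) -> str:
    """Return the statement if it is a single SELECT; raise otherwise."""
    # Naive split on ";"; semicolons inside string literals need a parser.
    statements = [s.strip() for s in sql.strip().split(";") if s.strip()]
    if len(statements) != 1:
        raise ValueError(f"expected exactly one statement, got {len(statements)}")
    stmt = statements[0]
    if not stmt.lower().startswith(("select", "with")):
        raise ValueError("only SELECT queries are allowed")
    if (match := FORBIDDEN.search(stmt)):
        raise ValueError(f"forbidden keyword: {match.group(0)}")
    return stmt

# validate_generated_sql("SELECT * FROM customers WHERE email IS NULL")  # ok
# validate_generated_sql("DROP TABLE customers")  # raises ValueError
```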
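As for the DataFusion bullet, the Python bindings make the kind of single-node Parquet aggregation benchmarked in [2] easy to reproduce locally. The table name, path, and query below are made up for illustration; remote sources such as S3 or Iceberg need additional object-store setup not shown here:

```python
from datafusion import SessionContext  # pip install datafusion

# Register a local Parquet file as a table and run a SQL aggregation.
ctx = SessionContext()
ctx.register_parquet("customers", "/data/customers.parquet")  # hypothetical path
df = ctx.sql(
    "SELECT country, COUNT(*) AS n "
    "FROM customers GROUP BY country ORDER BY n DESC LIMIT 10"
)
df.show()  # prints the result batches to stdout
```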
