Great to hear, Jarek and Niko! If so, then let's roll. Maybe it's a good time for me to re-instantiate the MCP AIP as well :)
Shahar

On Tue, Jan 13, 2026 at 9:00 PM Jarek Potiuk <[email protected]> wrote:

> FYI: Legally it is fine. Financially - I think we could get a donation for
> that if we asked :)
>
> On Tue, Jan 13, 2026 at 6:36 PM Shahar Epstein <[email protected]> wrote:
>
> > Great idea Pavan, and I would really love to see it happening!
> >
> > One thing that I'm quite concerned about: how are we going to test,
> > evaluate, and assure the quality of the system prompts of these
> > operators? Given that currently we cannot officially use AI in our CI to
> > do all of that (legally and financially, AFAIK), I do not feel
> > comfortable delivering the system prompts out of the box; I would rather
> > let the user define them explicitly instead. We could recommend prompts
> > in the docs based on the community's experience, but in any case I think
> > it should be a required field with a clear disclaimer that the user is
> > fully responsible for the system prompt.
> >
> > Shahar
> >
> > On Tue, Sep 30, 2025, 16:51 Pavankumar Gopidesu <[email protected]> wrote:
> >
> > > Hi everyone,
> > >
> > > We're exploring adding LLM-powered SQL operators to Airflow and would
> > > love community input before writing an AIP.
> > >
> > > The idea: Let users write natural language prompts like "find customers
> > > with missing emails" and have Airflow generate safe SQL queries with
> > > full context about your database schema, connections, and data
> > > sensitivity.
> > >
> > > Why this matters:
> > >
> > > Most of us spend too much time on schema drift detection and manual
> > > data quality checks. Meanwhile, AI agents are getting powerful but lack
> > > production-ready data integrations. Airflow could bridge this gap.
> > >
> > > Here's what we're dealing with at Tavant:
> > >
> > > Our team works with multiple data domain teams producing data in
> > > different formats and storage across S3, PostgreSQL, Iceberg, and
> > > Aurora. When data assets become available for consumption, we need:
> > >
> > > - Detection of breaking schema changes between systems
> > > - Data quality assessments between snapshots
> > > - Validation that assets meet mandatory metadata requirements
> > > - Lookup validation against existing data (comparing file feeds with
> > >   different formats to existing data in Iceberg/Aurora)
> > >
> > > This is exactly the type of work that LLMs could automate while
> > > maintaining governance.
> > >
> > > What we're thinking:
> > >
> > > ```python
> > > # Instead of writing complex SQL by hand...
> > > quality_check = LLMSQLQueryOperator(
> > >     task_id="find_data_issues",
> > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > >     # Built-in safety: won't generate DROP/DELETE statements
> > > )
> > > ```
> > >
> > > The operator would:
> > >
> > > - Auto-inject database schema, sample data, and connection details
> > > - Generate safe SQL (block dangerous operations)
> > > - Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
> > > - Support schema drift detection between systems
> > > - Handle multi-cloud data via Apache DataFusion [1] (we ran some
> > >   experiments with 50M+ records, and common aggregations complete in
> > >   10-15 seconds)
> > >
> > > For more on the benchmarks, see [2].
> > >
> > > Key benefit: Assets become smarter with structured metadata (schema,
> > > sensitivity, format) instead of just throwing everything in `extra`.
> > >
> > > Implementation plan:
> > >
> > > Start with a separate provider (`apache-airflow-providers-sql-ai`) so
> > > we can iterate without touching the Airflow core. No breaking changes;
> > > it works with existing connections and hooks.
> > >
> > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
> > > come see the live demo!
> > >
> > > Next steps:
> > >
> > > If this resonates after the Summit, we'll write a proper AIP with
> > > technical details and further build a working prototype.
> > >
> > > Thoughts? Concerns? Better ideas?
> > >
> > > [1]: https://datafusion.apache.org/
> > > [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > >
> > > Thanks,
> > > Pavan
> > >
> > > P.S. - Happy to share more technical details with anyone interested.
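The "won't generate DROP/DELETE statements" safety claim in Pavan's proposal could be prototyped with a simple read-only check on the generated SQL before execution. A minimal, hypothetical sketch (the function name `is_safe_sql` and the keyword list are illustrative, not the proposed operator's API; a real implementation would use a dialect-aware SQL parser rather than regexes):

```python
import re

# Keywords treated as mutating/dangerous for this sketch.
BLOCKED_KEYWORDS = {"DROP", "DELETE", "TRUNCATE", "ALTER", "UPDATE", "INSERT", "GRANT"}

def is_safe_sql(sql: str) -> bool:
    """Return True only for read-only statements (SELECT/WITH)."""
    statements = [s.strip() for s in sql.split(";") if s.strip()]
    if not statements:
        return False
    for stmt in statements:
        # Each statement must start with a read-only verb...
        if stmt.split(None, 1)[0].upper() not in ("SELECT", "WITH"):
            return False
        # ...and must not contain a blocked keyword anywhere,
        # e.g. hidden inside a CTE body.
        tokens = set(re.findall(r"[A-Za-z_]+", stmt.upper()))
        if tokens & BLOCKED_KEYWORDS:
            return False
    return True

print(is_safe_sql("SELECT email FROM customers WHERE email IS NULL"))  # True
print(is_safe_sql("DROP TABLE customers"))                             # False
print(is_safe_sql("SELECT 1; DELETE FROM customers"))                  # False
```

A check like this is deliberately conservative: it rejects some legitimate queries (e.g. a column literally named `update`), which is usually the right trade-off when the SQL comes from an LLM.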

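The "detection of breaking schema changes" use case in the proposal can likewise be sketched without any LLM in the loop. A hypothetical example (the `breaking_changes` helper and the `{column: type}` schema representation are illustrative assumptions, not part of the proposal): dropped columns and changed types are treated as breaking, while newly added columns are not.

```python
# Hypothetical breaking-change detector over {column_name: type} dicts.
def breaking_changes(old: dict, new: dict) -> list[str]:
    issues = []
    for col, typ in old.items():
        if col not in new:
            issues.append(f"column dropped: {col}")
        elif new[col] != typ:
            issues.append(f"type changed: {col} {typ} -> {new[col]}")
    return issues

old_schema = {"id": "bigint", "email": "varchar"}
new_schema = {"id": "bigint", "email": "text", "phone": "varchar"}
print(breaking_changes(old_schema, new_schema))
# ['type changed: email varchar -> text']
```

In the proposed design, the interesting part is that Airflow would supply the two schemas automatically from asset metadata; the comparison itself stays deterministic and testable.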