Thanks, Shahar, for the review. Prompts are provided directly by users as inputs to these operators. We do not supply prompts ourselves; instead, we provide a set of guidelines and examples. Ultimately, it is the user's responsibility to decide which prompts they use with these operators. Additionally, the operators will have built-in safety rules, for example: BLOCKED_KEYWORDS = ["DROP", "TRUNCATE", "GRANT", "REVOKE"].
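To make the safety-rule idea concrete, here is a minimal sketch of how a blocklist check could screen LLM-generated SQL before execution. The function name `validate_generated_sql` is hypothetical (not part of any existing provider); only the BLOCKED_KEYWORDS list comes from the proposal above:

```python
import re

# Example blocklist from the discussion above; a real operator would
# likely make this configurable.
BLOCKED_KEYWORDS = ["DROP", "TRUNCATE", "GRANT", "REVOKE"]


def validate_generated_sql(sql: str) -> None:
    """Raise ValueError if the generated SQL contains a blocked keyword.

    Matches whole words case-insensitively, so an identifier such as
    "drop_rate" does not trigger a false positive.
    """
    for keyword in BLOCKED_KEYWORDS:
        if re.search(rf"\b{keyword}\b", sql, flags=re.IGNORECASE):
            raise ValueError(
                f"Generated SQL contains blocked keyword: {keyword}"
            )


# A benign query passes silently; "DROP TABLE customers" would raise.
validate_generated_sql("SELECT id FROM customers WHERE email IS NULL")
```

Note that keyword blocklists are only a first line of defense; pairing them with a read-only database role on the Airflow connection would give a much stronger guarantee.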
I agree on the testing aspect, and with Jarek: if we receive any sponsor donations, that would be great. As Niko pointed out, we may also have the option to use AWS credits. I'll raise a discussion on how best to test these operators once we reach the implementation phase. When Kaxil and I demoed this idea at last year's Airflow Summit (although I couldn't attend in person 🙂), I used GitHub Models as part of my POC to showcase some of the functionality, such as query generation.

Regards,
Pavan

On Tue, Jan 13, 2026 at 9:28 PM Shahar Epstein <[email protected]> wrote:

> Great to hear, Jarek and Niko!
> If so, then let's roll. Maybe it's a good time for me to re-instantiate
> the MCP AIP as well :)
>
> Shahar
>
> On Tue, Jan 13, 2026 at 9:00 PM Jarek Potiuk <[email protected]> wrote:
>
> > FYI: Legally it is fine. Financially - I think we could get a donation
> > for that if we asked :)
> >
> > On Tue, Jan 13, 2026 at 6:36 PM Shahar Epstein <[email protected]> wrote:
> >
> > > Great idea, Pavan, and I would really love to see it happening!
> > >
> > > One thing that I'm quite concerned about: how are we going to test,
> > > evaluate, and assure the quality of the system prompts of these
> > > operators? Given that we currently cannot officially use AI in our CI
> > > to do all of that (legally & financially, AFAIK), I do not feel
> > > comfortable delivering the system prompts out of the box; I would
> > > rather let the user define them explicitly. We could recommend
> > > prompts in the docs based on the community's experience, but in any
> > > case I think it should be a required field with a clear disclaimer
> > > that the user is fully responsible for the system prompt.
> > >
> > > Shahar
> > >
> > > On Tue, Sep 30, 2025, 16:51 Pavankumar Gopidesu <[email protected]> wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > We're exploring adding LLM-powered SQL operators to Airflow and
> > > > would love community input before writing an AIP.
> > > >
> > > > The idea: Let users write natural language prompts like "find
> > > > customers with missing emails" and have Airflow generate safe SQL
> > > > queries with full context about your database schema, connections,
> > > > and data sensitivity.
> > > >
> > > > Why this matters:
> > > >
> > > > Most of us spend too much time on schema drift detection and manual
> > > > data quality checks. Meanwhile, AI agents are getting powerful but
> > > > lack production-ready data integrations. Airflow could bridge this
> > > > gap.
> > > >
> > > > Here's what we're dealing with at Tavant:
> > > >
> > > > Our team works with multiple data domain teams producing data in
> > > > different formats and storage across S3, PostgreSQL, Iceberg, and
> > > > Aurora. When data assets become available for consumption, we need:
> > > >
> > > > - Detection of breaking schema changes between systems
> > > > - Data quality assessments between snapshots
> > > > - Validation that assets meet mandatory metadata requirements
> > > > - Lookup validation against existing data (comparing file feeds
> > > >   with different formats to existing data in Iceberg/Aurora)
> > > >
> > > > This is exactly the type of work that LLMs could automate while
> > > > maintaining governance.
> > > >
> > > > What we're thinking:
> > > >
> > > > ```python
> > > > # Instead of writing complex SQL by hand...
> > > > quality_check = LLMSQLQueryOperator(
> > > >     task_id="find_data_issues",
> > > >     prompt="Find customers with invalid email formats and missing phone numbers",
> > > >     data_sources=[customer_asset],  # Airflow knows the schema automatically
> > > >     # Built-in safety: won't generate DROP/DELETE statements
> > > > )
> > > > ```
> > > >
> > > > The operator would:
> > > >
> > > > - Auto-inject database schema, sample data, and connection details
> > > > - Generate safe SQL (blocks dangerous operations)
> > > > - Work across PostgreSQL, Snowflake, and BigQuery with dialect
> > > >   awareness
> > > > - Support schema drift detection between systems
> > > > - Handle multi-cloud data via Apache DataFusion [1] (in some
> > > >   experiments with 50M+ records, common aggregations completed in
> > > >   10-15 seconds; see [2] for benchmarks)
> > > >
> > > > Key benefit: Assets become smarter with structured metadata
> > > > (schema, sensitivity, format) instead of just throwing everything
> > > > in `extra`.
> > > >
> > > > Implementation plan:
> > > >
> > > > Start with a separate provider (`apache-airflow-providers-sql-ai`)
> > > > so we can iterate without touching the Airflow core. No breaking
> > > > changes; it works with existing connections and hooks.
> > > >
> > > > I am presenting this at Airflow Summit 2025 in Seattle with Kaxil -
> > > > come see the live demo!
> > > >
> > > > Next steps:
> > > >
> > > > If this resonates after the Summit, we'll write a proper AIP with
> > > > technical details and build a working prototype.
> > > >
> > > > Thoughts? Concerns? Better ideas?
> > > >
> > > > [1]: https://datafusion.apache.org/
> > > > [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
> > > >
> > > > Thanks,
> > > > Pavan
> > > >
> > > > P.S. - Happy to share more technical details with anyone interested.
