Looks like Wren has done something similar with Apache DataFusion: https://getwren.ai/post/powering-semantic-sql-for-ai-agents-with-apache-datafusion - so there is definitely interest in this space.
On Tue, 30 Sept 2025 at 16:19, Kaxil Naik <[email protected]> wrote:

> Love the idea Pavan! Looking forward to hearing more from you at the
> Summit and speaking "a little" while at it.
>
> On Tue, 30 Sept 2025 at 14:51, Pavankumar Gopidesu <[email protected]> wrote:
>
>> Hi everyone,
>>
>> We're exploring adding LLM-powered SQL operators to Airflow and would
>> love community input before writing an AIP.
>>
>> The idea: let users write natural-language prompts like "find customers
>> with missing emails" and have Airflow generate safe SQL queries with full
>> context about your database schema, connections, and data sensitivity.
>>
>> Why this matters:
>>
>> Most of us spend too much time on schema drift detection and manual data
>> quality checks. Meanwhile, AI agents are getting powerful but lack
>> production-ready data integrations. Airflow could bridge this gap.
>>
>> Here's what we're dealing with at Tavant:
>>
>> Our team works with multiple data domain teams producing data in
>> different formats and storage across S3, PostgreSQL, Iceberg, and Aurora.
>> When data assets become available for consumption, we need:
>>
>> - Detection of breaking schema changes between systems
>> - Data quality assessments between snapshots
>> - Validation that assets meet mandatory metadata requirements
>> - Lookup validation against existing data (comparing file feeds in
>>   different formats against existing data in Iceberg/Aurora)
>>
>> This is exactly the type of work that LLMs could automate while
>> maintaining governance.
>>
>> What we're thinking:
>>
>> ```python
>> # Instead of writing complex SQL by hand...
>> quality_check = LLMSQLQueryOperator(
>>     task_id="find_data_issues",
>>     prompt="Find customers with invalid email formats and missing phone numbers",
>>     data_sources=[customer_asset],  # Airflow knows the schema automatically
>>     # Built-in safety: won't generate DROP/DELETE statements
>> )
>> ```
>>
>> The operator would:
>>
>> - Auto-inject database schema, sample data, and connection details
>> - Generate safe SQL (block dangerous operations)
>> - Work across PostgreSQL, Snowflake, and BigQuery with dialect awareness
>> - Support schema drift detection between systems
>> - Handle multi-cloud data via Apache DataFusion [1] (in our experiments
>>   with 50M+ records, common aggregations completed in 10-15 seconds; see
>>   [2] for benchmarks)
>>
>> Key benefit: assets become smarter with structured metadata (schema,
>> sensitivity, format) instead of just throwing everything in `extra`.
>>
>> Implementation plan:
>>
>> Start with a separate provider (`apache-airflow-providers-sql-ai`) so we
>> can iterate without touching the Airflow core. No breaking changes; it
>> works with existing connections and hooks.
>>
>> I am presenting this at Airflow Summit 2025 in Seattle with Kaxil - come
>> see the live demo!
>>
>> Next steps:
>>
>> If this resonates after the Summit, we'll write a proper AIP with
>> technical details and build a working prototype.
>>
>> Thoughts? Concerns? Better ideas?
>>
>> [1]: https://datafusion.apache.org/
>> [2]: https://datafusion.apache.org/blog/2024/11/18/datafusion-fastest-single-node-parquet-clickbench/
>>
>> Thanks,
>> Pavan
>>
>> P.S. - Happy to share more technical details with anyone interested.
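Since "blocks dangerous operations" is the part reviewers will poke at hardest, here is a minimal sketch of what that guardrail could look like. It assumes sqlglot for parsing (the proposal doesn't mandate any particular parser), and `validate_generated_sql` plus the banned-node list are illustrative names, not a proposed API:

```python
# Allow-list guardrail for LLM-generated SQL: parse it and only let
# read-only statements through. sqlglot is one option; sqlparse or a
# database-side dry run would work too.
import sqlglot
from sqlglot import exp

BANNED = (exp.Create, exp.Drop, exp.Delete, exp.Insert, exp.Update)

def validate_generated_sql(sql: str, dialect: str = "postgres") -> str:
    """Raise unless every statement is a plain read-only query."""
    for stmt in sqlglot.parse(sql, read=dialect):
        if stmt is None:
            continue  # empty statement, e.g. a trailing semicolon
        if not isinstance(stmt, (exp.Select, exp.Union)):
            raise ValueError(f"Blocked non-SELECT statement: {stmt.key}")
        if stmt.find(*BANNED) is not None:
            raise ValueError("Blocked dangerous nested operation")
    return sql

validate_generated_sql("SELECT id FROM customers WHERE email IS NULL")
```

An allow-list on the parsed tree (only SELECT/UNION roots) seems safer than grepping for keywords, since the generated SQL has to be treated as untrusted input.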
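The schema drift detection can also stay deterministic, with the LLM only triaging or summarising what the comparison finds. A minimal sketch, assuming each side's schema has already been flattened to a `{column: type}` dict (from `information_schema`, an Iceberg snapshot, etc.); the function is made up for illustration:

```python
# Compare a producer's schema against what a consumer currently sees and
# report the changes that would break the consumer.
def detect_breaking_changes(source: dict[str, str], target: dict[str, str]) -> list[str]:
    issues = []
    for column, dtype in source.items():
        if column not in target:
            issues.append(f"column missing downstream: {column}")
        elif target[column] != dtype:
            issues.append(f"type changed: {column} {dtype} -> {target[column]}")
    return issues

# Example: an S3 file feed vs. the Iceberg table it lands in.
feed = {"customer_id": "bigint", "email": "varchar", "phone": "varchar"}
table = {"customer_id": "bigint", "email": "text"}
print(detect_breaking_changes(feed, table))
# ['type changed: email varchar -> text', 'column missing downstream: phone']
```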
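On the "smarter assets" point: today schema and sensitivity can only ride along in the free-form `extra` dict. A sketch of the difference, using the Airflow 3 task SDK `Asset`; the structured keyword arguments in the second, commented-out form are purely hypothetical and would be exactly what the AIP has to define:

```python
from airflow.sdk import Asset  # Airflow 3 task SDK

# Status quo: metadata lives in an opaque blob.
customer_asset = Asset(
    uri="postgres://prod/customers",
    extra={"schema": {"email": "varchar"}, "sensitivity": "pii", "format": "table"},
)

# Hypothetical structured form (not a real Asset signature today):
# customer_asset = Asset(
#     uri="postgres://prod/customers",
#     schema={"email": "varchar"},
#     sensitivity="pii",
#     format="table",
# )
```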
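And on the multi-cloud piece: the datafusion Python bindings already make the "one engine over mixed storage" pattern cheap to prototype. A sketch assuming the feeds have been pulled down locally; paths and table names are invented:

```python
# Join a Parquet export against a CSV reference feed with a single SQL
# engine, regardless of where each side originally lived.
from datafusion import SessionContext

ctx = SessionContext()
ctx.register_parquet("feed", "exports/customers.parquet")  # hypothetical path
ctx.register_csv("reference", "reference/customers.csv")   # hypothetical path

missing = ctx.sql("""
    SELECT f.customer_id
    FROM feed f
    LEFT JOIN reference r ON f.customer_id = r.customer_id
    WHERE r.customer_id IS NULL
""").collect()  # list of Arrow record batches
```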
