Re: [PR] Aip 99 llmdataqualityoperator [airflow]

via GitHub Thu, 02 Apr 2026 04:05:40 -0700


cetingokhan commented on PR #62963:
URL: https://github.com/apache/airflow/pull/62963#issuecomment-4176546929


   Hi @kaxil and @gopidesupavan,
   Could I get your comments on the scope? I've summarised its current status 
below.
   
   
   
   ## Final Scope
   
   **Plan generation & caching**
   
   - Takes a prompts dict where each key is a check name and each value is a 
plain-language description
   - Sends all prompts and the target schema to the LLM in one call; the LLM 
groups related checks into optimised SQL queries to minimise round-trips
   - Serialises the generated plan to an Airflow Variable keyed by a hash of 
prompts + prompt_version + collect_unexpected — repeat runs skip the LLM call 
entirely
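
As a rough illustration of the cache key described above, the sketch below hashes a canonical JSON form of the three inputs. The function name `plan_cache_key`, the `dq_plan_` prefix, the choice of SHA-256 and the 16-character truncation are all assumptions made for illustration, not the operator's actual implementation.

```python
from __future__ import annotations

import hashlib
import json


def plan_cache_key(
    prompts: dict[str, str],
    prompt_version: str | None,
    collect_unexpected: bool,
) -> str:
    """Derive a stable key from the inputs that shape the generated plan.

    Hypothetical helper: the real operator may hash differently.
    """
    # sort_keys=True makes the JSON (and so the hash) independent of dict order.
    payload = json.dumps(
        {
            "prompts": prompts,
            "prompt_version": prompt_version,
            "collect_unexpected": collect_unexpected,
        },
        sort_keys=True,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"dq_plan_{digest[:16]}"


key = plan_cache_key(
    {"email_nulls": "At most 5% of emails may be null"},
    prompt_version="v1",
    collect_unexpected=False,
)
```

With a key like this, changing any prompt text, bumping prompt_version, or toggling collect_unexpected produces a new key and therefore a fresh LLM call.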
   
   **Built-in validators**
   
   - Factory functions like null_pct_check, row_count_check, unique_pct_check 
are ready to use from dq_validation
   - Each factory returns a callable that receives the raw scalar metric from 
the SQL result and returns True (pass) or False (fail)
   - Plain lambdas also work: "row_check": lambda v: v >= 1000
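
To make the factory pattern concrete, here is a minimal stand-in for two of the built-ins. The parameter names (`max_pct`, `min_count`) and the comparison logic are assumptions; the real signatures in dq_validation may differ.

```python
from typing import Callable


def null_pct_check(max_pct: float) -> Callable[[float], bool]:
    """Hypothetical factory: pass when the null fraction is at most max_pct."""

    def _validate(value: float) -> bool:
        return value <= max_pct

    return _validate


def row_count_check(min_count: int) -> Callable[[int], bool]:
    """Hypothetical factory: pass when the table has at least min_count rows."""

    def _validate(value: int) -> bool:
        return value >= min_count

    return _validate


# Each value is a callable that takes the raw scalar metric and returns a bool.
validators = {
    "email_nulls": null_pct_check(max_pct=0.05),
    "min_rows": row_count_check(min_count=1000),
    "row_check": lambda v: v >= 1000,  # a plain lambda works the same way
}
```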
   
   **Custom aggregate validators with register_validator**
   
   - @register_validator("my_check") registers a factory function in the global 
registry
   - The registered validator works the same way as built-ins: receives a raw 
scalar from the SQL SELECT column and returns a bool
   - Useful for domain-specific aggregate thresholds that do not fit the 
built-in factories
   
   
   **Custom row-level validators with register_validator(row_level=True)**
   
   - @register_validator("my_check", row_level=True) marks the validator as 
row-level
   - The LLM generates a plain SELECT for the column instead of an aggregate 
query
   - Each individual row value is passed to the validator callable; Airflow 
counts how many rows fail
   - Result is returned as a RowLevelResult with total, invalid, invalid_pct, 
and sample_violations
   - Pass/fail is decided by comparing invalid_pct against the validator's 
_max_invalid_pct attribute
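
A sketch of the row-level evaluation loop may help here. The field names of RowLevelResult follow the list above; everything else (the `evaluate_rows` helper, `sample_size`, and treating invalid_pct as a fraction between 0 and 1) is an assumption for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Iterable, List, Tuple


@dataclass
class RowLevelResult:
    total: int
    invalid: int
    invalid_pct: float  # assumed to be a fraction in [0, 1]
    sample_violations: List[Any] = field(default_factory=list)


def evaluate_rows(
    values: Iterable[Any],
    validator: Callable[[Any], bool],
    sample_size: int = 5,
) -> Tuple[RowLevelResult, bool]:
    """Apply `validator` to every row value and summarise the failures."""
    total = 0
    invalid = 0
    samples: List[Any] = []
    for value in values:
        total += 1
        if not validator(value):
            invalid += 1
            if len(samples) < sample_size:
                samples.append(value)
    invalid_pct = invalid / total if total else 0.0
    result = RowLevelResult(total, invalid, invalid_pct, samples)
    # Pass/fail compares invalid_pct with the validator's threshold attribute;
    # a validator without the attribute tolerates no invalid rows here.
    max_pct = getattr(validator, "_max_invalid_pct", 0.0)
    return result, invalid_pct <= max_pct


def looks_like_email(value: Any) -> bool:
    return isinstance(value, str) and "@" in value


looks_like_email._max_invalid_pct = 0.25

result, passed = evaluate_rows(["a@x.io", "bad", "b@y.io", "c@z.io"], looks_like_email)
```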
   
   **dry_run mode**
   
   - Set dry_run=True to generate and cache the plan without running any SQL
   - Returns only the serialised plan dict, immediately, with no check 
results
   
   **require_approval — HITL integration**
   
   - Set require_approval=True to gate SQL execution on human review
   - After plan generation the task defers, surfacing the full plan as a 
structured markdown review body in the HITL interface
   - SQL checks run only after the reviewer approves; rejection raises 
HITLRejectException
   - dry_run=True takes precedence: combining both flags returns the plan dict 
immediately without requesting approval (open question: is this the right 
behaviour?)
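
The interaction between the two flags can be summarised as a small decision function. This is only a sketch of the control flow as described in the bullets above, assuming dry_run does take precedence; the `NextStep` enum and `next_step` function are hypothetical names, not part of the operator.

```python
from enum import Enum


class NextStep(Enum):
    RETURN_PLAN = "return_plan"                # dry run: hand back the plan dict
    DEFER_FOR_APPROVAL = "defer_for_approval"  # wait for the HITL reviewer
    RUN_CHECKS = "run_checks"                  # execute the generated SQL


def next_step(dry_run: bool, require_approval: bool) -> NextStep:
    """Decide what happens after the plan has been generated and cached."""
    if dry_run:
        # Checked first, so it wins even when require_approval is also set.
        return NextStep.RETURN_PLAN
    if require_approval:
        return NextStep.DEFER_FOR_APPROVAL
    return NextStep.RUN_CHECKS
```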
   
   **Unexpected row collection**
   
   - Set collect_unexpected=True to have the LLM also generate a detail query 
alongside each validity/format check
   - When a check fails the detail query runs and attaches up to 
unexpected_sample_size violating rows to the report
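
The following sketch shows the idea with an in-memory SQLite table: the detail query runs only when the metric fails validation, and at most unexpected_sample_size rows are attached. The `run_check` helper, the report layout and the SQL are hand-written assumptions standing in for what the LLM and the operator would produce.

```python
import sqlite3


def run_check(conn, metric_sql, detail_sql, validator, unexpected_sample_size=3):
    """Run the metric query; on failure, fetch a sample of violating rows."""
    metric = conn.execute(metric_sql).fetchone()[0]
    passed = validator(metric)
    report = {"metric": metric, "passed": passed, "unexpected_rows": []}
    if not passed and detail_sql is not None:
        cursor = conn.execute(detail_sql)
        # fetchmany caps the sample without loading every violating row.
        report["unexpected_rows"] = cursor.fetchmany(unexpected_sample_size)
    return report


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(1, "a@x.io"), (2, "bad"), (3, "b@y.io"), (4, "worse")],
)

report = run_check(
    conn,
    metric_sql="SELECT COUNT(*) FROM users WHERE email NOT LIKE '%@%'",
    detail_sql="SELECT id, email FROM users WHERE email NOT LIKE '%@%' ORDER BY id",
    validator=lambda v: v == 0,
    unexpected_sample_size=1,
)
```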


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

