cetingokhan commented on PR #62963:
URL: https://github.com/apache/airflow/pull/62963#issuecomment-4176546929
Hi @kaxil and @gopidesupavan,
Could I get your comments on the scope? I've summarised the current status
below.
## Final Scope
**Plan generation & caching**
- Takes a prompts dict where each key is a check name and each value is a
plain-language description
- Sends all prompts and the target schema to the LLM in one call; the LLM
groups related checks into optimised SQL queries to minimise round-trips
- Serialises the generated plan to an Airflow Variable keyed by a hash of
prompts + prompt_version + collect_unexpected — repeat runs skip the LLM call
entirely
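As a rough illustration of the cache key described above, here is a minimal sketch. The function name `plan_cache_key`, the `dq_plan_` prefix, and the choice of SHA-256 over a JSON payload are assumptions made for this example, not necessarily what the PR does.

```python
import hashlib
import json


def plan_cache_key(prompts: dict[str, str], prompt_version: str, collect_unexpected: bool) -> str:
    """Build a deterministic key from the inputs that shape the generated plan.

    Hypothetical helper: the name, prefix and hash function are assumptions.
    """
    payload = json.dumps(
        {
            "prompts": prompts,
            "prompt_version": prompt_version,
            "collect_unexpected": collect_unexpected,
        },
        sort_keys=True,  # dict ordering must not change the hash
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    return f"dq_plan_{digest[:16]}"


key_a = plan_cache_key({"row_check": "at least 1000 rows", "email_nulls": "few null emails"}, "v1", False)
key_b = plan_cache_key({"email_nulls": "few null emails", "row_check": "at least 1000 rows"}, "v1", False)
key_c = plan_cache_key({"row_check": "at least 1000 rows", "email_nulls": "few null emails"}, "v2", False)
```

With a key like this, reordering the prompts dict reuses the cached plan, while changing any prompt text, prompt_version, or collect_unexpected produces a new key and therefore a fresh LLM call.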
**Built-in validators**
- Factory functions like null_pct_check, row_count_check, unique_pct_check
are ready to use from dq_validation
- Each factory returns a callable that receives the raw scalar metric from
the SQL result and returns True (pass) or False (fail)
- Plain lambdas also work: "row_check": lambda v: v >= 1000
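To make the factory pattern concrete, here is a self-contained sketch. The functions below are stand-ins written for this example: they reuse the names null_pct_check and row_count_check, but the parameter names (`max_pct`, `min_count`) and the comparison directions are assumptions, not the signatures shipped in dq_validation.

```python
from typing import Callable

# A validator receives the raw scalar metric and returns True (pass) or False (fail).
Validator = Callable[[float], bool]


def null_pct_check(max_pct: float) -> Validator:
    """Stand-in factory: pass when the null percentage is at or below max_pct (assumed)."""
    def _check(value: float) -> bool:
        return value <= max_pct
    return _check


def row_count_check(min_count: int) -> Validator:
    """Stand-in factory: pass when the row count is at least min_count (assumed)."""
    def _check(value: float) -> bool:
        return value >= min_count
    return _check


# Factories and plain lambdas can be mixed in the same mapping.
validators: dict[str, Validator] = {
    "email_nulls": null_pct_check(max_pct=5.0),
    "min_rows": row_count_check(min_count=1000),
    "row_check": lambda v: v >= 1000,
}
```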
**Custom aggregate validators with register_validator**
- @register_validator("my_check") registers a factory function in the global
registry
- The registered validator works the same way as built-ins: receives a raw
scalar from the SQL SELECT column and returns a bool
- Useful for domain-specific aggregate thresholds that do not fit the
built-in factories
**Custom row-level validators with register_validator(row_level=True)**
- @register_validator("my_check", row_level=True) marks the validator as
row-level
- The LLM generates a plain SELECT for the column instead of an aggregate
query
- Each individual row value is passed to the validator callable; Airflow
counts how many rows fail
- Result is returned as a RowLevelResult with total, invalid, invalid_pct,
and sample_violations
- Pass/fail is decided by comparing invalid_pct against the validator's
_max_invalid_pct attribute
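The row-level flow above can be sketched as follows. The field names on RowLevelResult come from this comment; everything else is an assumption for illustration: the `evaluate_rows` helper, the `sample_size` default, treating invalid_pct as a 0-100 percentage, and passing when invalid_pct is at or below _max_invalid_pct.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Sequence


@dataclass
class RowLevelResult:
    total: int
    invalid: int
    invalid_pct: float  # assumed to be on a 0-100 scale
    sample_violations: list[Any] = field(default_factory=list)


def evaluate_rows(
    values: Sequence[Any],
    validator: Callable[[Any], bool],
    sample_size: int = 5,  # hypothetical cap on retained violations
) -> tuple[RowLevelResult, bool]:
    """Apply the validator to every row value and decide pass/fail."""
    violations = [v for v in values if not validator(v)]
    total = len(values)
    invalid_pct = 100.0 * len(violations) / total if total else 0.0
    result = RowLevelResult(total, len(violations), invalid_pct, violations[:sample_size])
    # Assumed rule: a missing threshold attribute means no invalid rows are tolerated.
    max_invalid_pct = getattr(validator, "_max_invalid_pct", 0.0)
    return result, invalid_pct <= max_invalid_pct


def looks_like_email(value: Any) -> bool:
    return isinstance(value, str) and "@" in value


looks_like_email._max_invalid_pct = 25.0  # tolerate up to 25% invalid rows

result, passed = evaluate_rows(["a@x.io", "bad", "b@y.io", "c@z.io"], looks_like_email)
```

Here one of four values fails, so invalid_pct is 25.0 and the check passes only because the threshold is inclusive in this sketch.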
**dry_run mode**
- Set dry_run=True to generate and cache the plan without running any SQL
- The task returns the serialised plan dict immediately, with no check
results
**require_approval — HITL integration**
- Set require_approval=True to gate SQL execution on human review
- After plan generation the task defers, surfacing the full plan as a
structured markdown review body in the HITL interface
- SQL checks run only after the reviewer approves; rejection raises
HITLRejectException
- dry_run=True takes precedence: combining both flags returns the plan dict
immediately without requesting approval (open question: is this the right
behaviour?)
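The interaction between the two flags, as currently described, can be summarised in a small decision function. `NextStep` and `next_step` are hypothetical names used only for this sketch; the real operator defers to the HITL interface instead of returning an enum.

```python
from enum import Enum


class NextStep(Enum):
    RETURN_PLAN = "return_plan"        # dry run: hand back the serialised plan
    AWAIT_APPROVAL = "await_approval"  # defer until a reviewer decides
    RUN_CHECKS = "run_checks"          # execute the generated SQL


def next_step(dry_run: bool, require_approval: bool) -> NextStep:
    """Decide what happens after plan generation (sketch of the described precedence)."""
    if dry_run:
        # dry_run wins even when require_approval is also set.
        return NextStep.RETURN_PLAN
    if require_approval:
        return NextStep.AWAIT_APPROVAL
    return NextStep.RUN_CHECKS
```

Checking dry_run first is what makes the combination of both flags skip the approval request; swapping the two branches would instead ask for review of a plan that is never executed.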
**Unexpected row collection**
- Set collect_unexpected=True to have the LLM also generate a detail query
alongside each validity/format check
- When a check fails the detail query runs and attaches up to
unexpected_sample_size violating rows to the report
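A minimal sketch of the sampling step, using an in-memory SQLite database so it runs standalone. The helper `collect_unexpected_rows`, the table, and the detail query are invented for this example; in the operator the detail query would come from the LLM-generated plan.

```python
import sqlite3


def collect_unexpected_rows(
    conn: sqlite3.Connection,
    detail_sql: str,
    passed: bool,
    unexpected_sample_size: int,
) -> list[tuple]:
    """Run the detail query only for a failed check and cap the sample size."""
    if passed:
        return []
    cursor = conn.execute(detail_sql)
    return cursor.fetchmany(unexpected_sample_size)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "a@x.io"), (2, "bad"), (3, "b@y.io"), (4, "worse")],
)

# Example detail query for a format check on the email column.
detail_sql = "SELECT id, email FROM customers WHERE email NOT LIKE '%@%' ORDER BY id"

sample = collect_unexpected_rows(conn, detail_sql, passed=False, unexpected_sample_size=1)
skipped = collect_unexpected_rows(conn, detail_sql, passed=True, unexpected_sample_size=1)
```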
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.