cetingokhan commented on PR #62963:
URL: https://github.com/apache/airflow/pull/62963#issuecomment-4187884616

   > I like the overall direction here, but I wonder if the long-term authoring 
model should be more config-driven than Python-driven.
   > 
   > Instead of requiring DAG authors to define many checks as Python 
validators/factories, could the LLM generate a simple JSON/YAML rule spec from 
the natural-language prompts, and then have the operator execute that spec with 
a deterministic engine?
   > 
   > An example is Qualink: https://github.com/gopidesupavan/qualink and 
https://gopidesupavan.github.io/qualink/guide/yaml-config/
   > 
   > I'm not sure, but Great Expectations may have something similar.
   > 
   > That way, the LLM generates rules from the prompts and we can simply 
execute them?
   > 
   > I do think a config-first layer for common checks may scale better than 
growing the Python validator API. Python validators could still remain as an 
escape hatch for advanced/custom cases.
   > 
   > If you know of other config-driven tools that make it easy for LLMs to 
generate rules we can then execute, please suggest them. WDYT?
   > 
   > On a side note, at my org we have started discussing the use of `qualink` 
(I developed it 😄, so I'm not insisting on it). It performs nicely: the LLM 
generates rules that can be executed directly against object stores.
   
   Hi @gopidesupavan,
   
   Thanks for the feedback! I’ve spent some time thinking about the 
config-driven approach versus the current implementation. I’ve checked out 
Qualink as well—congratulations on that, using DataFusion is a very smart move 
for performance.
   
   Regarding the operator's direction, I actually started developing this by 
following your comment on the AIP-99 proposal, where you mentioned that users 
would express requirements in a prompt, and we would generate queries, 
translate them into internal rules, and execute them. That’s why I included the 
execution logic directly—to provide that seamless, all-in-one experience. 
   
   In my previous LLM-based DQ projects, I've found that forcing an LLM to 
generate complex YAML or JSON schemas often leads to unexpected hallucinations. 
Moreover, while dealing with large-scale data, I had to rely on Spark 
DataFrames via Great Expectations (GEX) to handle the load, and as the rule set 
grows, it becomes increasingly difficult for the LLM to map natural language to 
the exact parameters of a specific tool's config. Generating SQL, on the other 
hand, is generally more deterministic and easier for the LLM to handle 
consistently across different scales.
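   To make the SQL-first idea concrete, here is a minimal, hypothetical sketch (not the operator's actual code; the table, the generated query, and the `run_sql_check` helper are all illustrative) of how an LLM-generated SQL check can be executed deterministically:

```python
import sqlite3

# Hypothetical example: an LLM translates "no user may have a null email"
# into a plain SQL count query; the operator only has to run the query
# and compare the result against a violation budget -- no tool-specific
# config schema for the model to get wrong.
def run_sql_check(conn, check_sql, max_violations=0):
    """Return True when the number of violating rows is within budget."""
    (violations,) = conn.execute(check_sql).fetchone()
    return violations <= max_violations

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, None), (3, "c@example.com")])

# SQL a model might emit for the prompt above
check_sql = "SELECT COUNT(*) FROM users WHERE email IS NULL"
print(run_sql_check(conn, check_sql))  # one row violates -> prints False
```

   Because the check is just a count query plus a threshold, the same pattern works against any SQL-speaking engine the operator's hook can reach.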
   
   I believe this operator should stay lean and accessible as a "first-step" 
tool. By focusing on SQL generation, we keep it engine-agnostic and reduce the 
barrier to entry for users who don't want to manage a secondary DQ platform 
immediately. If a user’s needs grow to an enterprise scale where they require 
complex rule management, they might be better served by a dedicated provider or 
a specialized engine. 
   
   For this specific PR, my suggestion is to keep the initial version simple 
and focused on high-reliability SQL generation, aligning with the execution 
flow proposed in AIP-99. We could potentially add an "export" feature later to 
bridge it with tools like Qualink, SODA, or GEX, but I think starting with a 
deterministic, SQL-first approach will give Airflow users a much more stable 
out-of-the-box experience, letting them meet most of their needs directly while 
still being able to add minor custom check extensions for simpler requirements. 
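   As a rough illustration of what such an export bridge might look like (the field names and the `export_rules` helper are purely hypothetical, not the real Qualink, SODA, or GEX schemas): if the operator keeps its checks as plain data, serializing them into a tool-agnostic spec later is straightforward.

```python
# Hypothetical sketch of the "export" idea: internal SQL checks are kept
# as plain dictionaries, so they can be rendered into a config spec that
# a deterministic external engine could consume. Illustrative only.
def export_rules(checks):
    """Render internal SQL checks as a tool-agnostic rule spec."""
    return {
        "version": 1,
        "rules": [
            {
                "name": check["name"],
                "sql": check["sql"],
                "max_violations": check.get("max_violations", 0),
            }
            for check in checks
        ],
    }

checks = [
    {"name": "email_not_null",
     "sql": "SELECT COUNT(*) FROM users WHERE email IS NULL"},
]
spec = export_rules(checks)
print(spec["rules"][0]["name"])  # prints email_not_null
```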
   
   I'm ready to pivot the development based on your final call. Since you are 
leading the direction here, I'll follow your guidance—please let me know how 
you’d like me to proceed, and I’ll be happy to implement it quickly. :)
   
   Regarding Qualink, I actually think it has great potential for big data 
scales; for instance, integrating DataFusion-Comet could be a game-changer for 
Spark environments, and I’d honestly love to support you on that front in the 
future!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
