cetingokhan commented on PR #62963: URL: https://github.com/apache/airflow/pull/62963#issuecomment-4187884616
> I like the overall direction here, but I wonder if the long-term authoring model should be more config-driven than Python-driven.
>
> Instead of requiring DAG authors to define many checks as Python validators/factories, could the LLM generate a simple JSON/YAML rule spec from the natural-language prompts, and then have the operator execute that spec with a deterministic engine?
>
> An example is qualink: https://github.com/gopidesupavan/qualink and https://gopidesupavan.github.io/qualink/guide/yaml-config/. I'm not sure of the details, but Great Expectations may have something similar.
>
> So, based on the prompts, the LLM generates rules and we simply execute them?
>
> I do think a config-first layer for common checks may scale better than growing the Python validator API. Python validators could still remain as an escape hatch for advanced/custom cases.
>
> If you know of any other config-driven tools, please suggest them; a config-driven format makes it much easier for LLMs to generate rules that we then execute. WDYT?
>
> On a side note, at my org we have started discussing using `qualink` (I developed it 😄, so I'm not insisting we use it). It performs nicely: the LLM generates rules, and they can be executed directly on the object stores.

Hi @gopidesupavan,

Thanks for the feedback! I’ve spent some time thinking about the config-driven approach versus the current implementation. I’ve also checked out Qualink; congratulations on it, and using DataFusion is a very smart move for performance.

Regarding the operator's direction, I actually started developing this by following your comment on the AIP-99 proposal, where you mentioned that users would express requirements in a prompt, and we would generate queries, translate them into internal rules, and execute them. That’s why I included the execution logic directly: to provide that seamless, all-in-one experience.
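To make the config-driven idea concrete, here is a minimal sketch of what a deterministic rule engine could look like. This is purely illustrative: the rule-spec shape, the `rule_to_sql` helper, and the check names are hypothetical and do not reflect Qualink's or the operator's actual API; SQLite stands in for whatever engine would run the checks.

```python
# Hypothetical sketch: a deterministic engine that executes a simple
# LLM-generated rule spec by translating each rule into a SQL check.
# Rule names and structure are invented for illustration only.
import sqlite3

# A rule spec the LLM might emit instead of free-form Python or SQL.
RULES = [
    {"check": "not_null", "column": "order_id"},
    {"check": "min", "column": "amount", "value": 0},
]

def rule_to_sql(table: str, rule: dict) -> str:
    """Translate one rule into a COUNT of violating rows."""
    if rule["check"] == "not_null":
        return f"SELECT COUNT(*) FROM {table} WHERE {rule['column']} IS NULL"
    if rule["check"] == "min":
        return (f"SELECT COUNT(*) FROM {table} "
                f"WHERE {rule['column']} < {rule['value']}")
    raise ValueError(f"unknown check: {rule['check']}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, 10.0), (2, 5.5), (None, 3.0)])

failures = {}
for rule in RULES:
    violating = conn.execute(rule_to_sql("orders", rule)).fetchone()[0]
    if violating:
        failures[f"{rule['column']}:{rule['check']}"] = violating

print(failures)  # the not_null rule catches the row with a NULL order_id
```

The appeal of this shape is that the LLM only has to fill in a small, closed vocabulary of checks, while the SQL translation stays fully deterministic.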
In my previous LLM-based DQ projects, I’ve found that forcing an LLM to generate complex YAML or JSON schemas often leads to unexpected hallucinations. Moreover, while dealing with large-scale data, I had to rely on Spark DataFrames via Great Expectations (GEX) to handle the load, and as the rule set grows, it becomes increasingly difficult for the LLM to map natural language to the exact parameters of a specific tool's config. Generating SQL, on the other hand, is generally more deterministic and reliable for the LLM to handle consistently across different scales.

I believe this operator should stay lean and accessible as a "first-step" tool. By focusing on SQL generation, we keep it engine-agnostic and reduce the barrier to entry for users who don't want to manage a secondary DQ platform immediately. If a user’s needs grow to an enterprise scale where they require complex rule management, they may be better served by a dedicated provider or a specialized engine.

For this specific PR, my suggestion is to keep the initial version simple and focused on high-reliability SQL generation, aligning with the execution flow proposed in AIP-99. We could potentially add an "export" feature later to bridge it with tools like Qualink, SODA, or GEX, but I think starting with a deterministic, SQL-first approach will provide a much more stable experience for Airflow users right out of the box, allowing them to meet most of their needs directly while still being able to add minor custom check extensions for simpler requirements.

I'm ready to pivot the development based on your final call. Since you are leading the direction here, I'll follow your guidance; please let me know how you’d like me to proceed, and I’ll be happy to implement it quickly.
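For contrast, the SQL-first flow I'm describing would look roughly like the sketch below. The LLM call is stubbed with a canned response, and `generate_check_sql` is a hypothetical name, not the operator's actual API; the point is only that the operator's job reduces to executing one generated query and interpreting a single count.

```python
# Hedged sketch of the SQL-first flow: prompt in, one SQL check out,
# executed deterministically. The LLM is stubbed; in a real operator
# this would be a model call producing the query.
import sqlite3

def generate_check_sql(prompt: str) -> str:
    """Stand-in for the LLM: map a natural-language requirement to SQL."""
    canned = {
        "no duplicate emails": (
            "SELECT COUNT(*) FROM (SELECT email FROM users "
            "GROUP BY email HAVING COUNT(*) > 1)"
        ),
    }
    return canned[prompt]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "[email protected]"), (2, "[email protected]"), (3, "[email protected]")])

sql = generate_check_sql("no duplicate emails")
violations = conn.execute(sql).fetchone()[0]
print("check passed" if violations == 0 else f"{violations} violation(s)")
```

Because the generated artifact is plain SQL, the same check runs unchanged on any engine the user already has a connection to, which is what keeps the operator engine-agnostic.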
:) Regarding Qualink, I actually think it has great potential at big data scales; for instance, integrating DataFusion-Comet could be a game-changer for Spark environments, and I’d honestly love to support you on that front in the future!

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
