I would break up the validator in 3 PRs with no functionality change: extract SqlQueryScopes and NamespaceBuilder, extract the SqlQueryScopePopulator, and final PR for extracting SqlNodeValidator. Then I would apply functionalchanges on top.
On Fri, Aug 23, 2024 at 2:02 PM Mihai Budiu <mbu...@gmail.com> wrote: > So do you want to refactor the code in two steps, where the first step > would break the validator into multiple classes and keep the exact same > behavior, and in a second step rework the handling of correlation variables? > > That sounds great. If you want to do both at the same time, it may be > harder to review. > > Mihai > > ________________________________ > From: James Starr <jamesst...@gmail.com> > Sent: Friday, August 23, 2024 12:26 PM > To: dev@calcite.apache.org <dev@calcite.apache.org> > Subject: Breaking up SqlValidatorImpl > > I am working on rebasing my for pull request ( > https://github.com/apache/calcite/pull/3042), for CALCITE-5418( > https://issues.apache.org/jira/browse/CALCITE-5418) which stores the > correlated variable ids subquery instead of the RelNodes(join, project, > filter). This change was briefly discussed here( > https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth). > > Storing the correlated IDs directly on the subquery provides several > benefits: > > - The RelNode interface would no longer need to expose correlated IDs. > - There are several existing bugs with how correlated IDs are populated > in SqlToRelConverter, particularly for Project and Join operations. > - Rules and other rewrite logic for Project, Filter, and Join would no > longer need to propagate correlated IDs. > - Various edge cases could be removed from SqlToRelConverter. > - The code handling correlated variables would be localized to > subqueries and lateral joins, instead of being spread across the entire > codebase whereever expressions can occur. > > > However, since I started this work, there have been several changes to > SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were not > overly complex, they were time-consuming due to the unwieldy nature of > SqlValidatorImpl. Specifically, I’ve been working on populating a map of > lateral joins to linked lists of lateral scopes. > > SqlValidatorImpl currently performs three main functions: > 1. Building a set of data structures for resolving scopes. > 2. Validates the query and resolving types. This requires scope data > structure generated in 1. > 3. Exposes the scoped and type data structure to SqlToRelConverter. > > Currently, SqlValidatorImpl is 7,921 lines long. The first step consists of > 11 methods, each with up to 9 arguments, that recursively walk the tree. > These methods alone account for nearly 2,000 lines. The data structures for > resolving scopes, along with helper functions/methods, take up about 500 > lines. The remaining validation logic comprises over 4,000 lines. The > combination of recursive methods with high argument counts, modifying the > state of SqlValidatorImpl, all within a single file, makes working with > SqlValidatorImpl cumbersome. > > I propose splitting SqlValidatorImpl into several smaller objects. While > the recursive algorithms with high arguments would still exist, separating > them into their own file would allow for a more focused approach. Here’s > how I suggest breaking it down: > > 1. *SqlValidatorImpl* - A facade to maintain the existing API and > orchestrate the composition of the objects below. > 2. *SqlQueryScopes* - Contains the data structures for mapping SqlNodes > to scopes and types, along with relevant helper methods. > 3. *SqlNodeValidator* - Contains most of the methods currently prefixed > with validate. > 4. *SqlQueryScopePopulator* - Contains the methods prefixed with > register > . > 5. *NamespaceBuilder* - Formalizes an implicit API for overriding > namespace creation. > > There would be a circular reference between NamespaceBuilder and > SqlValidatorImpl, as there already exists a circular reference between > SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes > would help clarify the interactions between SqlNodeValidator, > SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of > factory methods for components 2-5 would be added to SqlValidatorConfig to > allow them to be overridden. > > Looking forward to your thoughts on this approach. > James >