I would break up the validator in 3 PRs with no functionality change:
extract SqlQueryScopes and NamespaceBuilder, extract the
SqlQueryScopePopulator, and final PR for extracting SqlNodeValidator.  Then
I would apply functionalchanges on top.

On Fri, Aug 23, 2024 at 2:02 PM Mihai Budiu <mbu...@gmail.com> wrote:

> So do you want to refactor the code in two steps, where the first step
> would break the validator into multiple classes and keep the exact same
> behavior, and in a second step rework the handling of correlation variables?
>
> That sounds great. If you want to do both at the same time, it may be
> harder to review.
>
> Mihai
>
> ________________________________
> From: James Starr <jamesst...@gmail.com>
> Sent: Friday, August 23, 2024 12:26 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: Breaking up SqlValidatorImpl
>
> I am working on rebasing my for pull request (
> https://github.com/apache/calcite/pull/3042), for CALCITE-5418(
> https://issues.apache.org/jira/browse/CALCITE-5418) which stores the
> correlated variable ids subquery instead of the RelNodes(join, project,
> filter).  This change was briefly discussed here(
> https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth).
>
> Storing the correlated IDs directly on the subquery provides several
> benefits:
>
>    - The RelNode interface would no longer need to expose correlated IDs.
>    - There are several existing bugs with how correlated IDs are populated
>    in SqlToRelConverter, particularly for Project and Join operations.
>    - Rules and other rewrite logic for Project, Filter, and Join would no
>    longer need to propagate correlated IDs.
>    - Various edge cases could be removed from SqlToRelConverter.
>    - The code handling correlated variables would be localized to
>    subqueries and lateral joins, instead of being spread across the entire
>    codebase whereever expressions can occur.
>
>
> However, since I started this work, there have been several changes to
> SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were not
> overly complex, they were time-consuming due to the unwieldy nature of
> SqlValidatorImpl. Specifically, I’ve been working on populating a map of
> lateral joins to linked lists of lateral scopes.
>
> SqlValidatorImpl currently performs three main functions:
> 1.  Building a set of data structures for resolving scopes.
> 2.  Validates the query and resolving types.  This requires scope data
> structure generated in 1.
> 3.  Exposes the scoped and type data structure to SqlToRelConverter.
>
> Currently, SqlValidatorImpl is 7,921 lines long. The first step consists of
> 11 methods, each with up to 9 arguments, that recursively walk the tree.
> These methods alone account for nearly 2,000 lines. The data structures for
> resolving scopes, along with helper functions/methods, take up about 500
> lines. The remaining validation logic comprises over 4,000 lines. The
> combination of recursive methods with high argument counts, modifying the
> state of SqlValidatorImpl, all within a single file, makes working with
> SqlValidatorImpl cumbersome.
>
> I propose splitting SqlValidatorImpl into several smaller objects. While
> the recursive algorithms with high arguments would still exist, separating
> them into their own file would allow for a more focused approach. Here’s
> how I suggest breaking it down:
>
>    1. *SqlValidatorImpl* - A facade to maintain the existing API and
>    orchestrate the composition of the objects below.
>    2. *SqlQueryScopes* - Contains the data structures for mapping SqlNodes
>    to scopes and types, along with relevant helper methods.
>    3. *SqlNodeValidator* - Contains most of the methods currently prefixed
>    with validate.
>    4. *SqlQueryScopePopulator* - Contains the methods prefixed with
> register
>    .
>    5. *NamespaceBuilder* - Formalizes an implicit API for overriding
>    namespace creation.
>
> There would be a circular reference between NamespaceBuilder and
> SqlValidatorImpl, as there already exists a circular reference between
> SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes
> would help clarify the interactions between SqlNodeValidator,
> SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of
> factory methods for components 2-5 would be added to SqlValidatorConfig to
> allow them to be overridden.
>
> Looking forward to your thoughts on this approach.
> James
>

Reply via email to