Re: Breaking up SqlValidatorImpl

Mihai Budiu Fri, 23 Aug 2024 17:42:31 -0700

One thing I find unpleasant about the current validator implementation is that 
it leaves many implicit casts implicit. It would be much simpler for many 
subsequent passes if the validator would insert explicit casts where casts are 
needed (e.g., when adding a varchar and an int), and it would catch many bugs 
earlier. Subsequent rewrite passes also have to make assumptions about the 
implicit casts are that are missing, and they may make different assumptions 
from the validator. I have this problem all the time in the code generator.


But this is somewhat independent on your current proposal. I hope that someday 
this could be fixed too.

Mihai
________________________________
From: James Starr <jamesst...@gmail.com>
Sent: Friday, August 23, 2024 2:40 PM
To: dev@calcite.apache.org <dev@calcite.apache.org>
Subject: Re: Breaking up SqlValidatorImpl

I would break up the validator in 3 PRs with no functionality change:
extract SqlQueryScopes and NamespaceBuilder, extract the
SqlQueryScopePopulator, and final PR for extracting SqlNodeValidator.  Then
I would apply functionalchanges on top.

On Fri, Aug 23, 2024 at 2:02 PM Mihai Budiu <mbu...@gmail.com> wrote:

> So do you want to refactor the code in two steps, where the first step
> would break the validator into multiple classes and keep the exact same
> behavior, and in a second step rework the handling of correlation variables?
>
> That sounds great. If you want to do both at the same time, it may be
> harder to review.
>
> Mihai
>
> ________________________________
> From: James Starr <jamesst...@gmail.com>
> Sent: Friday, August 23, 2024 12:26 PM
> To: dev@calcite.apache.org <dev@calcite.apache.org>
> Subject: Breaking up SqlValidatorImpl
>
> I am working on rebasing my for pull request (
> https://github.com/apache/calcite/pull/3042), for CALCITE-5418(
> https://issues.apache.org/jira/browse/CALCITE-5418) which stores the
> correlated variable ids subquery instead of the RelNodes(join, project,
> filter).  This change was briefly discussed here(
> https://lists.apache.org/thread/fvf35njbddcck9hwcqnnxowl49n4nkth).
>
> Storing the correlated IDs directly on the subquery provides several
> benefits:
>
>    - The RelNode interface would no longer need to expose correlated IDs.
>    - There are several existing bugs with how correlated IDs are populated
>    in SqlToRelConverter, particularly for Project and Join operations.
>    - Rules and other rewrite logic for Project, Filter, and Join would no
>    longer need to propagate correlated IDs.
>    - Various edge cases could be removed from SqlToRelConverter.
>    - The code handling correlated variables would be localized to
>    subqueries and lateral joins, instead of being spread across the entire
>    codebase whereever expressions can occur.
>
>
> However, since I started this work, there have been several changes to
> SqlValidatorImpl. Although my initial changes to SqlValidatorImpl were not
> overly complex, they were time-consuming due to the unwieldy nature of
> SqlValidatorImpl. Specifically, I’ve been working on populating a map of
> lateral joins to linked lists of lateral scopes.
>
> SqlValidatorImpl currently performs three main functions:
> 1.  Building a set of data structures for resolving scopes.
> 2.  Validates the query and resolving types.  This requires scope data
> structure generated in 1.
> 3.  Exposes the scoped and type data structure to SqlToRelConverter.
>
> Currently, SqlValidatorImpl is 7,921 lines long. The first step consists of
> 11 methods, each with up to 9 arguments, that recursively walk the tree.
> These methods alone account for nearly 2,000 lines. The data structures for
> resolving scopes, along with helper functions/methods, take up about 500
> lines. The remaining validation logic comprises over 4,000 lines. The
> combination of recursive methods with high argument counts, modifying the
> state of SqlValidatorImpl, all within a single file, makes working with
> SqlValidatorImpl cumbersome.
>
> I propose splitting SqlValidatorImpl into several smaller objects. While
> the recursive algorithms with high arguments would still exist, separating
> them into their own file would allow for a more focused approach. Here’s
> how I suggest breaking it down:
>
>    1. *SqlValidatorImpl* - A facade to maintain the existing API and
>    orchestrate the composition of the objects below.
>    2. *SqlQueryScopes* - Contains the data structures for mapping SqlNodes
>    to scopes and types, along with relevant helper methods.
>    3. *SqlNodeValidator* - Contains most of the methods currently prefixed
>    with validate.
>    4. *SqlQueryScopePopulator* - Contains the methods prefixed with
> register
>    .
>    5. *NamespaceBuilder* - Formalizes an implicit API for overriding
>    namespace creation.
>
> There would be a circular reference between NamespaceBuilder and
> SqlValidatorImpl, as there already exists a circular reference between
> SqlValidatorImpl and all namespace objects. Introducing SqlQueryScopes
> would help clarify the interactions between SqlNodeValidator,
> SqlQueryScopePopulator, and SqlToRelConverter. Additionally, a set of
> factory methods for components 2-5 would be added to SqlValidatorConfig to
> allow them to be overridden.
>
> Looking forward to your thoughts on this approach.
> James
>

Re: Breaking up SqlValidatorImpl

Reply via email to