[Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Vladimir Golubev Fri, 09 Aug 2024 07:48:36 -0700

Hello All,

I recently did some research in the Catalyst Analyzer area to check if it’s
possible to make it single-pass instead of fixed-point. Despite the
flexibility of the current fixed-point approach (new functionality - new
rule), it has some drawbacks. The dependencies between the rules are
unobvious, so it’s hard to introduce changes without having the full
knowledge. By modifying one rule, the whole chain of transformations can
change in an unobvious way. Since we can hit the maximum number of
iterations, there’s no guarantee that the plan is going to be resolved. And
from a performance perspective the Analyzer can be quite slow for large
logical plans and wide tables.


The idea is to resolve the tree in a single-pass post-order traversal. This
approach can be beneficial for the following reasons:

   -

   Code can rely on child nodes being resolved
   -

   Name visibility can be controlled by a scope, which would be maintained
   during the traversal
   -

   Random in-tree lookups are disallowed during the traversal, so the
   resolution algorithm would be linear with respect to the parsed tree size
   -

   Analyzer would deterministically produce resolved logical plan or a
   descriptive error


The prototype shows that it’s possible to do a bottom-up Analyzer
implementation and reuse a large chunk of node-processing code from rule
bodies. I did the performance benchmark on two cases where the current
Analyzer struggles the most - wide nested views and large logical plans and
got 7-10x performance speedups.

Is this something that would be interesting for the Spark community? What
would be the best way to proceed with this idea?

Regards,

Vladimir Golubev.

[Spark SQL] A single-pass resolution approach for the Catalyst Analyzer

Reply via email to