Hi Calcite community,

I'm exploring using the RelNode/RexNode trees to answer questions about a
query. One of the questions I want to ask is "which columns are used by
functions in a query's projections?". Naively, I would like to just walk the
RelNode/RexNode trees and collect column origins when I'm at specific
RexCalls, but I've noticed that Calcite generates superfluous Project nodes
when using SqlToRelConverter, which throws a wrench into my original approach
to analyzing the query.
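
Concretely, the kind of walk I have in mind is roughly the sketch below
(untested, and the class/method names are just placeholders): a RelVisitor
that stops at Project nodes, looks at RexCalls in their expressions, and uses
RelOptUtil.InputFinder plus RelMetadataQuery.getColumnOrigins to trace the
referenced columns back to base tables.

import java.util.LinkedHashSet;
import java.util.Set;

import org.apache.calcite.plan.RelOptUtil;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelVisitor;
import org.apache.calcite.rel.core.Project;
import org.apache.calcite.rel.metadata.RelColumnOrigin;
import org.apache.calcite.rel.metadata.RelMetadataQuery;
import org.apache.calcite.rex.RexCall;
import org.apache.calcite.rex.RexNode;

/** Rough sketch: records the base-table origins of columns that appear
 * inside function calls (RexCalls) in Project expressions. */
public class FunctionColumnOrigins extends RelVisitor {
  private final Set<RelColumnOrigin> origins = new LinkedHashSet<>();

  @Override public void visit(RelNode node, int ordinal, RelNode parent) {
    if (node instanceof Project) {
      Project project = (Project) node;
      RelNode input = project.getInput();
      RelMetadataQuery mq = node.getCluster().getMetadataQuery();
      for (RexNode expr : project.getProjects()) {
        if (expr instanceof RexCall) {
          // For every input column the call references, ask the metadata
          // query where that column ultimately comes from.
          for (int i : RelOptUtil.InputFinder.bits(expr)) {
            Set<RelColumnOrigin> callOrigins = mq.getColumnOrigins(input, i);
            if (callOrigins != null) {
              origins.addAll(callOrigins);
            }
          }
        }
      }
    }
    super.visit(node, ordinal, parent);
  }

  public Set<RelColumnOrigin> getOrigins() {
    return origins;
  }
}

I would run it with something like new FunctionColumnOrigins().go(root.rel).
The trouble is that it visits every Project, including the ones Calcite
injected, which leads to the ambiguity described below.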

For example, a query I might want to interrogate (in practice the queries
are much more complex) is:

SELECT
    1
FROM
    employees e
GROUP BY
    uuid
HAVING
    AVG(age + 5) > 3

The query generates a logical plan in Calcite that looks like the following.
The issue with this plan is that if I blindly look at it, I do not know
whether a given LogicalProject corresponds to a SELECT in the original query
or was injected by Calcite to create a better plan. I would ideally want
something that models the original query more closely. I could use the
SqlNode tree, but I thought that the RelNode tree would be a better place to
analyze the query in this manner.

LogicalProject(EXPR$0=[1])
  LogicalFilter(condition=[>($1, 3)])
    LogicalAggregate(group=[{0}], agg#0=[AVG($1)])
      LogicalProject(uuid=[$0], $f1=[+($2, 5)])
        LogicalTableScan(table=[[employees]])
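
For reference, I'm producing that plan roughly like this (a sketch using the
Frameworks planner API; the registration of the employees table and error
handling are omitted):

import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.RelRoot;
import org.apache.calcite.schema.SchemaPlus;
import org.apache.calcite.sql.SqlNode;
import org.apache.calcite.tools.FrameworkConfig;
import org.apache.calcite.tools.Frameworks;
import org.apache.calcite.tools.Planner;

public final class PlanExample {
  /** Parses, validates, and converts a SQL string to the logical plan above. */
  public static RelNode toLogicalPlan(String sql) throws Exception {
    SchemaPlus rootSchema = Frameworks.createRootSchema(true);
    // ... register the employees table on rootSchema (omitted) ...
    FrameworkConfig config = Frameworks.newConfigBuilder()
        .defaultSchema(rootSchema)
        .build();
    Planner planner = Frameworks.getPlanner(config);
    SqlNode parsed = planner.parse(sql);
    SqlNode validated = planner.validate(parsed);
    RelRoot root = planner.rel(validated);
    return root.rel;
  }
}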

In the previous plan, the second project is (at least by my understanding)
superfluous: it was not written in the original query, and an alternative
plan could be generated that does not use the extra project.

LogicalProject(EXPR$0=[1])
  LogicalFilter(condition=[>($1, 3)])
    LogicalAggregate(group=[{0}], agg#0=[AVG(+($2, 5))])
      LogicalTableScan(table=[[employees]])

When I take a look at a plan created by Apache Spark for the same query, I
get a plan that is closer to how the original query was written. Some values
are repeated in the plan (at least in the explanation), but no values are
projected that weren't present in the original query. To me, this would be a
more desirable plan to analyze, since as the developer I have to do less work
to understand where each node in the plan came from; i.e. I'm not guessing
whether a project in the plan is Spark's or mine.

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [1 AS 1#6]
   +- Filter (isnotnull(avg((age#2 + 5))#8) AND (avg((age#2 + 5))#8 > 3.0))
      +- HashAggregate(keys=[uuid#0], functions=[avg((age#2 + 5))])
         +- Exchange hashpartitioning(uuid#0, 200), ENSURE_REQUIREMENTS, [plan_id=22]
            +- HashAggregate(keys=[uuid#0], functions=[partial_avg((age#2 + 5))])
               +- LocalTableScan [uuid#0, age#2]


I have two questions from my experiments:
1. Is my approach of walking the tree on the right track, and is there a way
to model this strategy using Calcite primitives that I might have missed?
I've played around with CoreRules a bit (a sketch of what I tried is below)
but haven't made significant progress in that area.
2. Is there a way to generate a plan closer to the Spark version, or are
there other community efforts that might point me in the right direction
for achieving the desired behavior?
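
For context on question 1, the rule-based experiment I tried looks roughly
like the sketch below (class/method names are placeholders). It merges and
removes adjacent projects, but it leaves the project under the
LogicalAggregate in place:

import org.apache.calcite.plan.hep.HepPlanner;
import org.apache.calcite.plan.hep.HepProgram;
import org.apache.calcite.plan.hep.HepProgramBuilder;
import org.apache.calcite.rel.RelNode;
import org.apache.calcite.rel.rules.CoreRules;

public final class ProjectCleanup {
  /** Runs a HepPlanner pass that merges/removes adjacent Project nodes. */
  public static RelNode mergeProjects(RelNode logicalPlan) {
    HepProgram program = new HepProgramBuilder()
        .addRuleInstance(CoreRules.PROJECT_MERGE)
        .addRuleInstance(CoreRules.PROJECT_REMOVE)
        .build();
    HepPlanner planner = new HepPlanner(program);
    planner.setRoot(logicalPlan);
    return planner.findBestExp();
  }
}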

Thanks,
Logan
