[Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Gramana, Zachary (GE Digital) Fri, 11 Mar 2022 17:27:11 -0800

Hello all!

I would appreciate some help/guidance with troubleshooting a big performance 
regression we saw on our UNION queries after updating our application from 1.24 
to 1.29. Many of the tables in the queries are wide (> 50 columns) so we use a 
custom push-down Project class. These custom Projects are being eliminated in 
1.29 UNION queries--though not when replacing UNION with UNION ALL.


The examples below illustrate minimal queries that reproduce the issue our test 
database.

When compared to the UNION query times using 1.24, the prepared UNION 
statements take 2-2.5 times longer in 1.29. The prepare time is around 2 times 
longer as well.

In the examples below, the UNION and UNION ALL queries generate the same result 
set in our test database. All Calcite tables have default collations 
("ISO-8859-1$en_US"). Several are rather wide, with many > 50 columns.

In 1.29 for a simple query, both UNION and UNION ALL produce the same query 
plan, with the exception of EnumerableUnion's `all` parameter:

1.29
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION ALL
SELECT Id FROM t2
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[true])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t2]])

1.29
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT Id FROM t2
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t2]])

The next examples add a JOIN to one of the tables in order to make this a 
slightly more complicated query and trigger the issue. Unlike in 1.24, the 
plans in 1.29 for UNION and UNION ALL now vary significantly:

1.24 (like the above queries,  the UNION and UNION ALL plans differ only by the 
`all` parameter, so the UNION ALL plan for 1.24 is omitted)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT 
NULL], t3_Id0=[$t263])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyProject(Id=[$0])
            MyTableScan(table=[[MySchema, t3]])

1.29 (note this UNION ALL plan is nearly identical plan to the above `UNION` 
plan in 1.24)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION ALL
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[true])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT 
NULL], t3_Id0=[$t263])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyProject(Id=[$0])
            MyTableScan(table=[[MySchema, t3]])

1.29 (note the loss of the `MyProject` node above t3's MyTableScan, and now the 
inputs for the post-MergeJoin EnumerableCalc node are very wide compared to 
before)
EXPLAIN PLAN FOR
SELECT Id FROM t1
UNION
SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
-----------------------------------------------------------------------------------------------------
EnumerableUnion(all=[false])
  MyEnumerableConverter
    MyProject(Id=[$0])
      MyTableScan(table=[[MySchema, t1]])
  EnumerableCalc(expr#0..402=[{inputs}], Id=[$t264])
    EnumerableMergeJoin(condition=[=($263, $264)], joinType=[inner])
      EnumerableSort(sort0=[$263], dir0=[ASC])
        EnumerableCalc(expr#0..262=[{inputs}], expr#263=[CAST($t178):BIGINT NOT 
NULL], proj#0..263=[{exprs}])
          MyEnumerableConverter
            MyTableScan(table=[[MySchema, t2]])
      EnumerableSort(sort0=[$0], dir0=[ASC])
        MyEnumerableConverter
          MyTableScan(table=[[MySchema, t3]])

I've tried removing various rules and even restricting them, but the custom 
Projects are still removed in the EnumerableMergeJoin despite setting the 
self-cost to zero.

Am I missing something? Any suggestions on where to look next?

Thanks for any assistance/tips.

Best,
Zack

[Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Reply via email to