Re: [Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Gramana, Zachary (GE Digital) Thu, 17 Mar 2022 18:09:16 -0700

I just sent a PR with a test case: https://github.com/apache/calcite/pull/2747


I will file a new JIRA report tomorrow.

Regards,
Zack

On 3/11/22, 10:36 PM, "Haisheng Yuan" <[email protected]> wrote:

    Hi Zack,

    Looks like it is a regression.
    Are you able to provide a reproducible test case? You can log a JIRA along 
with the test case, so people can do the root cause analysis.

    Thanks,
    Haisheng Yuan

    On 2022/03/12 01:26:53 "Gramana, Zachary (GE Digital)" wrote:
    > Hello all!
    > 
    > I would appreciate some help/guidance with troubleshooting a big 
performance regression we saw on our UNION queries after updating our 
application from 1.24 to 1.29. Many of the tables in the queries are wide (> 50 
columns) so we use a custom push-down Project class. These custom Projects are 
being eliminated in 1.29 UNION queries--though not when replacing UNION with 
UNION ALL.
    > 
    > The examples below illustrate minimal queries that reproduce the issue 
our test database.
    > 
    > When compared to the UNION query times using 1.24, the prepared UNION 
statements take 2-2.5 times longer in 1.29. The prepare time is around 2 times 
longer as well.
    > 
    > In the examples below, the UNION and UNION ALL queries generate the same 
result set in our test database. All Calcite tables have default collations 
("ISO-8859-1$en_US"). Several are rather wide, with many > 50 columns.
    > 
    > In 1.29 for a simple query, both UNION and UNION ALL produce the same 
query plan, with the exception of EnumerableUnion's `all` parameter:
    > 
    > 1.29
    > EXPLAIN PLAN FOR
    > SELECT Id FROM t1
    > UNION ALL
    > SELECT Id FROM t2
    > 
-----------------------------------------------------------------------------------------------------
    > EnumerableUnion(all=[true])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t1]])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t2]])
    > 
    > 1.29
    > EXPLAIN PLAN FOR
    > SELECT Id FROM t1
    > UNION
    > SELECT Id FROM t2
    > 
-----------------------------------------------------------------------------------------------------
    > EnumerableUnion(all=[false])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t1]])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t2]])
    > 
    > The next examples add a JOIN to one of the tables in order to make this a 
slightly more complicated query and trigger the issue. Unlike in 1.24, the 
plans in 1.29 for UNION and UNION ALL now vary significantly:
    > 
    > 1.24 (like the above queries,  the UNION and UNION ALL plans differ only 
by the `all` parameter, so the UNION ALL plan for 1.24 is omitted)
    > EXPLAIN PLAN FOR
    > SELECT Id FROM t1
    > UNION
    > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
    > 
-----------------------------------------------------------------------------------------------------
    > EnumerableUnion(all=[false])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t1]])
    >   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    >     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
    >       EnumerableSort(sort0=[$0], dir0=[ASC])
    >         EnumerableCalc(expr#0..262=[{inputs}], 
expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
    >           MyEnumerableConverter
    >             MyTableScan(table=[[MySchema, t2]])
    >       EnumerableSort(sort0=[$0], dir0=[ASC])
    >         MyEnumerableConverter
    >           MyProject(Id=[$0])
    >             MyTableScan(table=[[MySchema, t3]])
    > 
    > 1.29 (note this UNION ALL plan is nearly identical plan to the above 
`UNION` plan in 1.24)
    > EXPLAIN PLAN FOR
    > SELECT Id FROM t1
    > UNION ALL
    > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
    > 
-----------------------------------------------------------------------------------------------------
    > EnumerableUnion(all=[true])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t1]])
    >   EnumerableCalc(expr#0..1=[{inputs}], Id=[$t1])
    >     EnumerableMergeJoin(condition=[=($0, $1)], joinType=[inner])
    >       EnumerableSort(sort0=[$0], dir0=[ASC])
    >         EnumerableCalc(expr#0..262=[{inputs}], 
expr#263=[CAST($t178):BIGINT NOT NULL], t3_Id0=[$t263])
    >           MyEnumerableConverter
    >             MyTableScan(table=[[MySchema, t2]])
    >       EnumerableSort(sort0=[$0], dir0=[ASC])
    >         MyEnumerableConverter
    >           MyProject(Id=[$0])
    >             MyTableScan(table=[[MySchema, t3]])
    > 
    > 1.29 (note the loss of the `MyProject` node above t3's MyTableScan, and 
now the inputs for the post-MergeJoin EnumerableCalc node are very wide 
compared to before)
    > EXPLAIN PLAN FOR
    > SELECT Id FROM t1
    > UNION
    > SELECT t3.Id FROM t2 JOIN t3 ON (t3.Id = t2.t3_Id)
    > 
-----------------------------------------------------------------------------------------------------
    > EnumerableUnion(all=[false])
    >   MyEnumerableConverter
    >     MyProject(Id=[$0])
    >       MyTableScan(table=[[MySchema, t1]])
    >   EnumerableCalc(expr#0..402=[{inputs}], Id=[$t264])
    >     EnumerableMergeJoin(condition=[=($263, $264)], joinType=[inner])
    >       EnumerableSort(sort0=[$263], dir0=[ASC])
    >         EnumerableCalc(expr#0..262=[{inputs}], 
expr#263=[CAST($t178):BIGINT NOT NULL], proj#0..263=[{exprs}])
    >           MyEnumerableConverter
    >             MyTableScan(table=[[MySchema, t2]])
    >       EnumerableSort(sort0=[$0], dir0=[ASC])
    >         MyEnumerableConverter
    >           MyTableScan(table=[[MySchema, t3]])
    > 
    > I've tried removing various rules and even restricting them, but the 
custom Projects are still removed in the EnumerableMergeJoin despite setting 
the self-cost to zero.
    > 
    > Am I missing something? Any suggestions on where to look next?
    > 
    > Thanks for any assistance/tips.
    > 
    > Best,
    > Zack
    > 
    >

Re: [Help] UNION queries with one or more SELECTs containing a JOIN are significantly slower after 1.24

Reply via email to