Re: [PR] [multistage][bugfix] eliminate multiple exchanges [pinot]

via GitHub Thu, 26 Oct 2023 12:51:11 -0700


walterddr commented on code in PR #11882:
URL: https://github.com/apache/pinot/pull/11882#discussion_r1373742525



##########
pinot-query-planner/src/test/resources/queries/JoinPlans.json:
##########
@@ -421,6 +421,28 @@
           "\n              LogicalTableScan(table=[[b]])",
           "\n"
         ]
+      },
+      {
+        "description": "nexted reused tmp table SEMI JOINs",
+        "sql": "EXPLAIN PLAN FOR WITH tmp1 AS ( SELECT * FROM a WHERE col2 NOT 
IN ('foo', 'bar') ), tmp2 AS ( SELECT * FROM b WHERE col1 IN (SELECT col1 FROM 
tmp1) AND col3 < 100 ) SELECT * FROM tmp2 WHERE col1 IN (SELECT col1 from tmp1 
WHERE col3 > 10)",
+        "output": [
+          "Execution Plan",
+          "\nLogicalJoin(condition=[=($0, $7)], joinType=[semi])",
+          "\n  PinotLogicalExchange(distribution=[hash[0]])",
+          "\n    LogicalJoin(condition=[=($0, $7)], joinType=[semi])",
+          "\n      LogicalFilter(condition=[<($2, 100)])",
+          "\n        LogicalTableScan(table=[[b]])",
+          "\n      PinotLogicalExchange(distribution=[broadcast], 
relExchangeType=[PIPELINE_BREAKER])",
+          "\n        LogicalProject(col1=[$0], col2=[$1])",
+          "\n          LogicalFilter(condition=[AND(<>($1, 'bar'), <>($1, 
'foo'))])",
+          "\n            LogicalTableScan(table=[[a]])",
+          "\n  PinotLogicalExchange(distribution=[hash[0]])",
+          "\n    LogicalProject(col1=[$0], col3=[$2])",
+          "\n      LogicalFilter(condition=[>($2, 10)])",

Review Comment:
   this is not unexpected. (we can potentially make it more smart). the problem 
is that
   ```
   WITH tmp1
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' )),
        tmp2
        AS (SELECT *
            FROM   b
            WHERE  col1 IN (SELECT col1
                            FROM   tmp1)
                   AND col3 < 100)
   SELECT *
   FROM   tmp2
   WHERE  col1 IN (SELECT col1
                   FROM   tmp1  <-- this is the tmp1 modification into tmp1'
                   WHERE  col3 > 10) 
   ```
   at this stage Calcite's doesn't know if you want to do a table spool of tmp1 
and make tmp1' derived from tmp1 --> thus col3 > 10 is not pushed down. 
   only when we decided that table spool is not possible and then we decided to 
copy tmp1 and tmp1' into 2 separate sub queries, at that time i think the hep 
optimizer already get passed the filter merging phase. 
   
   rewriting the query in this way
   ```
   WITH tmp1
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' )),
        tmp2
        AS (SELECT *
            FROM   b
            WHERE  col1 IN (SELECT col1
                            FROM   tmp1)
                   AND col3 < 100),
        tmp3 -- here we explicitly tell that tmp3 and tmp1 are not related
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' ) AND col3 > 10),
   SELECT *
   FROM   tmp2
   WHERE  col1 IN (SELECT col1
                   FROM   tmp3) 
   ```
   produces no multi-filter plan



##########
pinot-query-planner/src/test/resources/queries/JoinPlans.json:
##########
@@ -421,6 +421,28 @@
           "\n              LogicalTableScan(table=[[b]])",
           "\n"
         ]
+      },
+      {
+        "description": "nexted reused tmp table SEMI JOINs",
+        "sql": "EXPLAIN PLAN FOR WITH tmp1 AS ( SELECT * FROM a WHERE col2 NOT 
IN ('foo', 'bar') ), tmp2 AS ( SELECT * FROM b WHERE col1 IN (SELECT col1 FROM 
tmp1) AND col3 < 100 ) SELECT * FROM tmp2 WHERE col1 IN (SELECT col1 from tmp1 
WHERE col3 > 10)",
+        "output": [
+          "Execution Plan",
+          "\nLogicalJoin(condition=[=($0, $7)], joinType=[semi])",
+          "\n  PinotLogicalExchange(distribution=[hash[0]])",
+          "\n    LogicalJoin(condition=[=($0, $7)], joinType=[semi])",
+          "\n      LogicalFilter(condition=[<($2, 100)])",
+          "\n        LogicalTableScan(table=[[b]])",
+          "\n      PinotLogicalExchange(distribution=[broadcast], 
relExchangeType=[PIPELINE_BREAKER])",
+          "\n        LogicalProject(col1=[$0], col2=[$1])",
+          "\n          LogicalFilter(condition=[AND(<>($1, 'bar'), <>($1, 
'foo'))])",
+          "\n            LogicalTableScan(table=[[a]])",
+          "\n  PinotLogicalExchange(distribution=[hash[0]])",
+          "\n    LogicalProject(col1=[$0], col3=[$2])",
+          "\n      LogicalFilter(condition=[>($2, 10)])",

Review Comment:
   this is not unexpected. (we can potentially make it more smart). the problem 
is that
   ```
   WITH tmp1
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' )),
        tmp2
        AS (SELECT *
            FROM   b
            WHERE  col1 IN (SELECT col1
                            FROM   tmp1)
                   AND col3 < 100)
   SELECT *
   FROM   tmp2
   WHERE  col1 IN (SELECT col1
                   FROM   tmp1  -- this is the tmp1 modification into tmp1'
                   WHERE  col3 > 10) 
   ```
   at this stage Calcite's doesn't know if you want to do a table spool of tmp1 
and make tmp1' derived from tmp1 --> thus col3 > 10 is not pushed down. 
   only when we decided that table spool is not possible and then we decided to 
copy tmp1 and tmp1' into 2 separate sub queries, at that time i think the hep 
optimizer already get passed the filter merging phase. 
   
   rewriting the query in this way
   ```
   WITH tmp1
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' )),
        tmp2
        AS (SELECT *
            FROM   b
            WHERE  col1 IN (SELECT col1
                            FROM   tmp1)
                   AND col3 < 100),
        tmp3 -- here we explicitly tell that tmp3 and tmp1 are not related
        AS (SELECT *
            FROM   a
            WHERE  col2 NOT IN ( 'foo', 'bar' ) AND col3 > 10),
   SELECT *
   FROM   tmp2
   WHERE  col1 IN (SELECT col1
                   FROM   tmp3) 
   ```
   produces no multi-filter plan



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] [multistage][bugfix] eliminate multiple exchanges [pinot]

Reply via email to