iodone opened a new issue #1832:
URL: https://github.com/apache/incubator-kyuubi/issues/1832


   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://www.apache.org/foundation/policies/conduct)
   
   
   ### Search before asking
   
   - [X] I have searched in the 
[issues](https://github.com/apache/incubator-kyuubi/issues?q=is%3Aissue) and 
found no similar issues.
   
   
   ### Describe the bug
   
   After setting `watchdog.forcedMaxOutputRows`, statements that contain subqueries have the limit condition erroneously pushed down into the subquery in the final generated logical plan.
   
   SQL: 
   ```
   CREATE TABLE spark_catalog.`default`.tmp_table1(KEY INT, VALUE STRING) USING 
PARQUET;
   INSERT INTO TABLE spark_catalog.`default`.tmp_table1 VALUES (1, 
'aa'),(2,'bb'),(3, 'cc'),(4,'aa'),(5,'cc'),(6, 'aa');
   
   select count(*)
   from tmp_table1
   where tmp_table1.key in (
     select distinct tmp_table1.key
     from tmp_table1
     where tmp_table1.value = "aa"
     );
   ```
   Analyzed LogicalPlan:
   ```
   Aggregate [count(1) AS count(1)#62L]
   +- Filter key#56 IN (list#60 [])
      :  +- GlobalLimit 1
      :     +- LocalLimit 1
      :        +- Distinct
      :           +- Project [key#56]
      :              +- Filter (value#57 = aa)
      :                 +- SubqueryAlias spark_catalog.default.tmp_table1
      :                    +- Relation[KEY#56,VALUE#57] parquet
      +- SubqueryAlias spark_catalog.default.tmp_table1
         +- Relation[KEY#56,VALUE#57] parquet
   ```
   The forced limit is pushed down into the IN-list subquery inside the Filter, which makes the final query result incorrect: the subquery should return every distinct key whose value is 'aa' (1, 4 and 6), but with `GlobalLimit 1` pushed into it the IN-list contains only one of them.
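   
   To make the semantic difference concrete, here is a plain-Scala sketch of the IN-list evaluation (no Spark needed; the data is copied from the INSERT above, and `forcedMaxOutputRows = 1` is assumed, matching the `GlobalLimit 1` in the plan):
   ```
   // Data from the INSERT above.
   val rows = Seq(1 -> "aa", 2 -> "bb", 3 -> "cc", 4 -> "aa", 5 -> "cc", 6 -> "aa")

   // What the IN-list subquery should produce: every distinct key with value 'aa'.
   val fullInList = rows.collect { case (k, "aa") => k }.distinct   // Seq(1, 4, 6)

   // What it produces once GlobalLimit 1 / LocalLimit 1 is pushed into it.
   val limitedInList = fullInList.take(1)                           // Seq(1)

   println(rows.count { case (k, _) => fullInList.contains(k) })    // 3, the correct count
   println(rows.count { case (k, _) => limitedInList.contains(k) }) // 1, after the pushdown
   ```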
   
   ### Bug Tracing
   The `ForcedMaxOutputRowsRule` is injected in the analyzer phase via `extensions.injectPostHocResolutionRule(ForcedMaxOutputRowsRule)`. A look at the analyzer's rule batches reveals:
   
![image](https://user-images.githubusercontent.com/5451385/150771275-baff18bb-24b8-40a8-bf5c-67f8a7ba094b.png)
   During analysis, SQL with subqueries first enters the `ResolveSubquery` batch, and the implementation of `ResolveSubquery` shows:
   
![image](https://user-images.githubusercontent.com/5451385/150771503-6b0d0192-6184-4765-9af7-112d8a6a9ac0.png)
   `ResolveSubquery` calls the analyzer's `execute` method again on each subquery's plan, so analyzing a SQL statement with subqueries recursively runs all of the analyzer's batches. Since we added the `ForcedMaxOutputRowsRule` to those batches, the subqueries also get wrapped with the Limit rule, which ultimately makes the semantics of the generated logical plan above inconsistent with what we expect.
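   
   For illustration, here is a minimal sketch of such a post-hoc resolution rule (not Kyuubi's actual implementation; the class name and max-rows handling are simplified). Because `ResolveSubquery` re-runs the analyzer's batches on each subquery plan, the wrapping case below fires on the IN-list subquery too:
   ```
   import org.apache.spark.sql.catalyst.expressions.Literal
   import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, Limit, LogicalPlan}
   import org.apache.spark.sql.catalyst.rules.Rule

   // Sketch: force a row limit by wrapping the root of every resolved plan.
   case class ForcedMaxOutputRowsSketch(maxRows: Int) extends Rule[LogicalPlan] {
     override def apply(plan: LogicalPlan): LogicalPlan = plan match {
       case _: GlobalLimit => plan // already limited, leave it alone
       // When the analyzer recurses into a subquery, this case also fires on
       // the subquery's plan, producing the GlobalLimit 1 / LocalLimit 1
       // seen in the analyzed plan above.
       case _ if plan.resolved => Limit(Literal(maxRows), plan)
       case _ => plan
     }
   }
   ```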
   
   ### Solutions
   1. Move the `ForcedMaxOutputRowsRule` to the `projectOptimizerRule` stage so it is no longer applied during the analyzer's recursive subquery resolution (see the sketch below).
   2. Remove `extensions.injectResolutionRule(MarkAggregateOrderRule)`.
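   
   A sketch of solution 1 (the extension class, the rule variant, and the `Subquery` guard are illustrative, not Kyuubi's actual code). Spark's optimizer wraps each nested subquery plan in a `Subquery` node before re-optimizing it, so an optimizer-phase rule can skip those roots:
   ```
   import org.apache.spark.sql.SparkSessionExtensions
   import org.apache.spark.sql.catalyst.expressions.Literal
   import org.apache.spark.sql.catalyst.plans.logical.{GlobalLimit, Limit, LogicalPlan, Subquery}
   import org.apache.spark.sql.catalyst.rules.Rule

   // Optimizer-phase variant of the rule.
   case class ForcedMaxOutputRowsOptimizerSketch(maxRows: Int) extends Rule[LogicalPlan] {
     override def apply(plan: LogicalPlan): LogicalPlan = plan match {
       case _: Subquery => plan    // re-optimized subquery: keep the limit out of it
       case _: GlobalLimit => plan // already limited
       case _ if maxRows > 0 => Limit(Literal(maxRows), plan)
       case _ => plan
     }
   }

   // Wiring it up through the extension entry point.
   class WatchdogExtensionSketch extends (SparkSessionExtensions => Unit) {
     override def apply(extensions: SparkSessionExtensions): Unit = {
       extensions.injectOptimizerRule { session =>
         ForcedMaxOutputRowsOptimizerSketch(
           session.conf.get("spark.sql.watchdog.forcedMaxOutputRows", "-1").toInt)
       }
     }
   }
   ```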
   ### Some questions
   I don't see any cases in the unit tests that hit the Aggregate rule without the Limit restriction. After I removed the `MarkAggregateOrderRule` extension and moved the `ForcedMaxOutputRowsRule` into the projectOptimizerRule phase, all the unit tests passed.
   
   
   ### Affects Version(s)
   
   master/1.4.0
   
   ### Kyuubi Server Log Output
   
   _No response_
   
   ### Kyuubi Engine Log Output
   
   _No response_
   
   ### Kyuubi Server Configurations
   
   _No response_
   
   ### Kyuubi Engine Configurations
   
   _No response_
   
   ### Additional context
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [X] Yes I am willing to submit a PR!

