Re: [PR] Do not call gporca for simple queries [cloudberry]

via GitHub Sun, 16 Feb 2025 21:56:12 -0800


avamingli commented on PR #900:
URL: https://github.com/apache/cloudberry/pull/900#issuecomment-2662081594


   > > My goal here is to do some "magic" for database users, not for us as 
database developers.
   > 
   > As you mentioned, this does not benefit the kernel code in any way. On the 
contrary, it masks the problem and increases maintenance complexity. I stand by 
my point that this PR is not a good idea.
   > 
   > Your example shows that ORCA is significantly slower when inserting data, 
but your conclusion generalizes to so-called simple queries. However, there are 
no examples or test data provided. For instance, what is the query plan for a 
statement like `select * from t`? This test seems simple, yet it is missing. 
Additionally, as I mentioned, what proportion of these so-called simple queries 
are actually `insert` statements? If the majority of the issues are caused by 
`insert`, we have even less reason to disable simple queries like `select`.
   > 
   > I don’t mean to be rude, but while I find this requirement hardly worth 
discussing in terms of code, the code itself is quite sloppy. For example, the 
issue I raised about subqueries is not addressed in your code.
   
   Besides all the concerns which are not resolved: **subquery, proportion of 
these so-called simple queries are actually `insert` statements.**
   
   I believe there are fundamental issues with the current implementation 
regarding the optimizer_relations_threshold.
   
   **1.Partitioned Tables:**
   
   Partitioned Tables: When using a partitioned table (let's call it par) with 
optimizer_relations_threshold set to 1, this means that ORCA is chosen as the 
optimizer. However, when there are two or more range tables involved in a 
query, the optimizer_relations_threshold can prevent ORCA from being utilized 
effectively. This undermines one of ORCA's key features: partition pruning. 
Consequently, the benefits of ORCA are lost.
   
   ```sql
   CREATE TABLE partrl (a int, b int, c int)
   DISTRIBUTED BY (a)
   PARTITION BY range(b)
   SUBPARTITION BY list(c)
   (
           PARTITION p1 START (10) END (20) EVERY (5)
           (
                   SUBPARTITION sp1 VALUES (1, 2)
           ),
           PARTITION p2 START (0) END (10)
           (
                   SUBPARTITION sp2 VALUES (3, 4),
                   SUBPARTITION sp1 VALUES (1, 2),
                   DEFAULT SUBPARTITION others
           )
   );
   
   SELECT * FROM patrl; 
   ```
   A single entry in rtable but in fact there would be all children tables 
during planner expanding those.
   
   **2.Inherited Tables:** 
   
   Inherited Tables: The same logic applies to inherited tables. When selecting 
from a parent table, there can be multiple child tables. This scenario also 
violates the notion of a "simple query," which is defined as having fewer range 
table entries than the optimizer_relations_threshold.
   
   **3. Union/INTESECT/xxx**
   
   Union Operations: Similarly, in cases involving UNION, such as SELECT * FROM 
t1 UNION ALL SELECT * FROM t2, the planner expands these into multiple range 
tables, leading to scenarios where the number of range table entries exceeds 
the threshold.
   
   
   
   **4. GUC compatibility**
   
   Adding this code may create complications in the future, especially if we 
later resolve the issues with ORCA. We could end up with redundant code and 
GUCs that are no longer useful, which could lead to confusion and compatibility 
issues for users who have already adopted them. 
   
   **5. Put the ORCA codes into  ORCA**
   
   BTW, one simple rule is that ORCA GUCs should be used in ORCA codes, not pg 
planner side. 
   ```optimizer_xxxx``` means it belongs to ORCA c++ codes, even you tried to 
implement it, please obey the rule if you want to fallback to pg planner for 
some reasons.
   
   
   These examples illustrate how the current design of 
optimizer_relations_threshold is problematic. I suspect there may be additional 
issues that I have not identified.
   While I have outlined these concerns, I do not expect immediate fixes.
   My point is that we should address the root problems with ORCA rather than 
introducing workarounds that complicate the kernel codes.
   
   I'm not convinced that it's worth discussing further unless  there are 
stronger reasons and test data presented as it was.
   Instead of pursuing this approach, our focus should be on fixing the real 
problems with ORCA.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Do not call gporca for simple queries [cloudberry]

Reply via email to