avamingli commented on PR #900:
URL: https://github.com/apache/cloudberry/pull/900#issuecomment-2662081594
> > My goal here is to do some "magic" for database users, not for us as
database developers.
>
> As you mentioned, this does not benefit the kernel code in any way. On the
contrary, it masks the problem and increases maintenance complexity. I stand by
my point that this PR is not a good idea.
>
> Your example shows that ORCA is significantly slower when inserting data,
but your conclusion generalizes to so-called simple queries. However, there are
no examples or test data provided. For instance, what is the query plan for a
statement like `select * from t`? This test seems simple, yet it is missing.
Additionally, as I mentioned, what proportion of these so-called simple queries
are actually `insert` statements? If the majority of the issues are caused by
`insert`, we have even less reason to disable simple queries like `select`.
>
> I don’t mean to be rude, but while I find this requirement hardly worth
discussing in terms of code, the code itself is quite sloppy. For example, the
issue I raised about subqueries is not addressed in your code.
Besides all the concerns which are not resolved: **subquery, proportion of
these so-called simple queries are actually `insert` statements.**
I believe there are fundamental issues with the current implementation
regarding the optimizer_relations_threshold.
**1.Partitioned Tables:**
Partitioned Tables: When using a partitioned table (let's call it par) with
optimizer_relations_threshold set to 1, this means that ORCA is chosen as the
optimizer. However, when there are two or more range tables involved in a
query, the optimizer_relations_threshold can prevent ORCA from being utilized
effectively. This undermines one of ORCA's key features: partition pruning.
Consequently, the benefits of ORCA are lost.
```sql
CREATE TABLE partrl (a int, b int, c int)
DISTRIBUTED BY (a)
PARTITION BY range(b)
SUBPARTITION BY list(c)
(
PARTITION p1 START (10) END (20) EVERY (5)
(
SUBPARTITION sp1 VALUES (1, 2)
),
PARTITION p2 START (0) END (10)
(
SUBPARTITION sp2 VALUES (3, 4),
SUBPARTITION sp1 VALUES (1, 2),
DEFAULT SUBPARTITION others
)
);
SELECT * FROM patrl;
```
A single entry in rtable but in fact there would be all children tables
during planner expanding those.
**2.Inherited Tables:**
Inherited Tables: The same logic applies to inherited tables. When selecting
from a parent table, there can be multiple child tables. This scenario also
violates the notion of a "simple query," which is defined as having fewer range
table entries than the optimizer_relations_threshold.
**3. Union/INTESECT/xxx**
Union Operations: Similarly, in cases involving UNION, such as SELECT * FROM
t1 UNION ALL SELECT * FROM t2, the planner expands these into multiple range
tables, leading to scenarios where the number of range table entries exceeds
the threshold.
**4. GUC compatibility**
Adding this code may create complications in the future, especially if we
later resolve the issues with ORCA. We could end up with redundant code and
GUCs that are no longer useful, which could lead to confusion and compatibility
issues for users who have already adopted them.
**5. Put the ORCA codes into ORCA**
BTW, one simple rule is that ORCA GUCs should be used in ORCA codes, not pg
planner side.
```optimizer_xxxx``` means it belongs to ORCA c++ codes, even you tried to
implement it, please obey the rule if you want to fallback to pg planner for
some reasons.
These examples illustrate how the current design of
optimizer_relations_threshold is problematic. I suspect there may be additional
issues that I have not identified.
While I have outlined these concerns, I do not expect immediate fixes.
My point is that we should address the root problems with ORCA rather than
introducing workarounds that complicate the kernel codes.
I'm not convinced that it's worth discussing further unless there are
stronger reasons and test data presented as it was.
Instead of pursuing this approach, our focus should be on fixing the real
problems with ORCA.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]