[
https://issues.apache.org/jira/browse/CALCITE-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893963#comment-17893963
]
Julian Hyde commented on CALCITE-6640:
--------------------------------------
As an aside. I was surprised to find that 'select a as a0, a as a1, b as b0, b
as b1, c as c0, c as c1 from (select distinct a, b, c from t)' (3 groups of 2
columns has 2^3 = 8 keys) had fewer keys than 'select a as a0, a as a1, a as
a2, b as b0, b as b1, b as b2 from (select distinct a, b from t)' (2 groups of
3 columns has 3^2 = 9 keys), so I kicked off an exploration of combinatorics
and optimization.
If there there are X groups of Y columns then the number of keys is X^Y. I had
assumed that if X*Y is constant (e.g. there are 6 fields arranged into groups
of size X) then X^Y was maximized when X = 2. In this case, X^Y is maximized
when X=3.
I found a [fascinating video|https://www.youtube.com/watch?v=zdAJXil-NvA] on
maximizing X^Y when X+Y=constant, and it leads to the Lambert W function.
> RelMdUniqueKeys grows exponentially when key columns are repeated in
> projections
> --------------------------------------------------------------------------------
>
> Key: CALCITE-6640
> URL: https://issues.apache.org/jira/browse/CALCITE-6640
> Project: Calcite
> Issue Type: Bug
> Components: core
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
>
> Consider the following table where empno is a unique key column.
> {code:sql}
> CREATE TABLE emp (
> empno INT,
> ename VARCHAR,
> job VARCHAR
> PRIMARY KEY (empno));
> {code}
> The results of RelMetadataQuery#getUniqueKeys for the following queries are
> as follows:
> {code:sql}
> SELECT empno FROM emp;
> {0}
> SELECT ename, empno FROM emp;
> {1}
> SELECT empno, ename, empno FROM emp;
> {0}, {2}, {0, 2}
> SELECT empno, ename, empno, empno FROM emp;
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {code}
> When key columns are repeated in the project the result grows exponentially.
> This makes the unique key computation very expensive when there are many keys
> or when keys are repeated multiple times. The problem can lead to OOM errors
> and queries/rules hanging forever while trying to extract the keys.
> Observe, that the results above are not minimal so currently we are creating
> and returning a lot of redundant information.
> {noformat}
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {noformat}
> If we know that \{0\}, \{2\}, and \{3\} are unique keys individually then any
> superset of those is also a unique key so it is sufficient to return just
> those.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)