[
https://issues.apache.org/jira/browse/CALCITE-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892534#comment-17892534
]
Stamatis Zampetakis commented on CALCITE-6640:
----------------------------------------------
[~julianhyde] The "duplicated" columns scenario that you mention is a slightly
more complex instance of the "repeated" columns scenario that I have in the
description. I am afraid that even if we generate minimal keys the problem is
still exponential.
The a[i] X b[j] example is quadratic because there are two columns in the
DISTINCT. If the DISTINCT had three columns then the problem becomes cubic
since we will have to compute a[i] X b[j] X c[k]. If we have _m_ unique columns
and each unique column is repeated/duplicated _k_ times then we have _k^m_
minimal keys.
Currently we generate all possible keys, and when there are column repetitions
the number of keys is something like 2^(k*m). The current implementation in
PR#4013 tries to mitigate the problem by creating minimal keys so it
essentially reduces the size from 2^(k*m) to k^m. If we want to reduce the size
further, I guess we will have to sacrifice precision.
Regarding the \{{assertMinimal}} check is it something that you would like to
enable only for the testing phase of this PR or you would like to add it
permanently in the code? I am OK with the former but rather skeptical for the
latter.
> RelMdUniqueKeys grows exponentially when key columns are repeated in
> projections
> --------------------------------------------------------------------------------
>
> Key: CALCITE-6640
> URL: https://issues.apache.org/jira/browse/CALCITE-6640
> Project: Calcite
> Issue Type: Bug
> Components: core
> Reporter: Stamatis Zampetakis
> Assignee: Stamatis Zampetakis
> Priority: Major
> Labels: pull-request-available
>
> Consider the following table where empno is a unique key column.
> {code:sql}
> CREATE TABLE emp (
> empno INT,
> ename VARCHAR,
> job VARCHAR
> PRIMARY KEY (empno));
> {code}
> The results of RelMetadataQuery#getUniqueKeys for the following queries are
> as follows:
> {code:sql}
> SELECT empno FROM emp;
> {0}
> SELECT ename, empno FROM emp;
> {1}
> SELECT empno, ename, empno FROM emp;
> {0}, {2}, {0, 2}
> SELECT empno, ename, empno, empno FROM emp;
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {code}
> When key columns are repeated in the project the result grows exponentially.
> This makes the unique key computation very expensive when there are many keys
> or when keys are repeated multiple times. The problem can lead to OOM errors
> and queries/rules hanging forever while trying to extract the keys.
> Observe, that the results above are not minimal so currently we are creating
> and returning a lot of redundant information.
> {noformat}
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {noformat}
> If we know that \{0\}, \{2\}, and \{3\} are unique keys individually then any
> superset of those is also a unique key so it is sufficient to return just
> those.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)