[ 
https://issues.apache.org/jira/browse/CALCITE-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17892534#comment-17892534
 ] 

Stamatis Zampetakis commented on CALCITE-6640:
----------------------------------------------

[~julianhyde] The "duplicated" columns scenario that you mention is a slightly 
more complex instance of the "repeated" columns scenario that I have in the 
description. I am afraid that even if we generate minimal keys the problem is 
still exponential.

The a[i] X b[j] example is quadratic because there are two columns in the 
DISTINCT. If the DISTINCT had three columns then the problem becomes cubic 
since we will have to compute a[i] X b[j] X c[k]. If we have _m_ unique columns 
and each unique column is repeated/duplicated _k_ times then we have _k^m_ 
minimal keys.

Currently we generate all possible keys, and when there are column repetitions 
the number of keys is something like 2^(k*m). The current implementation in 
PR#4013 tries to mitigate the problem by creating minimal keys so it 
essentially reduces the size from 2^(k*m) to k^m. If we want to reduce the size 
further, I guess we will have to sacrifice precision.

Regarding the \{{assertMinimal}} check is it something that you would like to 
enable only for the testing phase of this PR or you would like to add it 
permanently in the code?  I am OK with the former but rather skeptical for the 
latter.

> RelMdUniqueKeys grows exponentially when key columns are repeated in 
> projections
> --------------------------------------------------------------------------------
>
>                 Key: CALCITE-6640
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6640
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>
> Consider the following table where empno is a unique key column.
> {code:sql}
> CREATE TABLE emp (
>  empno INT, 
>  ename VARCHAR, 
>  job VARCHAR
>  PRIMARY KEY (empno));
> {code}
> The results of RelMetadataQuery#getUniqueKeys for the following queries are 
> as follows:
> {code:sql}
> SELECT empno FROM emp;
> {0}
> SELECT ename, empno FROM emp;
> {1} 
> SELECT empno, ename, empno FROM emp;
> {0}, {2}, {0, 2}
> SELECT empno, ename, empno, empno FROM emp;
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {code}
> When key columns are repeated in the project the result grows exponentially. 
> This makes the unique key computation very expensive when there are many keys 
> or when keys are repeated multiple times. The problem can lead to OOM errors 
> and queries/rules hanging forever while trying to extract the keys.
> Observe, that the results above are not minimal so currently we are creating 
> and returning a lot of redundant information.
> {noformat}
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {noformat}
> If we know that \{0\}, \{2\}, and \{3\} are unique keys individually then any 
> superset of those is also a unique key so it is sufficient to return just 
> those.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to