[ 
https://issues.apache.org/jira/browse/CALCITE-6640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17893118#comment-17893118
 ] 

Julian Hyde commented on CALCITE-6640:
--------------------------------------

Thanks for clarifying. I think there are two problems here. The first, which 
you have fixed in the PR, is failure to abide by the existing contract. That 
contract is expressed in my assertMinimal function. I think that we should 
enforce that contract - throw at runtime if the set of keys is not minimal - in 
order to root out bad metadata producers. 

The second problem is harder to solve, because the minimal list of keys has 
exponential size. I believe it only occurs when keys are composite and at least 
one of the columns in those keys are projected repeatedly. 

I think the problems should be solved with independent commits. 

My proposed solution to the second problem is to add an optional “int limit” 
parameter to the unique keys metadata method. -1 should mean no limit but the 
default should be a moderate value, say 1000. 

When columns are projected repeatedly the implementation should (not must) 
generate the keys based on the first occurrences of the repetition. Thus the 
subset of keys returned will be more likely to include at least one instance of 
each distinct key. (Hope it’s clear by what I mean by “distinct”.)



> RelMdUniqueKeys grows exponentially when key columns are repeated in 
> projections
> --------------------------------------------------------------------------------
>
>                 Key: CALCITE-6640
>                 URL: https://issues.apache.org/jira/browse/CALCITE-6640
>             Project: Calcite
>          Issue Type: Bug
>          Components: core
>            Reporter: Stamatis Zampetakis
>            Assignee: Stamatis Zampetakis
>            Priority: Major
>              Labels: pull-request-available
>
> Consider the following table where empno is a unique key column.
> {code:sql}
> CREATE TABLE emp (
>  empno INT, 
>  ename VARCHAR, 
>  job VARCHAR
>  PRIMARY KEY (empno));
> {code}
> The results of RelMetadataQuery#getUniqueKeys for the following queries are 
> as follows:
> {code:sql}
> SELECT empno FROM emp;
> {0}
> SELECT ename, empno FROM emp;
> {1} 
> SELECT empno, ename, empno FROM emp;
> {0}, {2}, {0, 2}
> SELECT empno, ename, empno, empno FROM emp;
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {code}
> When key columns are repeated in the project the result grows exponentially. 
> This makes the unique key computation very expensive when there are many keys 
> or when keys are repeated multiple times. The problem can lead to OOM errors 
> and queries/rules hanging forever while trying to extract the keys.
> Observe, that the results above are not minimal so currently we are creating 
> and returning a lot of redundant information.
> {noformat}
> {0}, {2}, {3}, {0, 2}, {0 3}, {2, 3}, {0, 2, 3}
> {noformat}
> If we know that \{0\}, \{2\}, and \{3\} are unique keys individually then any 
> superset of those is also a unique key so it is sufficient to return just 
> those.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to