[jira] [Commented] (CALCITE-5559) Improve RepeatUnion by discarding duplicates at TableSpool level

Stamatis Zampetakis (Jira) Tue, 07 Mar 2023 02:09:05 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17697334#comment-17697334
 ]


Stamatis Zampetakis commented on CALCITE-5559:
----------------------------------------------

I am leaning more towards *B* even though it looks a bit more complicated.

If I understood correctly, we cannot have an {{Aggregate}} below a 
{{TableSpool}} operator due to underlying limitations of the respective 
{{Enumerable}} operators. This also means that a recursive SQL query with 
aggregation cannot be represented at this point in {{EnumerableConvention}}. 
Changing/enhancing the physical implementation of {{Aggregate}} has the 
additional benefit that these kinds of queries would be supported.

I know that many popular DBMS impose some limitations around the usage of GROUP 
BY, HAVING, DISTINCT, etc., in recursive queries. It may be worth trying to 
understand the reason behind these limitations before moving further with *B*, 
to avoid hitting a wall later on. We don't really need to adhere to or impose 
the same limitations as other DBMS, because algebra is more powerful than SQL, 
but it may help in making a more informed decision.

Regarding eager vs. lazy evaluation of aggregations, there is always the 
option of keeping both and letting the optimizer/rules decide which one to 
pick. I would prefer to maintain a single implementation, but if for some 
reason (performance, backwards compatibility, etc.) we want to keep both, we 
could.

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, RepeatUnion operator with all=false keeps track of the elements 
> that it has returned in order to discard duplicates. However, the TableSpool 
> operators that are right below it do not have such control. In certain 
> scenarios, duplicates are returned by the TableSpool current iteration, 
> discarded by the RepeatUnion, but have been already "fed back" by the 
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates 
> inside/before the TableSpool too (note: we still need to keep track of 
> duplicates at RepeatUnion level, because that is the only place where we can 
> detect a potential "global duplicate" of an element: returned by the LHS and 
> then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain 
> queries can go from ~40s down to ~1s.
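
To make the scenario in the description concrete, here is a rough, 
self-contained sketch in plain Java. It is a simplified model and not Calcite's 
actual {{RepeatUnion}}/{{TableSpool}} code: the class {{SpoolDedupSketch}}, the 
method {{run}}, the sample graph, and the termination rule (stop once an 
iteration adds no new row at the union level) are all assumptions made for 
illustration. It only counts how many rows get fed back into later iterations 
with and without a per-iteration duplicate check at the spool.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a recursive "reachable nodes" query; not Calcite code.
class SpoolDedupSketch {
  /** Distinct rows returned, plus the number of rows fed into iterations. */
  static final class Result {
    final Set<Integer> rows;
    final int fedBack;

    Result(Set<Integer> rows, int fedBack) {
      this.rows = rows;
      this.fedBack = fedBack;
    }
  }

  // Sample graph (assumption): three parallel paths from 1 to 5, then 5 -> 6.
  static final Map<Integer, List<Integer>> DIAMOND = Map.of(
      1, List.of(2, 3, 4),
      2, List.of(5),
      3, List.of(5),
      4, List.of(5),
      5, List.of(6));

  static Result run(Map<Integer, List<Integer>> edges, int seed, boolean dedupAtSpool) {
    Set<Integer> returned = new HashSet<>(); // union-level tracking (all=false)
    List<Integer> spool = new ArrayList<>(); // rows handed to the next iteration
    returned.add(seed);
    spool.add(seed);
    int fedBack = 0;
    boolean progressed = true;
    while (progressed) { // assumed rule: stop when an iteration adds nothing new
      progressed = false;
      fedBack += spool.size();
      List<Integer> next = new ArrayList<>();
      Set<Integer> seenThisIteration = new HashSet<>(); // spool-level tracking
      for (int node : spool) {
        for (int target : edges.getOrDefault(node, List.of())) {
          // Spool level: optionally drop duplicates within this iteration.
          if (!dedupAtSpool || seenThisIteration.add(target)) {
            next.add(target);
          }
          // Union level: still needed to catch duplicates across iterations.
          if (returned.add(target)) {
            progressed = true;
          }
        }
      }
      spool = next;
    }
    return new Result(returned, fedBack);
  }

  public static void main(String[] args) {
    System.out.println("without spool dedup: " + run(DIAMOND, 1, false).fedBack);
    System.out.println("with spool dedup:    " + run(DIAMOND, 1, true).fedBack);
  }
}
```

Both variants return the same distinct rows; only the amount of work fed back 
differs. The union-level set is kept in both variants because the per-iteration 
set cannot see duplicates that span iterations.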



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
