[ 
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17696956#comment-17696956
 ] 

Ruben Q L commented on CALCITE-5559:
------------------------------------

Thanks for the input [~zabetak], indeed that seems a valid, probably more 
natural approach., I'll explore it.
One way or the other, the goal will be to remove the duplicates "at a lower 
level" than the RepeatUnion.

> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
>                 Key: CALCITE-5559
>                 URL: https://issues.apache.org/jira/browse/CALCITE-5559
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>            Reporter: Ruben Q L
>            Assignee: Ruben Q L
>            Priority: Major
>
> Currently, RepeatUnion operator with all=false keeps track of the elements 
> that it has returned in order to discard duplicates. However, the TableSpool 
> operators that are right below it do not have such control. In certain 
> scenarios, duplicates are returned by the TableSpool current iteration, 
> discarded by the RepeatUnion, but have been already "fed back" by the 
> TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates inside the 
> TableSpool too (note: we still need to keep track of duplicates at 
> RepeatUnion level, because that is the only place where we can detect a 
> potential "global duplicate" of an element: returned by the LHS and then also 
> by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain 
> queries can go from ~40s down to ~1s.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to