[
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17698388#comment-17698388
]
Ruben Q L commented on CALCITE-5559:
------------------------------------
I have created [PR#3101|https://github.com/apache/calcite/pull/3101], which
shows what approach A might look like.
> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
> Key: CALCITE-5559
> URL: https://issues.apache.org/jira/browse/CALCITE-5559
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Reporter: Ruben Q L
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, the RepeatUnion operator with all=false keeps track of the
> elements that it has returned in order to discard duplicates. However, the
> TableSpool operators directly below it do no such tracking. In certain
> scenarios, duplicates are returned by the TableSpool in the current
> iteration and discarded by the RepeatUnion, but have already been "fed back"
> by the TableSpool into the next iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates
> inside/before the TableSpool too (note: we still need to keep track of
> duplicates at RepeatUnion level, because that is the only place where we can
> detect a potential "global duplicate" of an element: returned by the LHS and
> then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.
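To illustrate the effect described above, here is a minimal, self-contained
sketch. It is a simplified model, not Calcite code and not necessarily what
the PR does: the class SpoolDedupSketch, the method repeatUnion, the
rowsFedBack counter and DOUBLING_STEP are all hypothetical names, and it
assumes that the spool simply drops any row it has already spooled. Rows are
plain integers, and the "RHS" is a function from one row to the rows it
produces.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical, simplified model; none of these names are Calcite APIs.
final class SpoolDedupSketch {

  /** Distinct output rows plus the number of rows fed back into iterations. */
  static final class Result {
    final Set<Integer> rows;
    final int rowsFedBack;

    Result(Set<Integer> rows, int rowsFedBack) {
      this.rows = rows;
      this.rowsFedBack = rowsFedBack;
    }
  }

  /** Example RHS: every row below 10 produces its successor twice. */
  static final Function<Integer, List<Integer>> DOUBLING_STEP =
      n -> n < 10
          ? Arrays.asList(n + 1, n + 1)
          : Collections.<Integer>emptyList();

  static Result repeatUnion(List<Integer> seed,
      Function<Integer, List<Integer>> step, boolean dedupAtSpool) {
    // Union-level tracking (all=false): always needed for the final result.
    Set<Integer> returned = new LinkedHashSet<>();
    // Spool-level tracking: the additional check this sketch assumes.
    Set<Integer> spooled = new HashSet<>();
    List<Integer> working = new ArrayList<>();
    int rowsFedBack = 0;

    // Seed (LHS): rows pass through the spool, then reach the union.
    for (Integer row : seed) {
      if (!dedupAtSpool || spooled.add(row)) {
        working.add(row);
      }
      returned.add(row);
    }

    // Iterative part (RHS): repeat until the spool feeds nothing back.
    while (!working.isEmpty()) {
      rowsFedBack += working.size();
      List<Integer> next = new ArrayList<>();
      for (Integer input : working) {
        for (Integer row : step.apply(input)) {
          // Without the spool-level check, duplicates are fed back even
          // though the union discards them from the output.
          if (!dedupAtSpool || spooled.add(row)) {
            next.add(row);
          }
          returned.add(row);
        }
      }
      working = next;
    }
    return new Result(returned, rowsFedBack);
  }

  public static void main(String[] args) {
    List<Integer> seed = Collections.singletonList(1);
    Result without = repeatUnion(seed, DOUBLING_STEP, false);
    Result withDedup = repeatUnion(seed, DOUBLING_STEP, true);
    System.out.println("same rows: " + without.rows.equals(withDedup.rows));
    System.out.println("rows fed back: " + without.rowsFedBack
        + " vs " + withDedup.rowsFedBack);
  }
}
```

In this toy workload both variants return the same ten distinct rows, but
the variant without spool-level tracking feeds 1023 rows back into the
iterations, against 10 with it. The numbers only illustrate the mechanism;
they say nothing about the actual speedup in Calcite.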