[
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated CALCITE-5559:
------------------------------------
Labels: pull-request-available (was: )
> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
> Key: CALCITE-5559
> URL: https://issues.apache.org/jira/browse/CALCITE-5559
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Reporter: Ruben Q L
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, the RepeatUnion operator with all=false keeps track of the
> elements that it has returned in order to discard duplicates. However, the
> TableSpool operators directly below it have no such control. In certain
> scenarios, duplicates are returned by the TableSpool in the current
> iteration and discarded by the RepeatUnion, but they have already been
> "fed back" by the TableSpool into the next iteration, causing unnecessary
> processing.
> We can optimize this scenario by also keeping track of duplicates
> inside/before the TableSpool. (Note: we still need to keep track of
> duplicates at the RepeatUnion level, because that is the only place where
> we can detect a potential "global duplicate" of an element: one returned by
> the LHS and then also by the RHS, or by two different iterations of the RHS.)
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.
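
To illustrate the effect described above, here is a minimal, self-contained
sketch in plain Java. It is a toy model and not Calcite code: the class
SpoolDedupSketch, the methods run, fillSpool and layeredStep, and the layered
test graph are all hypothetical names chosen for this example. The model
assumes that the rows written to the spool in one iteration become the input
of the next, that the union-level set catches global duplicates, and that
iteration stops once an iteration yields no new row.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

/** Toy model of a recursive union fed by a spool; NOT Calcite code. */
final class SpoolDedupSketch {

  /** Outcome of one simulated run. */
  static final class Result {
    final Set<Integer> emitted; // distinct rows returned by the union
    final long spooledRows;     // rows written to the spool over all iterations

    Result(Set<Integer> emitted, long spooledRows) {
      this.emitted = emitted;
      this.spooledRows = spooledRows;
    }
  }

  /** Runs the fixpoint loop; dedupAtSpool toggles the proposed optimization. */
  static Result run(List<Integer> seed,
      Function<Integer, List<Integer>> step, boolean dedupAtSpool) {
    Set<Integer> emitted = new LinkedHashSet<>(); // union-level tracking (all=false)
    long spooledRows = 0;
    List<Integer> spool = fillSpool(seed, dedupAtSpool);
    while (!spool.isEmpty()) {
      spooledRows += spool.size();
      boolean anyNew = false;
      for (Integer row : spool) {
        anyNew |= emitted.add(row); // global duplicates are dropped here only
      }
      if (!anyNew) {
        break; // assumption of this model: stop when nothing new was emitted
      }
      // Every spooled row, duplicate or not, is fed back into the next iteration.
      List<Integer> produced = new ArrayList<>();
      for (Integer row : spool) {
        produced.addAll(step.apply(row));
      }
      spool = fillSpool(produced, dedupAtSpool);
    }
    return new Result(emitted, spooledRows);
  }

  /** Writes rows to the spool, optionally dropping duplicates within the iteration. */
  static List<Integer> fillSpool(List<Integer> rows, boolean dedup) {
    return dedup ? new ArrayList<>(new LinkedHashSet<>(rows)) : new ArrayList<>(rows);
  }

  /** Layered graph: every node of layer k points to every node of layer k + 1. */
  static List<Integer> layeredStep(int node, int width, int depth) {
    int layer = node / width;
    List<Integer> out = new ArrayList<>();
    if (layer < depth) {
      for (int i = 0; i < width; i++) {
        out.add((layer + 1) * width + i);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    Function<Integer, List<Integer>> step = n -> layeredStep(n, 3, 4);
    Result plain = run(Arrays.asList(0), step, false);
    Result dedup = run(Arrays.asList(0), step, true);
    System.out.println("spooled without dedup: " + plain.spooledRows);
    System.out.println("spooled with dedup:    " + dedup.spooledRows);
    System.out.println("same result set:       " + plain.emitted.equals(dedup.emitted));
  }
}
```

With width 3 and depth 4, both runs return the same 13 distinct rows, but the
spool receives 121 rows without de-duplication and only 13 with it. The gap
grows with the number of distinct paths that reach the same row, which is
consistent with the kind of speed-up reported for the PoC, although the real
gain depends on the query and the data.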