[
https://issues.apache.org/jira/browse/CALCITE-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17703119#comment-17703119
]
Ruben Q L commented on CALCITE-5559:
------------------------------------
Any feedback on the proposed PR? Could we consider this a valid optimization
for this experimental feature?
> Improve RepeatUnion by discarding duplicates at TableSpool level
> ----------------------------------------------------------------
>
> Key: CALCITE-5559
> URL: https://issues.apache.org/jira/browse/CALCITE-5559
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Reporter: Ruben Q L
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Currently, the RepeatUnion operator with all=false keeps track of the elements
> that it has returned in order to discard duplicates. However, the TableSpool
> operators directly below it have no such control. In certain scenarios, the
> TableSpool returns duplicates in the current iteration; the RepeatUnion
> discards them, but the TableSpool has already "fed them back" into the next
> iteration, causing unnecessary processing.
> We can optimize this scenario by keeping track of the duplicates
> inside/before the TableSpool too (note: we still need to keep track of
> duplicates at RepeatUnion level, because that is the only place where we can
> detect a potential "global duplicate" of an element: returned by the LHS and
> then also by the RHS, or by two different iterations of the RHS).
> A PoC testing this improvement on a downstream project showed that certain
> queries can go from ~40s down to ~1s.
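The effect described in the issue can be illustrated with a toy model. The sketch below is not Calcite code and uses no Calcite APIs: the class SpoolDedupSketch, its methods, and the layered test graph are all hypothetical names introduced for illustration. It assumes an acyclic graph and models a recursive "reachable nodes" query where the `emitted` set plays the role of RepeatUnion with all=false, and the optional `spooled` set plays the role of the proposed duplicate tracking at TableSpool level. It counts how many rows are fed back into the iterations in each mode.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical toy model, not the Calcite implementation.
class SpoolDedupSketch {

  // Builds an acyclic layered graph: node 0 points to layer 1, and every
  // node in layer k points to both nodes of layer k + 1, so the number of
  // distinct paths doubles at each layer.
  static Map<Integer, List<Integer>> layeredDag(int depth) {
    Map<Integer, List<Integer>> edges = new HashMap<>();
    edges.put(0, List.of(1, 2));
    for (int k = 1; k < depth; k++) {
      List<Integer> nextLayer = List.of(2 * k + 1, 2 * k + 2);
      edges.put(2 * k - 1, nextLayer);
      edges.put(2 * k, nextLayer);
    }
    return edges;
  }

  // Simulates the iterations starting from seed. Distinct results are
  // collected into emitted; the return value is the total number of rows
  // fed back into iterations (the work the issue wants to reduce).
  static long reach(Map<Integer, List<Integer>> edges, int seed,
      boolean dedupAtSpool, Set<Integer> emitted) {
    Set<Integer> spooled = new HashSet<>(); // spool-level tracking (proposed)
    List<Integer> delta = new ArrayList<>();
    emitted.add(seed);
    spooled.add(seed);
    delta.add(seed);
    long rowsFedBack = 0;
    while (!delta.isEmpty()) {
      rowsFedBack += delta.size();
      List<Integer> next = new ArrayList<>();
      for (int node : delta) {
        for (int target : edges.getOrDefault(node, List.of())) {
          // "RepeatUnion" level: duplicates never reach the final output.
          emitted.add(target);
          // "TableSpool" level: without tracking, every row is fed back,
          // including rows that the union has just discarded.
          if (!dedupAtSpool || spooled.add(target)) {
            next.add(target);
          }
        }
      }
      delta = next;
    }
    return rowsFedBack;
  }

  public static void main(String[] args) {
    Map<Integer, List<Integer>> graph = layeredDag(10);
    Set<Integer> plain = new HashSet<>();
    Set<Integer> deduped = new HashSet<>();
    System.out.println("rows fed back, union-only dedup: "
        + reach(graph, 0, false, plain));
    System.out.println("rows fed back, spool-level dedup: "
        + reach(graph, 0, true, deduped));
    System.out.println("same result set: " + plain.equals(deduped));
  }
}
```

In this model, with a depth of 10, both modes produce the same 21 distinct nodes, but 2047 rows are fed back with union-only tracking versus 21 with spool-level tracking. The `emitted` set is kept in both modes, mirroring the note above that duplicates must still be tracked at RepeatUnion level.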
--
This message was sent by Atlassian Jira
(v8.20.10#820010)