[
https://issues.apache.org/jira/browse/CALCITE-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271621#comment-17271621
]
Vladimir Sitnikov commented on CALCITE-3221:
--------------------------------------------
Issue description says:
{quote}Currently, the union operation offered by Calcite is based on a HashSet
(see EnumerableDefaults.union) and necessitates reading in memory all rows
before returning a single result.{quote}
I do not agree with that, since "necessitates reading in memory" is an
implementation detail (missing optimization) rather than something fundamental.
For instance, the following implementation would be able to return rows as soon
as a new unique row is detected. Note: the worst-case memory consumption of
this implementation is exactly the same as the consumption of the suggested
"sort-merge union".
{code:java}
public static <TSource> Enumerable<TSource> union(Enumerable<TSource> source0,
Enumerable<TSource> source1) {
Enumerable<TSource> unionAll = concat(source0, source1);
return new AbstractEnumerable<TSource>() {
@Override public Enumerator<TSource> enumerator() {
Set<TSource> set = new HashSet<>();
return EnumerableDefaults.where(unionAll, set::add).enumerator();
}
};
}
{code}
{quote}Apart from increased memory consumption the operator is blocking and
also destroys the order of its inputs.{quote}
That is an implementation detail as well.
The above improvement makes the operator non-blocking, and it keeps the order
of the rows exactly.
> Add a sort-merge union algorithm
> --------------------------------
>
> Key: CALCITE-3221
> URL: https://issues.apache.org/jira/browse/CALCITE-3221
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.19.0
> Reporter: Stamatis Zampetakis
> Assignee: Ruben Q L
> Priority: Minor
> Labels: pull-request-available
> Attachments: screenshot-1.png
>
> Time Spent: 10h 50m
> Remaining Estimate: 0h
>
> Currently, the union operation offered by Calcite is based on a {{HashSet}}
> (see
> [EnumerableDefaults.union|https://github.com/apache/calcite/blob/d98856bf1a5f5c151d004b769e14bdd368a67234/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L2747])
> and necessitates reading in memory all rows before returning a single
> result.
> Apart from increased memory consumption the operator is blocking and also
> destroys the order of its inputs.
> The goal of this issue is to add a new union algorithm (EnumerableMergeUnion
> ?) exploiting the fact that the inputs are sorted which consumes less memory
> and retains the order of its inputs.
> Most likely the implementation of the merge join can be useful.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)