[jira] [Commented] (CALCITE-3221) Add a sort-merge union algorithm

Vladimir Sitnikov (Jira) Mon, 25 Jan 2021 11:31:08 -0800


    [ 
https://issues.apache.org/jira/browse/CALCITE-3221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17271621#comment-17271621
 ]


Vladimir Sitnikov commented on CALCITE-3221:
--------------------------------------------

Issue description says:
{quote}Currently, the union operation offered by Calcite is based on a HashSet 
(see EnumerableDefaults.union) and necessitates reading in memory all rows 
before returning a single result.{quote}

I do not agree with that, since "necessitates reading in memory" is an 
implementation detail (missing optimization) rather than something fundamental.

For instance, the following implementation would be able to return rows as soon 
as a new unique row is detected. Note: the worst-case memory consumption of 
this implementation is exactly the same as the consumption of the suggested 
"sort-merge union".

{code:java}
  public static <TSource> Enumerable<TSource> union(Enumerable<TSource> source0,
      Enumerable<TSource> source1) {
    Enumerable<TSource> unionAll = concat(source0, source1);
    return new AbstractEnumerable<TSource>() {
      @Override public Enumerator<TSource> enumerator() {
        Set<TSource> set = new HashSet<>();
        return EnumerableDefaults.where(unionAll, set::add).enumerator();
      }
    };
  }
{code}

{quote}Apart from increased memory consumption the operator is blocking and 
also destroys the order of its inputs.{quote}
That is an implementation detail as well.
The above improvement makes the operator non-blocking, and it keeps the order 
of the rows exactly.

> Add a sort-merge union algorithm
> --------------------------------
>
>                 Key: CALCITE-3221
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3221
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.19.0
>            Reporter: Stamatis Zampetakis
>            Assignee: Ruben Q L
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: screenshot-1.png
>
>          Time Spent: 10h 50m
>  Remaining Estimate: 0h
>
> Currently, the union operation offered by Calcite is based on a {{HashSet}} 
> (see 
> [EnumerableDefaults.union|https://github.com/apache/calcite/blob/d98856bf1a5f5c151d004b769e14bdd368a67234/linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java#L2747])
>  and necessitates reading in memory all rows before returning a single 
> result.   
> Apart from increased memory consumption the operator is blocking and also 
> destroys the order of its inputs.  
> The goal of this issue is to add a new union algorithm (EnumerableMergeUnion 
> ?) exploiting the fact that the inputs are sorted which consumes less memory 
> and retains the order of its inputs.   
> Most likely the implementation of the merge join can be useful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (CALCITE-3221) Add a sort-merge union algorithm

Reply via email to