[
https://issues.apache.org/jira/browse/CALCITE-3820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ruben Q L updated CALCITE-3820:
-------------------------------
Fix Version/s: 1.23.0
> EnumerableDefaults#orderBy should be lazily computed + support enumerator
> re-initialization
> -------------------------------------------------------------------------------------------
>
> Key: CALCITE-3820
> URL: https://issues.apache.org/jira/browse/CALCITE-3820
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.21.0
> Reporter: Ruben Q L
> Assignee: Ruben Q L
> Priority: Major
> Labels: pull-request-available
> Fix For: 1.23.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Currently, {{EnumerableDefaults#orderBy}} (which is the actual implementation
> behind EnumerableSort) looks like this:
> {code:java}
> public static <TSource, TKey> Enumerable<TSource> orderBy(
> Enumerable<TSource> source, Function1<TSource, TKey> keySelector,
> Comparator<TKey> comparator) {
> final Map<TKey, List<TSource>> map = new TreeMap<>(comparator);
> LookupImpl<TKey, TSource> lookup = toLookup_(map, source, keySelector,
> Functions.identitySelector());
> return lookup.valuesEnumerable();
> }
> {code}
> This implementation has the following disadvantages:
> - 1) The TreeMap that performs the actual sorting is eagerly computed. This
> operation can be quite heavy if we have a very big source. The fact that it
> is eagerly computed might lead to bad performance in situations where the
> sort is executed, but it is actually not needed. For example, in a
> NestedLoopJoin, if the outer enumerator returns no results, the inner one is
> not even accessed. If we had a huge orderBy as inner, today it would be
> computed in that scenario, even though it is not actually required. For this
> reason, in terms of performance it seems clearly preferable to delay the sort
> operation as much as possible.
> - 2) The Enumerable / Enumerator returned by {{EnumerableDefaults#orderBy}}
> cannot be re-evaluated. Since the map and the subsequent {{LookupImpl}} that
> the Enumerable relies on are eagerly computed, every called to {{enumerator}}
> will return the same values, *even if the source Enumerable has changed*.
> This is a corner case, but we can see it if, for example, we have a
> MergeJoin, with EnumerableSort (i.e. using {{EnumerableDefaults#orderBy}})
> inside a RepeatUnion (a.k.a. recursive union). In this situation, the
> RepeatUnion re-evaluates its right child, so that the output of the iteration
> N-1 is used as input for the iteration N. In this scenario, the
> {{EnumerableDefaults#orderBy}} will not work as expected, since it will not
> be actually re-computed (see {{EnumerableMethods#repeatUnion}})
--
This message was sent by Atlassian Jira
(v8.3.4#803005)