[ 
https://issues.apache.org/jira/browse/CALCITE-2648?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16734082#comment-16734082
 ] 

Vladimir Sitnikov commented on CALCITE-2648:
--------------------------------------------

[~julianhyde], I'm looking into the ignored test 
PlannerTest#testDuplicateSortPlanWithOver 
[https://github.com/apache/calcite/blob/b54f6de9d7f87e9853fc9ec01b586555a089b913/core/src/test/java/org/apache/calcite/tools/PlannerTest.java#L442-L445]

 

TL;DR:

a) I suggest updating LogicalWindow's traits: it should produce both 
collation and distribution. For instance, over (partition by deptno order by 
name) should produce distribution=hash, collation=[name]

b) I suggest updating the costing for LogicalWindow (and EnumerableWindow) so 
that it excludes the sorting cost when the input collation matches the 
collation of the window. For instance, if the input is sorted by name, then 
(partition by deptno order by name) can leverage that sort order and basically 
does not need to sort. Note: we can even keep the current Arrays.sort in the 
codebase, and it would just be faster (since TimSort is faster for sorted 
inputs)

 

c) A LogicalProject that has OVER expressions still thinks it keeps the 
collation trait like a regular project. I'm inclined to fall back to the 
"empty" collation when the project has OVER expressions.
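
To illustrate the note in (b): the snippet below is only a rough sketch, not Calcite code. It counts comparator calls made by Arrays.sort on an already-sorted array versus a shuffled one. The class and method names (PresortedSortDemo, names, countComparisons) are made up for this example, and the exact counts depend on the JDK's sort implementation, so only the relative difference matters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical demo class; nothing here comes from the Calcite codebase.
class PresortedSortDemo {
  /** Builds n distinct names in ascending order, optionally shuffled with a fixed seed. */
  static String[] names(int n, boolean shuffle) {
    List<String> list = new ArrayList<>();
    for (int i = 0; i < n; i++) {
      list.add(String.format("name%05d", i));
    }
    if (shuffle) {
      Collections.shuffle(list, new Random(42));
    }
    return list.toArray(new String[0]);
  }

  /** Sorts a copy of the input with Arrays.sort and returns the number of comparator calls. */
  static long countComparisons(String[] input) {
    AtomicLong calls = new AtomicLong();
    Comparator<String> counting = (a, b) -> {
      calls.incrementAndGet();
      return a.compareTo(b);
    };
    String[] copy = input.clone();
    Arrays.sort(copy, counting);
    return calls.get();
  }

  public static void main(String[] args) {
    System.out.println("presorted: " + countComparisons(names(10_000, false)));
    System.out.println("shuffled:  " + countComparisons(names(10_000, true)));
  }
}
```

On a presorted input the comparator should be called far fewer times, which is why keeping Arrays.sort in place would still get cheaper once the input collation matches.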

 

 

The root cause for testDuplicateSortPlanWithOver failure is as follows:

1) The current Calcite code assumes LogicalWindow "does not re-order its input 
rows". This looks strange, since CalcRelSplitter always uses the same 
traitset for all the generated LogicalWindows: 
[https://github.com/apache/calcite/blob/d59b639d27da704f00eff616324a2c04aa06f84c/core/src/main/java/org/apache/calcite/rel/rules/CalcRelSplitter.java#L229]

In other words, it takes the traitSet (i.e. collation of the original calc), 
and assumes each and every produced LogicalWindow would have the same collation.

Note: the traitSet for LogicalWindow is used AS IS. It is not even re-mapped 
to account for projects or anything else. In other words, LogicalWindow asks 
for an unknown input.

2) RelMdCollation.window assumes the window does not reorder rows. Such an 
implementation is of course possible, however it would often end up with an 
additional sort after computation of the window aggregate.

Note: it is not clear what the collation should be for {{sum(salary) over 
(partition by deptno order by name)}}

Implementation-wise, enumerable uses {{SortedMultiMap}} which is {{HashMap<Key, 
List<Row>>}}

In other words, the produced rows are distributed by hash(deptno), and ordered 
by name within each group. I guess a traitSet of "NONE.HASH_DISTRIBUTED.[name]" 
should be a reasonable default for the above {{over}} expression.
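
A minimal sketch of that behaviour, assuming a plain HashMap<Integer, List<Row>> in place of SortedMultiMap. Row, WindowPartitionSketch and partitionAndSort are hypothetical names for illustration only; this is not the EnumerableWindow implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these types exist in Calcite.
class WindowPartitionSketch {
  /** Stand-in row with just the partition key and the order key. */
  static final class Row {
    final int deptno;
    final String name;

    Row(int deptno, String name) {
      this.deptno = deptno;
      this.name = name;
    }

    @Override public String toString() {
      return deptno + ":" + name;
    }
  }

  /** Groups rows by deptno in a HashMap, sorts each group by name, and concatenates the groups. */
  static List<Row> partitionAndSort(List<Row> input) {
    Map<Integer, List<Row>> groups = new HashMap<>();
    for (Row r : input) {
      groups.computeIfAbsent(r.deptno, k -> new ArrayList<>()).add(r);
    }
    List<Row> out = new ArrayList<>();
    // HashMap iteration order is unspecified, so the order of the groups
    // in the output is unrelated to the order of the input rows.
    for (List<Row> group : groups.values()) {
      group.sort(Comparator.comparing((Row r) -> r.name));
      out.addAll(group);
    }
    return out;
  }

  public static void main(String[] args) {
    List<Row> input = Arrays.asList(
        new Row(20, "Bob"), new Row(35, "Eve"),
        new Row(20, "Alice"), new Row(35, "Dan"));
    System.out.println(partitionAndSort(input));
  }
}
```

The only guarantees on the output are that rows of one deptno are contiguous and sorted by name; nothing about the input order survives, which is what a hash distribution plus a [name] collation would describe.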

 

 

 

> Output collation of EnumerableWindow is not consistent with its implementation
> ------------------------------------------------------------------------------
>
>                 Key: CALCITE-2648
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2648
>             Project: Calcite
>          Issue Type: Bug
>    Affects Versions: 1.17.0
>            Reporter: Hongze Zhang
>            Assignee: Julian Hyde
>            Priority: Major
>
> Here is a case:
> {code:sql}
> select x, COUNT(*) OVER (PARTITION BY x) from (values (20), (35)) as t(x) 
> ORDER BY x
> {code}
> Final plan:
> {code:java}
> EnumerableWindow(window#0=[window(partition {0} order by [] range between 
> UNBOUNDED PRECEDING and UNBOUNDED FOLLOWING aggs [COUNT()])])
>   EnumerableValues(tuples=[[{ 20 }, { 35 }]])
> {code}
> Output rows:
> {code:java}
> X  |EXPR$1 |
> ---|-------|
> 35 |1      |
> 20 |1      |
> {code}
> EnumerableWindow is supposed to preserve input collations, as a result 
> EnumerableSort is ignored. However the implementation of EnumerableWindow 
> generates non-ordered output (when PARTITION BY clause is used).



