Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that those discussions raised several ideas and claims, such as that (1) we can get rid of "simplify" altogether, (2) composite traits are rare in practice, and (3) composite traits are not designed well in the first place [1]. I do not have the full picture in my head, so I'll try to share some thoughts to advance the discussion.

Regarding (1), I think that removing "simplify" may help with my particular (and pretty simple) test, but might lead to unpredictable results for more complicated queries. Suppose that we generate two equivalent nodes with different traits: [a] and [a][b]. Depending on the nature of the trait def, these two nodes might or might not belong to the same subset. For example, [a] and [a][b] are different subsets for RelCollation. At the same time, [a] and [a][b] could belong to the same subset for some distributions. That is, if the input is hash-distributed by either [a] or [b], it might imply that a==b for every tuple (otherwise, the hashes would not match), and therefore every RelNode in the RelSet that is sharded by [a] is also sharded by [b], and vice versa. The idea is similar to transitive predicates. So ideally, we should let the RelTraitDef define how to compare composite traits with other traits. Otherwise, we may lose some optimization opportunities.

Regarding (2), perhaps multi-collation nodes are really rare in practice. But nodes with multiple hash distributions are widespread in distributed engines, because the collocated hash equijoin is the most common way of joining two inputs in distributed systems, and such a join always produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas on the proper design of composite traits. The composite trait mechanics mentioned in the RelSubset Javadoc are not a good design choice for distribution traits. That is, if we have a node that is distributed by [a][b], we cannot just put it into two subsets [a] and [b], because parent operators may require both [a] and [b]; otherwise, unnecessary exchanges could appear. That is, [a][b] should be propagated together. For example, in the plan below, removing SHARDED[a1] from #1 would add an exchange between #2 and #1, and removing SHARDED[b1] from #1 would add an exchange between #3 and #2. Neither is optimal.
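
To make the point about (3) a bit more concrete, here is a rough Java sketch of the reasoning. It is only an illustration under my own assumptions: the classes Sharded and CompositeDistributionSketch and their methods are hypothetical and are not Calcite API, and a distribution is modeled as nothing more than a set of column names. It checks each parent requirement in turn and reports where the first extra exchange would appear.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Hypothetical walk over the three-operator plan; not Calcite code. */
class CompositeDistributionSketch {

    /** Returns where the first extra exchange appears for a given trait of join #1. */
    static String firstExchange(Sharded join1) {
        // Join #2 (a.a1 = c.c1) needs its left input sharded by a1.
        if (!join1.satisfies("a1")) {
            return "between #2 and #1";
        }
        // Assumption: a collocated equi-join keeps its input distributions and adds c1.
        Sharded join2 = join1.plus("c1");
        // Aggregate #3 (group by b1) needs its input sharded by b1.
        if (!join2.satisfies("b1")) {
            return "between #3 and #2";
        }
        return "none";
    }

    public static void main(String[] args) {
        Sharded full = new Sharded("a1", "b1"); // composite trait delivered by #1
        System.out.println("both kept:  " + firstExchange(full));
        System.out.println("a1 removed: " + firstExchange(full.without("a1")));
        System.out.println("b1 removed: " + firstExchange(full.without("b1")));
    }
}

/** Hypothetical composite distribution trait: the columns a node is hash-sharded by. */
final class Sharded {
    private final Set<String> keys;

    Sharded(String... keys) {
        this.keys = new LinkedHashSet<>(Arrays.asList(keys));
    }

    private Sharded(Set<String> keys) {
        this.keys = keys;
    }

    /** True if a requirement SHARDED[requiredKey] is met by any member distribution. */
    boolean satisfies(String requiredKey) {
        return keys.contains(requiredKey);
    }

    /** Copy with one member dropped, as when the composite is split into single subsets. */
    Sharded without(String key) {
        Set<String> rest = new LinkedHashSet<>(keys);
        rest.remove(key);
        return new Sharded(rest);
    }

    /** Copy with one more member distribution added. */
    Sharded plus(String key) {
        Set<String> more = new LinkedHashSet<>(keys);
        more.add(key);
        return new Sharded(more);
    }

    @Override
    public String toString() {
        return "SHARDED" + keys;
    }
}
```

Only the composite trait that keeps both members gets through the plan without an extra exchange in this toy model. The plan in question: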

3: Aggregate[group=b1]
  2: Join[a.a1=c.c1] // SHARDED[a1], SHARDED[b1], SHARDED[c1]
    1: Join[a.a1=b.b1] // SHARDED[a1], SHARDED[b1]

@Haisheng Yuan <[email protected]>, following your comment [1], would you mind sharing your ideas on the proper design of composite traits? Are composite traits implemented in Orca?

Regards,
Vladimir.

[1] https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

Tue, May 25, 2021 at 19:32, Vladimir Sitnikov <[email protected]>:

> >VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify
>
> Vladimir, have you tried removing that RelTraitSet.simplify?
>
> I remember I have run into that multiple times already, and I suggested
> removing that "simplify".
> For example,
>
> https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377
>
>
> Vladimir
>
