Hi Vladimir,

Thank you for the link. It is very relevant to my problem. I see that in
these discussions, there were several ideas and claims, such as that (1) we
can get rid of "simplify" altogether, (2) composite traits are rare in
practice, (3) composite traits are not designed well in the first place
[1]. I do not have the full picture in my head, so I'll try to share some
thoughts to advance the discussion.

Regarding (1), I think that the removal of "simplify" may help with my
particular (and pretty simple) test but might lead to some unpredictable
results for more complicated queries. Suppose that we generate two
equivalent nodes with different traits: [a] and [a][b]. Depending on the
nature of the trait def, these two nodes might or might not belong to the
same subset. For example, [a] and [a][b] are different subsets for
RelCollation. At the same time, [a] and [a][b] could belong to the same
subset for some distributions. That is, if the input is hash-distributed by
either [a] or [b], it might imply that a==b for every tuple (otherwise,
hashes will not match), and therefore every RelNode in the RelSet that is
sharded by [a] is also sharded by [b] and vice versa. The idea is similar to
transitive predicates. So ideally, we should let the RelTraitDef define how
to compare composite traits with other traits. Otherwise, we may lose some
optimization opportunities.
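
To make the point concrete, here is a minimal sketch of trait-specific
comparison. It is a hypothetical model, not Calcite API: the class name, the
method names, and the use of plain column-name sets are all my assumptions.
A composite trait [a][b] is modeled as the set {a, b}, and the known
equality a==b is passed in explicitly.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model, not Calcite API: the trait kind decides how composite traits compare. */
final class CompositeTraitSketch {
  /** Builds a composite trait from single-column values, e.g. keys("a", "b") models [a][b]. */
  static Set<String> keys(String... columns) {
    return new TreeSet<>(Arrays.asList(columns));
  }

  /** Collation-like rule: [a] and [a][b] are different trait values. */
  static boolean sameSubsetForCollation(Set<String> left, Set<String> right) {
    return left.equals(right);
  }

  /** Distribution-like rule: compare after expanding keys through known equalities. */
  static boolean sameSubsetForDistribution(
      Set<String> left, Set<String> right, Set<Set<String>> equalColumns) {
    return expand(left, equalColumns).equals(expand(right, equalColumns));
  }

  /** Adds every column known to be equal to one of the keys (classes assumed disjoint). */
  static Set<String> expand(Set<String> keys, Set<Set<String>> equalColumns) {
    Set<String> result = new TreeSet<>(keys);
    for (Set<String> equalClass : equalColumns) {
      if (!Collections.disjoint(equalClass, keys)) {
        result.addAll(equalClass);
      }
    }
    return result;
  }
}
```

With the equality class {a, b} supplied, the distribution-like rule treats
[a] and [a][b] as the same subset, while the collation-like rule keeps them
apart. Without that equality, both rules keep them apart.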

Regarding (2), perhaps the multi-collation nodes are really rare in
practice. But nodes with multiple hash distributions are widespread in
distributed engines, because the collocated hash equijoin is the most
common way of joining two inputs in distributed systems, and such a join
always produces an additional distribution.

Regarding (3), it would be very interesting to hear suggestions and ideas
on the proper design of composite traits. The composite trait mechanics
described in the RelSubset Javadoc are not a good design choice for
distribution traits. That is, if we have a node that is distributed by
[a][b], we cannot just put it into two subsets [a] and [b], because
parent operators may require both [a] and [b]; otherwise, unnecessary
exchanges could appear. That is, [a][b] should be propagated together. For
example, in the plan below, the removal of SHARDED[a1] from #1 would add an
exchange between #2 and #1, and the removal of SHARDED[b1] from #1 would
add an exchange between #3 and #2. Neither is optimal.

3: Aggregate[group=b1]
2:   Join[a.a1=c.c1]   // SHARDED[a1], SHARDED[b1], SHARDED[c1]
1:     Join[a.a1=b.b1] // SHARDED[a1], SHARDED[b1]
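
The plan above can be checked with a small sketch. Again, this is a
hypothetical model and not Calcite API: I assume that join #2 requires its
input from #1 to be sharded by a1, that aggregate #3 requires its input to
be sharded by b1, and that join #2 passes the keys of #1 through and adds
c1 when no exchange sits below it.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical model of the three-operator plan; not Calcite API. */
final class ExchangeSketch {
  /** Builds the set of sharding keys of a node, e.g. keys("a1", "b1"). */
  static Set<String> keys(String... columns) {
    return new TreeSet<>(Arrays.asList(columns));
  }

  /** An exchange is needed when the input is not already sharded by the required column. */
  static boolean needsExchange(Set<String> inputKeys, String requiredKey) {
    return !inputKeys.contains(requiredKey);
  }

  /** Output keys of join #2 when no exchange sits below it: the keys of #1 plus c1. */
  static Set<String> join2Output(Set<String> join1Keys) {
    Set<String> out = new TreeSet<>(join1Keys);
    out.add("c1");
    return out;
  }
}
```

If #1 keeps both keys, neither check asks for an exchange. If #1 keeps only
b1, needsExchange(keys("b1"), "a1") is true, which is the exchange between
#2 and #1. If #1 keeps only a1, needsExchange(join2Output(keys("a1")), "b1")
is true, which is the exchange between #3 and #2.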

@Haisheng Yuan <[email protected]>, following your comment [1], would
you mind providing your ideas around the proper design of composite traits?
Are composite traits implemented in Orca?

Regards,
Vladimir.

[1]
https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=17081984&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17081984

On Tue, May 25, 2021 at 19:32, Vladimir Sitnikov <[email protected]> wrote:

> >VolcanoPlanner flattens the composite trait into
> the default trait value in RelSet.add -> RelTraitSet.simplify
>
> Vladimir, have you tried removing that RelTraitSet.simplify?
>
> I remember I have run into that multiple times already, and I suggested
> removing that "simplify".
> For example,
>
> https://issues.apache.org/jira/browse/CALCITE-2593?focusedCommentId=16750377&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16750377
>
>
> Vladimir
>
