You are right: removing the collations would only work around what led us to find the issue with multi-sort. Maybe we'd better not remove them (at least for now), since they provide an easy way to test against composite traits.
Hongze

> On Apr 17, 2019, at 01:15, Haisheng Yuan <h.y...@alibaba-inc.com> wrote:
>
>> it looks like if we want to get these problems fixed quickly we can just
>> remove EnumerableValues's collation emitting.
>
> I am afraid even removing Values collation enumeration won't actually give us
> a quick fix, because a multi-sorted table, if there is one, might still
> encounter the same issue as Values.
>
> Thanks ~
> Haisheng Yuan
> ------------------------------------------------------------------
> From: Hongze Zhang <notify...@126.com>
> Date: 2019-04-16 18:10:17
> To: <dev@calcite.apache.org>
> Subject: Re: [DISCUSS] RelCompositeTrait
>
> If we minimize the issue scope to Calcite itself, I think the 3 JIRA
> tickets that Haisheng has listed (thanks, Haisheng!), CALCITE-2010,
> CALCITE-2593 and CALCITE-2764, are all more or less related to the
> multi-sorted EnumerableValues. And it looks like if we want to get these
> problems fixed quickly we can just remove EnumerableValues's collation
> emitting. I recall that (correct me if I am wrong) the rel is not even
> able to emit descending collations, so I suppose it was not perfect in
> the first place.
>
> Another discussion is about enumerating traits. IMHO it's hard to
> say that Calcite hasn't already tried to avoid enumerating them. The
> methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
> already test the relationship between traits without
> checking equality. So the whole thing looks like we already tried
> not to enumerate them but failed in the end.
>
> Regarding the composite traits, one embarrassing thing I can see so far
> is the method RelTraitSet#simplify[3]. The JavaDoc says the method
> is to "return a trait set similar to this one but with all composite
> traits flattened". But when we look into the related implementations
> RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
> they seem not to flatten anything; the traits simply get wiped.
> This makes me worry about whether it is really correct that
> RelCollation/RelDistribution extends RelMultipleTrait, because we can't
> leverage the trait simplification but are actually hurt by it. If a rel
> loses its physical property, we can never avoid adding
> unnecessary sorts/exchanges.
>
> Besides, even if we decide to add some extra sorts/exchanges, that is
> not easy so far. See CALCITE-2592/CALCITE-2970: the planner does
> not add them automatically that smoothly.
>
> Overall, regarding these "small" problems, I think none of them is
> really impossible to solve (yes, coming up with the right solutions may
> not be that straightforward). But of course, if in the future a brand new
> design can be proposed to improve the entire trait system (such as
> avoiding enumerating traits), I think that would be a great thing.
>
> Best,
> Hongze
>
>
> [1] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
> [2] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
> [3] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
> [4] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
> [5] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49
>
> ------ Original Message ------
> From: "Haisheng Yuan" <h.y...@alibaba-inc.com>
> To: "Jacques Nadeau" <jacq...@apache.org>; "Apache Calcite dev list" <dev@calcite.apache.org>
> Sent: 2019/4/15 12:27:04
> Subject: Re: Re: [DISCUSS] RelCompositeTrait
>
>>> There are major challenges with asking for particular traits as well.
>>> Imagine a desired aggregate on 7 columns. What does the requestor request
>>> with regards to distribution? All seven columns? One column? Some
>>> combination in between?
>>
>> The same challenges exist for enumerating all the traits as well. Imagine
>> there is an order by on the 7 grouping keys on top of the aggregate on 7
>> columns, but with different sort directions:
>> select * from foo group by a,b,c... order by c desc, a asc, b desc...
>> What sort order and direction should the sort-based stream aggregate provide?
>> All ascending, all descending, order (a,b,c...), order(..c,b,a), or all the
>> combinations?
>> All of those enumerated traits are useless except one; for the others, an
>> additional sort operator will be needed.
>>
>> Another example is an aggregate on top of a join, where the join is on 7 keys
>> and the aggregate is on 2 of the join keys. In a distributed system, what
>> distribution trait would the join operator provide? The 2 grouping keys? All
>> the join keys? All the combinations?
>>
>> Enumerating some/all of the deliverable traits is not purpose driven. All the
>> traits may be just useless for the parent operator. On the other hand, asking
>> the child operator for particular traits is purpose driven: at least the
>> traits asked for by the parent operator are worth consideration, and this is
>> not as wasteful as the former.
>>
>> If I understand RelCompositeTrait's intent correctly, the enumerated traits,
>> no matter whether some combinations or all combinations, should be saved
>> here. But in fact, it seems they are not. And as Jacques mentioned, many
>> people rely on RelMetadata operations to pull up the traitsets through
>> operators.
>>
>> This makes me curious whether there are any true use cases or systems
>> that rely on RelCompositeTrait. If someone has the story, we would love to
>> hear it.
>>
>> Put that aside, even if RelCompositeTrait is indispensable, why do we bother
>> optimizing the Values node?
>> For Values with several tuples, it is not worth optimizing; with
>> many tuples, it may take more time to enumerate the RelCollation than to
>> just sort it.
>> Specifically for Values with 0 or 1 tuple but with many columns, it is
>> definitely not worth the optimization, because the sort removal rule and the
>> empty rel removal rule should do the work.
>>
>>
>> Thanks ~
>> Haisheng Yuan
>> ------------------------------------------------------------------
>> From: Jacques Nadeau <jacq...@apache.org>
>> Date: 2019-04-15 07:36:51
>> To: <dev@calcite.apache.org>
>> Subject: Re: [DISCUSS] RelCompositeTrait
>>
>> There are major challenges with asking for particular traits as well.
>> Imagine a desired aggregate on 7 columns. What does the requestor request
>> with regards to distribution? All seven columns? One column? Some
>> combination in between? The trait system in Calcite is very challenging to
>> work with because it is up to downstream users to try to figure out trait
>> propagation outside the core. So challenging, that I believe that many
>> people move to relying on RelMetadata operations since those can be pulled
>> across several operators at once.
>>
>> It would be great if someone could spend the time to come up with a more
>> global design for these items and we avoid solving one-off problems.
>> Rationalizing when something should be a trait, how to avoid trait planning
>> cost explosion, how to propagate, when something should be handled via
>> RelMetadataQuery, when something should be managed via traits versus
>> materialized view alternatives, etc.
>>
>> An example of overlapping functionality I'd start with is: should
>> multitraits for collation really exist or would exposing these as
>> materialized view alternatives be more appropriate? Why is it necessary to
>> have a 'shortcut' for this situation while other alternatives don't have
>> one?
>>
>> On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <jh...@apache.org> wrote:
>>
>>> It seemed reasonable when I introduced it, and still seems very reasonable,
>>> that a relational expression (even in the relational model) can have
>>> multiple physical properties. Consider these questions that the planner
>>> might ask:
>>>
>>> Example 1:
>>>
>>> “Are you sorted on hiredate?”
>>> “Yes”
>>> “Are you sorted on empno?”
>>> “Yes”
>>> “Are you sorted on deptno?”
>>> “No”
>>>
>>> Example 2:
>>>
>>> “Can you fit into less than 100MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 10MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 1MB of memory?”
>>> “No”
>>>
>>> We manage traits like those in example 1 using RelCompositeTrait. We can’t
>>> handle traits like those in example 2, and so we have trained ourselves to
>>> not think of “can fit into memory X” as a trait at all.
>>>
>>> Perhaps our mistake is to have an API “tell me all of your traits” rather
>>> than an API “do you have trait X?”. Asking a RelNode to enumerate its
>>> traits can be painful: the extreme case is an empty Values with 100
>>> columns; it satisfies any sort order, and there are 100! of these.
>>>
>>> Julian
>>>
>>>
>>>
>>>> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <zabe...@gmail.com> wrote:
>>>>
>>>> Hi Haisheng,
>>>>
>>>> Thanks for raising awareness around this topic. I also think we should try
>>>> to find a solution.
>>>>
>>>> Initially, the Volcano planner was designed to be able to cover multiple
>>>> models (and not only the relational). For non-relational models composite
>>>> traits may be indispensable. I don't know if there are people on this list
>>>> that are using the planner for other models but if there are it would be
>>>> nice to hear from them.
>>>>
>>>> Focusing exclusively on the relational model, I think composite traits are
>>>> useful. One use-case that comes to my mind is data replication.
>>>> It perfectly makes sense to partition (distribute) your table on two (or
>>>> more) columns to be able to execute queries efficiently using special
>>>> partition joins. A concrete use-case is RDF data, where many distributed
>>>> systems store the triples table partitioned by subject and object. I guess
>>>> such use-cases could possibly be modelled in other ways but composite
>>>> traits are what comes naturally to my mind.
>>>>
>>>> Regarding multi-sorted tables, they are not that rare: for example, if you
>>>> import sorted data into a table with an auto-increment primary key.
>>>>
>>>> I think all the trait-related issues can be solved if we prioritize them
>>>> correctly. Apart from Vladimir and Hongze, who already spent quite some
>>>> time on these, the rest of us should also jump in and try to help.
>>>>
>>>> Best,
>>>> Stamatis
>>>>
>>>>
>>>> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h.y...@alibaba-inc.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I found there are some RelCompositeTrait related issues:
>>>>> https://issues.apache.org/jira/browse/CALCITE-2010
>>>>> https://issues.apache.org/jira/browse/CALCITE-2593
>>>>> https://issues.apache.org/jira/browse/CALCITE-2764
>>>>>
>>>>> Multi-sorted tables are rare in practice, and multi-distributed tables
>>>>> don't exist either. A Values node with several tuples is not worth
>>>>> optimizing, and one with many tuples is not worth optimizing either,
>>>>> because the time it takes the optimizer to figure out the ordering may
>>>>> be longer than just sorting it at runtime.
>>>>>
>>>>> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
>>>>> Leo extended RelDistribution to inherit RelMultipleTrait, just like
>>>>> RelCollation does, to solve his problem in the example. But I don't think
>>>>> this is an appropriate way to represent the equivalence classes (in
>>>>> PostgreSQL's terms).
>>>>>
>>>>> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
>>>>> beginning?
>>>>> Seems like it gives us more pain than gain.
>>>>>
>>>>> Thanks ~
>>>>> Haisheng Yuan
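
To make the contrast in Julian's message concrete, here is a minimal sketch in plain Java, with no Calcite dependency, of the difference between "tell me all of your traits" and "do you have trait X?". Everything in it is an assumption made for illustration: the class TraitQuerySketch and its methods are hypothetical names, sort keys are modelled as plain lists of column names, and the prefix test is only loosely inspired by the kind of non-equality check that RelCollationImpl#satisfies is described as doing above. It is not how Calcite implements traits.

```java
import java.math.BigInteger;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Calcite.
public class TraitQuerySketch {

    // "Tell me all of your traits": an empty relation with n columns
    // satisfies every full ordering of its columns, and there are n! of them.
    static BigInteger countFullOrderings(int columns) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= columns; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // "Do you have trait X?": true if the required sort key is a prefix
    // of at least one sort key the relation is known to deliver.
    static boolean satisfies(List<List<String>> knownSortKeys, List<String> required) {
        for (List<String> key : knownSortKeys) {
            if (key.size() >= required.size()
                    && key.subList(0, required.size()).equals(required)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Julian's Example 1: sorted on hiredate and on empno, not on deptno.
        List<List<String>> emp = List.of(List.of("hiredate"), List.of("empno"));
        System.out.println(satisfies(emp, List.of("hiredate"))); // true
        System.out.println(satisfies(emp, List.of("empno")));    // true
        System.out.println(satisfies(emp, List.of("deptno")));   // false

        // Enumeration cost grows factorially with the column count.
        System.out.println(countFullOrderings(5));   // 120
        System.out.println(countFullOrderings(100)); // the 100! from the thread
    }
}
```

The question-style check costs time proportional to the number of sort keys actually recorded, while the enumeration-style answer for an empty 100-column Values has to materialize a factorial number of orderings before anyone has asked for one.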