I will try to find some time the following week to look into the problem/proposal in CALCITE-2593.
I don't like stalling things, but if possible let's wait a bit more. Various
parts of the code indicate that composite traits should not be part of a
RelSubset. I think we should try to maintain this invariant if possible.

On Wed, Apr 17, 2019 at 2:48 AM Hongze Zhang <[email protected]> wrote:

> You are right, removing the collations would only work around what causes
> us to hit the issue on multi-sort. Maybe we'd better not remove them
> (at least for now), since they provide an easy way to test against
> composite traits.
>
> Hongze
>
> > On Apr 17, 2019, at 01:15, Haisheng Yuan <[email protected]> wrote:
> >
> >> it looks like if we want to get these problems fixed quickly we can
> >> just remove EnumerableValues's collation emitting.
> >
> > I am afraid even removing the Values collation enumeration won't
> > actually give us a quick fix, because a multi-sorted table, if there is
> > one, might still run into the same issue as Values.
> >
> > Thanks ~
> > Haisheng Yuan
> > ------------------------------------------------------------------
> > From: Hongze Zhang <[email protected]>
> > Date: 2019-04-16 18:10:17
> > To: <[email protected]>
> > Subject: Re: [DISCUSS] RelCompositeTrait
> >
> > If we narrow the issue scope to Calcite itself, I think the 3 JIRA
> > tickets that Haisheng has listed (CALCITE-2010, CALCITE-2593,
> > CALCITE-2764; thanks, Haisheng!) are all more or less related to the
> > multi-sorted EnumerableValues. And it looks like if we want to get these
> > problems fixed quickly we can just remove EnumerableValues's collation
> > emitting. I recall that (correct me if I am wrong) the rel is not even
> > able to emit descending collations, so I suppose it was not perfect in
> > the first place.
> >
> > Another discussion is about enumerating traits. IMHO it's hard to say
> > that Calcite hasn't already tried to avoid enumerating them.
> > The methods RelCollationImpl#satisfies[1] and
> > RelDistributions#satisfies[2] already test the relationship between
> > traits without checking equality. So it looks like we already tried not
> > to enumerate them, but failed in the end.
> >
> > Regarding the composite traits, one embarrassing thing I can see so far
> > concerns the method RelTraitSet#simplify[3]. The JavaDoc says the method
> > is to "return a trait set similar to this one but with all composite
> > traits flattened". But when we look at the related implementations,
> > RelCollationTraitDef#getDefault[4] and
> > RelDistributionTraitDef#getDefault[5], they don't seem to flatten
> > anything; the traits simply get wiped. This makes me doubt whether it is
> > really correct for RelCollation/RelDistribution to extend
> > RelMultipleTrait, because we can't leverage the trait simplification but
> > are actually hurt by it. If a rel loses its physical property, we can
> > never prevent unnecessary sorts/exchanges from being added.
> >
> > Besides, even if we decide to add some extra sorts/exchanges, that is
> > not easy at present. See CALCITE-2592/CALCITE-2970: the planner does not
> > add them automatically all that smoothly.
> >
> > Overall, regarding these "small" problems, I think none of them is
> > really impossible to solve (yes, coming up with the right solutions may
> > not be that straightforward). But of course, if in future a brand new
> > design can be proposed to improve the entire trait system (such as
> > avoiding enumerating traits), I think that would be a great thing.
> >
> > Best,
> > Hongze
> >
> > [1] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
> > [2] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
> > [3] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
> > [4] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
> > [5] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49
> >
> > ------ Original Message ------
> > From: "Haisheng Yuan" <[email protected]>
> > To: "Jacques Nadeau" <[email protected]>; "Apache Calcite dev list"
> > <[email protected]>
> > Sent: 2019/4/15 12:27:04
> > Subject: Re: Re: [DISCUSS] RelCompositeTrait
> >
> >>> There are major challenges with asking for particular traits as well.
> >>> Imagine a desired aggregate on 7 columns. What does the requestor
> >>> request with regards to distribution? All seven columns? One column?
> >>> Some combination in between?
> >>
> >> The same challenges exist for enumerating all the traits as well.
> >> Imagine there is an order by the 7 grouping keys on top of the
> >> aggregate on 7 columns, but with different sort directions:
> >>
> >>   select * from foo group by a,b,c... order by c desc, a asc, b desc...
> >>
> >> What sort order and direction should the sort-based stream aggregate
> >> provide? All ascending, all descending, order (a,b,c...),
> >> order (..c,b,a), or all the combinations? All of those enumerated
> >> traits are useless except one; for the others, an additional sort
> >> operator will be needed.
> >>
> >> Another example is an aggregate on top of a join, where the join is on
> >> 7 keys and the aggregate is on 2 of the join keys. In a distributed
> >> system, what distribution trait would the join operator provide? The 2
> >> grouping keys? All the join keys? All the combinations?
> >>
> >> Enumerating some or all of the deliverable traits is not
> >> purpose-driven: all of the traits may be useless to the parent
> >> operator. On the other hand, asking the child operator for particular
> >> traits is purpose-driven; at least the traits requested by the parent
> >> operator are worth considering, so it is not as wasteful as the former.
> >>
> >> If I understand RelCompositeTrait's intent correctly, the enumerated
> >> traits, whether some or all of the combinations, should be saved
> >> there. But in fact that does not seem to be the case. And as Jacques
> >> mentioned, many people rely on RelMetadata operations to pull up the
> >> traitsets through operators.
> >>
> >> This makes me wonder whether there are any real use cases or systems
> >> that rely on RelCompositeTrait. If someone has the story, we would
> >> love to hear it.
> >>
> >> Putting that aside, even if RelCompositeTrait is indispensable, why do
> >> we bother optimizing the Values node? For Values with a few tuples, it
> >> is not worth optimizing; with many tuples, it may take more time to
> >> enumerate the RelCollations than to just sort it. Specifically for
> >> Values with 0 or 1 tuple but many columns, it is definitely not worth
> >> the optimization, because the sort removal rule and the empty rel
> >> removal rule should do the work.
> >>
> >> Thanks ~
> >> Haisheng Yuan
> >> ------------------------------------------------------------------
> >> From: Jacques Nadeau <[email protected]>
> >> Date: 2019-04-15 07:36:51
> >> To: <[email protected]>
> >> Subject: Re: [DISCUSS] RelCompositeTrait
> >>
> >> There are major challenges with asking for particular traits as well.
> >> Imagine a desired aggregate on 7 columns.
> >> What does the requestor request with regards to distribution? All
> >> seven columns? One column? Some combination in between? The trait
> >> system in Calcite is very challenging to work with because it is up to
> >> downstream users to try to figure out trait propagation outside the
> >> core. So challenging that I believe many people move to relying on
> >> RelMetadata operations, since those can be pulled across several
> >> operators at once.
> >>
> >> It would be great if someone could spend the time to come up with a
> >> more global design for these items so that we avoid solving one-off
> >> problems: rationalizing when something should be a trait, how to avoid
> >> trait planning cost explosion, how to propagate, when something should
> >> be handled via RelMetadataQuery, when something should be managed via
> >> traits versus materialized view alternatives, etc.
> >>
> >> An example of overlapping functionality I'd start with is: should
> >> multitraits for collation really exist, or would exposing these as
> >> materialized view alternatives be more appropriate? Why is it
> >> necessary to have a 'shortcut' for this situation while other
> >> alternatives don't have one?
> >>
> >> On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <[email protected]> wrote:
> >>
> >>> It seemed reasonable when I introduced it, and it still seems very
> >>> reasonable, that a relational expression (even in the relational
> >>> model) can have multiple physical properties.
> >>> Consider these questions that the planner might ask:
> >>>
> >>> Example 1:
> >>>
> >>> “Are you sorted on hiredate?”
> >>> “Yes”
> >>> “Are you sorted on empno?”
> >>> “Yes”
> >>> “Are you sorted on deptno?”
> >>> “No”
> >>>
> >>> Example 2:
> >>>
> >>> “Can you fit into less than 100MB of memory?”
> >>> “Yes”
> >>> “Can you fit into less than 10MB of memory?”
> >>> “Yes”
> >>> “Can you fit into less than 1MB of memory?”
> >>> “No”
> >>>
> >>> We manage traits like those in example 1 using RelCompositeTrait. We
> >>> can’t handle traits like those in example 2, and so we have trained
> >>> ourselves not to think of “can fit into memory X” as a trait at all.
> >>>
> >>> Perhaps our mistake is to have an API “tell me all of your traits”
> >>> rather than an API “do you have trait X?”. Asking a RelNode to
> >>> enumerate its traits can be painful: the extreme case is an empty
> >>> Values with 100 columns; it satisfies any sort order, and there are
> >>> 100! of these.
> >>>
> >>> Julian
> >>>
> >>>> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <[email protected]>
> >>>> wrote:
> >>>>
> >>>> Hi Haisheng,
> >>>>
> >>>> Thanks for raising awareness around this topic. I also think we
> >>>> should try to find a solution.
> >>>>
> >>>> Initially, the Volcano planner was designed to be able to cover
> >>>> multiple models (and not only the relational one). For
> >>>> non-relational models composite traits may be indispensable. I don't
> >>>> know if there are people on this list who are using the planner for
> >>>> other models, but if there are it would be nice to hear from them.
> >>>>
> >>>> Focusing exclusively on the relational model, I think composite
> >>>> traits are useful. One use case that comes to mind is data
> >>>> replication. It makes perfect sense to partition (distribute) your
> >>>> table on two (or more) columns to be able to efficiently execute
> >>>> queries using special partition joins.
> >>>> A concrete use case is RDF data, where many distributed systems
> >>>> store the triples table partitioned by subject and by object. I
> >>>> guess such use cases could possibly be modelled in other ways, but
> >>>> composite traits are what come naturally to my mind.
> >>>>
> >>>> Regarding multi-sorted tables, they are not that rare: for example,
> >>>> you get one if you import sorted data into a table with an
> >>>> auto-increment primary key.
> >>>>
> >>>> I think all the trait-related issues can be solved if we prioritize
> >>>> them correctly. Apart from Vladimir and Hongze, who have already
> >>>> spent quite some time on these, the rest of us should also jump in
> >>>> and try to help.
> >>>>
> >>>> Best,
> >>>> Stamatis
> >>>>
> >>>> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I found there are some RelCompositeTrait-related issues:
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2010
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2593
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2764
> >>>>>
> >>>>> Multi-sorted tables are rare in practice, and multi-distributed
> >>>>> tables don't exist either. A Values node with a few tuples is not
> >>>>> worth optimizing, and one with many tuples is not worth optimizing
> >>>>> either, because the time the optimizer takes to figure out the
> >>>>> ordering may be longer than just sorting it at runtime.
> >>>>>
> >>>>> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
> >>>>> Leo extended RelDistribution to inherit RelMultipleTrait, just like
> >>>>> RelCollation does, to solve his problem in the example. But I don't
> >>>>> think this is an appropriate way to represent the equivalence
> >>>>> classes (in PostgreSQL's terms).
> >>>>>
> >>>>> So why did we introduce RelCompositeTrait and RelMultipleTrait in
> >>>>> the beginning? It seems to give us more pain than gain.
> >>>>>
> >>>>> Thanks ~
> >>>>> Haisheng Yuan
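
To make the difference between "tell me all of your traits" and "do you have
trait X?" from Julian's examples a bit more concrete, here is a minimal
sketch in plain Python. It is not Calcite code and does not claim to mirror
RelCollationImpl#satisfies: the names satisfies, MultiSortedRel, is_sorted_on
and count_full_orderings are hypothetical, and it assumes a collation is just
a sequence of (column, direction) pairs that satisfies a required order when
the required order is a prefix of it.

```python
from math import factorial


def satisfies(actual, required):
    """Return True if data sorted by `actual` is also sorted by `required`.

    Assumption for this sketch: a collation is a sequence of
    (column, direction) pairs, and `required` is satisfied when it is a
    prefix of `actual`.
    """
    n = len(required)
    return n <= len(actual) and tuple(actual[:n]) == tuple(required)


class MultiSortedRel:
    """Hypothetical stand-in for a rel that is sorted in several ways."""

    def __init__(self, collations):
        self.collations = [tuple(c) for c in collations]

    def is_sorted_on(self, required):
        # "Do you have trait X?": test the request against each known
        # collation instead of listing every order the rel could satisfy.
        return any(satisfies(c, required) for c in self.collations)


def count_full_orderings(n_columns, n_directions=2):
    """Count the full-length orderings an enumerating approach would face:
    every permutation of the columns times every choice of direction."""
    return factorial(n_columns) * n_directions ** n_columns


# Example 1 from Julian's message: sorted on hiredate and on empno only.
emp = MultiSortedRel([[("hiredate", "ASC")], [("empno", "ASC")]])
on_hiredate = emp.is_sorted_on([("hiredate", "ASC")])  # True
on_deptno = emp.is_sorted_on([("deptno", "ASC")])      # False

# The 7 grouping keys from Haisheng's example, ascending or descending.
seven_key_orderings = count_full_orderings(7)          # 7! * 2**7 = 645120
```

Under these assumptions the parent asks about exactly one collation, while
enumerating every full ordering of 7 keys with two directions would mean
645120 candidates, of which the parent needs one.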
