Re: [DISCUSS] RelCompositeTrait

Stamatis Zampetakis Sat, 20 Apr 2019 00:17:42 -0700

I will try to find some time next week to look into the
problem/proposal in CALCITE-2593.

I don't like stalling things, but if possible let's wait a bit longer.
Various parts of the code indicate that composite traits should not be
part of a RelSubset. I think we should try to maintain this invariant if
possible.

On Wed, Apr 17, 2019 at 2:48 AM Hongze Zhang <[email protected]> wrote:

> You are right: removing the collations would only work around what led
> us to find the issue with multi-sort. Maybe we'd better not remove them
> (at least for now), since they provide an easy way to test composite
> traits.
>
> Hongze
>
> > On Apr 17, 2019, at 01:15, Haisheng Yuan <[email protected]> wrote:
> >
> >> it looks like if we want to get these problems fixed quickly we can
> >> just remove EnumerableValues's collation emitting.
> >
> > I am afraid that even removing the collation enumeration from Values
> > won't actually be a quick fix, because a multi-sorted table, if there
> > is one, might still hit the same issue as Values.
> >
> >
> > Thanks ~
> > Haisheng Yuan
> > ------------------------------------------------------------------
> > From: Hongze Zhang<[email protected]>
> > Date: 2019-04-16 18:10:17
> > To: <[email protected]>
> > Subject: Re: [DISCUSS] RelCompositeTrait
> >
> > If we narrow the scope to Calcite itself, I think the 3 JIRA tickets
> > that Haisheng has listed (CALCITE-2010, CALCITE-2593, CALCITE-2764;
> > thanks, Haisheng!) are all more or less related to the multi-sorted
> > EnumerableValues. And it looks like if we want to get these problems
> > fixed quickly we can just remove EnumerableValues's collation
> > emitting. I recall (correct me if I am wrong) that the rel cannot even
> > emit descending collations, so I suppose it was not perfect to begin
> > with.
> >
> > Another discussion is about enumerating traits. IMHO it's hard to say
> > that Calcite hasn't already tried to avoid enumerating them. The
> > methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
> > already test the relationship between traits without checking
> > equality. So it looks like we already tried not to enumerate them but
> > failed in the end.
> >
> > Regarding the composite traits, one awkward thing I can see so far
> > concerns the method RelTraitSet#simplify[3]. The JavaDoc says the
> > method is to "return a trait set similar to this one but with all
> > composite traits flattened". But when we look into the related
> > implementations
> > RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
> > they do not seem to flatten anything; the traits simply get wiped.
> > This makes me wonder whether it is really correct that
> > RelCollation/RelDistribution extend RelMultipleTrait, because we can't
> > leverage the trait simplification but are actually hurt by it. If a rel
> > loses its physical property, we can never avoid adding unnecessary
> > sorts/exchanges.
> >
> > Besides, even if we decide to add some extra sorts/exchanges, that is
> > not easy at the moment. See CALCITE-2592/CALCITE-2970: the planner does
> > not yet add them automatically in a smooth way.
> >
> > Overall, regarding these "small" problems, I think none of them is
> > really impossible to solve (though coming up with the right solutions
> > may not be straightforward). But of course, if in the future a brand
> > new design is proposed to improve the entire trait system (for example
> > by avoiding trait enumeration), I think that would be a great thing.
> >
> > Best,
> > Hongze
> >
> >
> > [1] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
> > [2] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
> > [3] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
> > [4] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
> > [5] https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49
> >
> > ------ Original Message ------
> > From: "Haisheng Yuan" <[email protected]>
> > To: "Jacques Nadeau" <[email protected]>; "Apache Calcite dev list"
> > <[email protected]>
> > Sent: 2019/4/15 12:27:04
> > Subject: Re: Re: [DISCUSS] RelCompositeTrait
> >
> >>> There are major challenges with asking for particular traits as well.
> >>> Imagine a desired aggregate on 7 columns. What does the requestor
> >>> request with regards to distribution? All seven columns? One column?
> >>> Some combination in between?
> >>
> >> The same challenges exist for enumerating all the traits as well.
> >> Imagine there is an order by on the 7 grouping keys on top of the
> >> aggregate on 7 columns, but with different sort directions:
> >> select * from foo group by a,b,c... order by c desc, a asc, b desc...
> >> What sort order and direction should the sort-based stream aggregate
> >> provide? All ascending, all descending, order (a,b,c...),
> >> order (..c,b,a), or all the combinations? All of those enumerated
> >> traits are useless except one; for the others, an additional sort
> >> operator will be needed.
> >>
> >> Another example is an aggregate on top of a join, where the join is on
> >> 7 keys and the aggregate is on 2 of the join keys. In a distributed
> >> system, what distribution trait would the join operator provide? The 2
> >> grouping keys? All the join keys? All the combinations?
> >>
> >> Enumerating some or all of the deliverable traits is not
> >> purpose-driven. All the traits may be useless for the parent operator.
> >> On the other hand, asking the child operator for particular traits is
> >> purpose-driven: at least the traits asked for by the parent operator
> >> are worth considering, so it is not as wasteful as the former.
> >>
> >> If I understand RelCompositeTrait's intent correctly, the enumerated
> >> traits, whether some combinations or all combinations, should be saved
> >> there. But in fact, it seems they are not. And as Jacques mentioned,
> >> many people rely on RelMetadata operations to pull up the traitsets
> >> through operators.
> >>
> >> This makes me wonder whether there are any real use cases or systems
> >> that rely on RelCompositeTrait. If someone has a story, we would love
> >> to hear it.
> >>
> >> Putting that aside, even if RelCompositeTrait is indispensable, why do
> >> we bother optimizing the Values node? For Values with several tuples,
> >> it is not worth optimizing; with many tuples, it may take more time to
> >> enumerate the RelCollation than to just sort it. Specifically for
> >> Values with 0 or 1 tuple but many columns, it is definitely not worth
> >> the optimization, because the sort removal rule and the empty rel
> >> removal rule should do the work.
> >>
> >>
> >> Thanks ~
> >> Haisheng Yuan
> >> ------------------------------------------------------------------
> >> From: Jacques Nadeau<[email protected]>
> >> Date: 2019-04-15 07:36:51
> >> To: <[email protected]>
> >> Subject: Re: [DISCUSS] RelCompositeTrait
> >>
> >> There are major challenges with asking for particular traits as well.
> >> Imagine a desired aggregate on 7 columns. What does the requestor
> >> request with regards to distribution? All seven columns? One column?
> >> Some combination in between? The trait system in Calcite is very
> >> challenging to work with because it is up to downstream users to try to
> >> figure out trait propagation outside the core. So challenging that I
> >> believe many people move to relying on RelMetadata operations, since
> >> those can be pulled across several operators at once.
> >>
> >> It would be great if someone could spend the time to come up with a
> >> more global design for these items so that we avoid solving one-off
> >> problems: rationalizing when something should be a trait, how to avoid
> >> trait planning cost explosion, how to propagate, when something should
> >> be handled via RelMetadataQuery, when something should be managed via
> >> traits versus materialized view alternatives, etc.
> >>
> >> An example of overlapping functionality I'd start with is: should
> >> multitraits for collation really exist, or would exposing these as
> >> materialized view alternatives be more appropriate? Why is it necessary
> >> to have a 'shortcut' for this situation while other alternatives don't
> >> have one?
> >>
> >>
> >>
> >> On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <[email protected]> wrote:
> >>
> >>> It seemed reasonable when I introduced it, and still seems very
> >>> reasonable, that a relational expression (even in the relational
> >>> model) can have multiple physical properties. Consider these questions
> >>> that the planner might ask:
> >>>
> >>> Example 1:
> >>>
> >>> “Are you sorted on hiredate?”
> >>> “Yes”
> >>> “Are you sorted on empno?”
> >>> “Yes”
> >>> “Are you sorted on deptno?”
> >>> “No”
> >>>
> >>> Example 2:
> >>>
> >>> “Can you fit into less than 100MB of memory?”
> >>> “Yes”
> >>> “Can you fit into less than 10MB of memory?”
> >>> “Yes”
> >>> “Can you fit into less than 1MB of memory?”
> >>> “No”
> >>>
> >>> We manage traits like those in example 1 using RelCompositeTrait. We
> >>> can’t handle traits like those in example 2, and so we have trained
> >>> ourselves to not think of “can fit into memory X” as a trait at all.
> >>>
> >>> Perhaps our mistake is to have an API “tell me all of your traits”
> >>> rather than an API “do you have trait X?”. Asking a RelNode to
> >>> enumerate its traits can be painful: the extreme case is an empty
> >>> Values with 100 columns; it satisfies any sort order, and there are
> >>> 100! of these.
> >>>
> >>> Julian
> >>>
> >>>
> >>>
> >>>> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <[email protected]> wrote:
> >>>>
> >>>> Hi Haisheng,
> >>>>
> >>>> Thanks for raising awareness around this topic. I also think we should
> >>>> try to find a solution.
> >>>>
> >>>> Initially, the Volcano planner was designed to be able to cover
> >>>> multiple models (and not only the relational one). For non-relational
> >>>> models composite traits may be indispensable. I don't know if there
> >>>> are people on this list who are using the planner for other models,
> >>>> but if there are it would be nice to hear from them.
> >>>>
> >>>> Focusing exclusively on the relational model, I think composite traits
> >>>> are useful. One use case that comes to my mind is data replication. It
> >>>> makes perfect sense to partition (distribute) your table on two (or
> >>>> more) columns to be able to efficiently execute queries using special
> >>>> partition joins. A concrete use case is RDF data, where many
> >>>> distributed systems store the triples table partitioned by subject
> >>>> and object. I guess such use cases could possibly be modelled in
> >>>> other ways, but composite traits are what comes naturally to my mind.
> >>>>
> >>>> Regarding multi-sorted tables, they are not that rare: for example,
> >>>> if you import sorted data into a table with an auto-increment primary
> >>>> key.
> >>>>
> >>>> I think all the trait-related issues can be solved if we prioritize
> >>>> them correctly. Apart from Vladimir and Hongze, who have already spent
> >>>> quite some time on these, the rest of us should also jump in and try
> >>>> to help.
> >>>>
> >>>> Best,
> >>>> Stamatis
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <[email protected]> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I found there are some RelCompositeTrait related issues:
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2010
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2593
> >>>>> https://issues.apache.org/jira/browse/CALCITE-2764
> >>>>>
> >>>>> Multi-sorted tables are rare in practice, and multi-distributed
> >>>>> tables don't exist either. A Values node with several tuples is not
> >>>>> worth optimizing, and one with many tuples is not worth optimizing
> >>>>> either, because the time it takes the optimizer to figure out the
> >>>>> ordering may be longer than just sorting it at runtime.
> >>>>>
> >>>>> In issue https://issues.apache.org/jira/browse/CALCITE-1990, Leo
> >>>>> extended RelDistribution to inherit RelMultipleTrait, just like
> >>>>> RelCollation does, to solve the problem in his example. But I don't
> >>>>> think this is an appropriate way to represent the equivalence classes
> >>>>> (in PostgreSQL's terms).
> >>>>>
> >>>>> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
> >>>>> first place? It seems to give us more pain than gain.
> >>>>>
> >>>>> Thanks ~
> >>>>> Haisheng Yuan
> >>>>>
> >>>
> >>>
> >>
>
>
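
The contrast drawn in the thread between an API that says "tell me all of
your traits" and one that says "do you have trait X?" can be illustrated
with a minimal sketch. It is a toy model in Python, not Calcite code: the
names Rel, satisfies, is_sorted_on and enumerate_full_orders are all
hypothetical, a collation is assumed to be a plain tuple of ascending
column names, and "satisfies" is assumed to mean a simple prefix match.

```python
from itertools import permutations
from math import factorial

# Toy model only: a collation is a tuple of column names, all ascending.
# None of these names come from Calcite; they are hypothetical.


def satisfies(provided, required):
    """Return True if rows sorted on `provided` are also sorted on `required`.

    Assumption: this holds exactly when `required` is a prefix of `provided`.
    """
    return provided[:len(required)] == required


class Rel:
    """A relational expression with a few known sort orders."""

    def __init__(self, collations, empty=False):
        self.collations = [tuple(c) for c in collations]
        self.empty = empty

    def is_sorted_on(self, required):
        # "Do you have trait X?": answered per question, nothing enumerated.
        if self.empty:
            return True  # an empty relation is trivially sorted on anything
        return any(satisfies(c, tuple(required)) for c in self.collations)


def enumerate_full_orders(columns):
    # "Tell me all of your traits" for an empty relation: every permutation
    # of the columns is a valid full-length ascending sort order.
    return permutations(columns)


# Example 1 from the thread: sorted on hiredate and on empno, not on deptno.
emp = Rel([("hiredate",), ("empno",)])
answers = [emp.is_sorted_on((c,)) for c in ("hiredate", "empno", "deptno")]

# An empty Values answers any single question immediately ...
empty_values = Rel([], empty=True)
any_order_ok = empty_values.is_sorted_on(("c", "a", "b"))

# ... whereas enumerating its orders grows as n! (4 columns give 24 orders).
orders_for_4_columns = len(list(enumerate_full_orders("abcd")))
digits_in_100_factorial = len(str(factorial(100)))
```

In this model a question costs one prefix check per known collation, while
enumeration for n columns yields n! orders; for the 100-column empty Values
mentioned in the thread, 100! is a 158-digit number.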
