Re: [DISCUSS] RelCompositeTrait

Hongze Zhang Tue, 16 Apr 2019 17:49:26 -0700

You are right, removing the collations would only work around what led us to
find the issue on multi-sort. Maybe we had better not remove them (at least
for now), since they provide an easy way to test against composite traits.


Hongze

> On Apr 17, 2019, at 01:15, Haisheng Yuan <h.y...@alibaba-inc.com> wrote:
> 
>> it looks like if we want to get these problems fixed quickly we can just remove
>> EnumerableValues's collation emitting.
> 
> I am afraid that even removing the Values collation enumeration won't actually
> give us a quick fix, because a multi-sorted table, if there is one, might
> still hit the same issue as Values.
> 
> 
> Thanks ~
> Haisheng Yuan
> ------------------------------------------------------------------
> From: Hongze Zhang<notify...@126.com>
> Date: 2019-04-16 18:10:17
> To: <dev@calcite.apache.org>
> Subject: Re: [DISCUSS] RelCompositeTrait
> 
> If we narrow the scope to Calcite itself, I think the 3 JIRA
> tickets that Haisheng has listed (thanks, Haisheng!), CALCITE-2010,
> CALCITE-2593 and CALCITE-2764, are all more or less related to the
> multi-sorted EnumerableValues. And it looks like if we want to get these
> problems fixed quickly we can just remove EnumerableValues's collation
> emitting. I recall that (correct me if I am wrong) the rel is not even
> able to emit descending collations, so I suppose it was not perfect in
> the first place.
> 
> Another discussion is about enumerating traits. IMHO it's hard to say
> that Calcite hasn't already tried to avoid enumerating them. The
> methods RelCollationImpl#satisfies[1] and RelDistributions#satisfies[2]
> already test the relationship between traits without checking
> equality. So the whole thing looks like we already tried not to
> enumerate them but failed in the end.
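
For illustration, testing whether one trait satisfies another without comparing them for equality could look roughly like the sketch below. It assumes a prefix rule (a collation satisfies a requirement when the required keys are a leading prefix of the provided keys), ignores sort direction, and treats a composite as "any member satisfies". The class and method names are hypothetical; this is not the Calcite source.

```java
import java.util.Arrays;
import java.util.List;

public class CollationSatisfiesSketch {
    /**
     * Hypothetical stand-in for a collation: an ordered list of sort-key
     * column indexes. Assumption: "provided" satisfies "required" when the
     * required keys are a leading prefix of the provided keys.
     */
    static boolean satisfies(List<Integer> provided, List<Integer> required) {
        if (required.size() > provided.size()) {
            return false;
        }
        return provided.subList(0, required.size()).equals(required);
    }

    /** A composite of several collations satisfies a requirement if any member does. */
    static boolean anySatisfies(List<List<Integer>> composite, List<Integer> required) {
        for (List<Integer> provided : composite) {
            if (satisfies(provided, required)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<Integer> sortedOnAThenB = Arrays.asList(0, 1);
        // Sorted on (a, b) implies sorted on (a), but not on (b) alone.
        System.out.println(satisfies(sortedOnAThenB, Arrays.asList(0)));
        System.out.println(satisfies(sortedOnAThenB, Arrays.asList(1)));

        // A multi-sorted input: sorted on (a, b) and, independently, on (c).
        List<List<Integer>> multiSorted =
            Arrays.asList(sortedOnAThenB, Arrays.asList(2));
        System.out.println(anySatisfies(multiSorted, Arrays.asList(2)));
    }
}
```

The point of the sketch is only that a requirement can be checked against what a rel provides without first listing every order the rel could deliver.
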
> 
> Regarding the composite traits, one awkward thing I can see so far
> is the method RelTraitSet#simplify[3]. The JavaDoc says the method
> is to "return a trait set similar to this one but with all composite
> traits flattened". But when we look into the related implementations
> RelCollationTraitDef#getDefault[4]/RelDistributionTraitDef#getDefault[5],
> they do not seem to flatten anything; the traits simply get wiped.
> This makes me worry about whether it is really correct that
> RelCollation/RelDistribution extend RelMultipleTrait, because we can't
> leverage the trait simplification but are actually hurt by it. If a rel
> loses its physical property, we can never avoid adding
> unnecessary sorts/exchanges.
> 
> Besides, even if we decide to add some extra sorts/exchanges, that is
> not easy so far. See CALCITE-2592/CALCITE-2970: the planner does not
> add them automatically all that smoothly.
> 
> Overall, regarding these "small" problems, I think none of them is
> really impossible to solve (yes, coming up with the right solutions may
> not be that straightforward). But of course, if in the future a brand new
> design can be proposed to improve the entire trait system (such as
> avoiding enumerating traits), I think that would be a great thing.
> 
> Best,
> Hongze
> 
> 
> [1]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationImpl.java#L118
> [2]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributions.java#L143
> [3]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/plan/RelTraitSet.java#L526-L538
> [4]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelCollationTraitDef.java#L58-L60
> [5]https://github.com/apache/calcite/blob/9538374e8fae5cec7d6f7b270850f5dfb4c1fc06/core/src/main/java/org/apache/calcite/rel/RelDistributionTraitDef.java#L47-L49
> 
> ------ Original Message ------
> From: "Haisheng Yuan" <h.y...@alibaba-inc.com>
> To: "Jacques Nadeau" <jacq...@apache.org>; "Apache Calcite dev list" 
> <dev@calcite.apache.org>
> Sent: 2019/4/15 12:27:04
> Subject: Re: Re: [DISCUSS] RelCompositeTrait
> 
>>> There are major challenges with asking for particular traits as well.
>>> Imagine a desired aggregate on 7 columns. What does the requestor request
>>> with regards to distribution? All seven columns? One column? Some
>>> combination in between?
>> 
>> The same challenges exist for enumerating all the traits as well. Imagine
>> there is an order by the 7 grouping keys on top of the aggregate on 7 columns,
>> but with different sort directions:
>> select * from foo group by a,b,c... order by c desc, a asc, b desc...
>> What sort order and direction should the sort-based stream aggregate provide?
>> All ascending, all descending, order (a,b,c...), order (..c,b,a), or all the
>> combinations?
>> All of those enumerated traits are useless except one; for the others, an
>> additional sort operator will be needed.
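
To put a rough number on "all the combinations", here is a small counting sketch. It assumes every candidate order uses all n grouping keys exactly once and that each key is independently ascending or descending, which gives n! * 2^n orders; null ordering and shorter prefixes are ignored. The class and method names are made up for the example.

```java
import java.math.BigInteger;

public class SortOrderCount {
    /** n! computed with BigInteger so it does not overflow for larger n. */
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    /**
     * Number of full-length sort orders over n keys when each key can be
     * ascending or descending: n! permutations times 2^n direction choices.
     */
    static BigInteger fullKeyOrderings(int n) {
        return factorial(n).shiftLeft(n); // multiply by 2^n
    }

    public static void main(String[] args) {
        // 7! = 5040 permutations, 2^7 = 128 direction choices.
        System.out.println(fullKeyOrderings(7)); // prints 645120
    }
}
```

Under these assumptions a 7-key stream aggregate has 645120 candidate orders, of which the parent sort in the example needs exactly one.
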
>> 
>> Another example is an aggregate on top of a join, where the join is on 7 keys
>> and the aggregate is on 2 of the join keys. In a distributed system, what
>> distribution trait would the join operator provide? The 2 grouping keys? All
>> the join keys? All the combinations?
>> 
>> Enumerating some or all of the deliverable traits is not purpose driven. All
>> the traits may be just useless for the parent operator. On the other hand,
>> asking the child operator for particular traits is purpose driven: at least
>> the traits asked for by the parent operator are worth consideration, and not
>> as wasteful as the former.
>> 
>> If I understand RelCompositeTrait's intent correctly, the enumerated traits,
>> whether some combinations or all combinations, should be saved here. But in
>> fact, it seems they are not. And as Jacques mentioned, many people rely on
>> RelMetadata operations to pull up the traitsets through operators.
>> 
>> This makes me wonder whether there are any real use cases or systems that
>> rely on RelCompositeTrait. If someone has the story, we would love to hear it.
>> 
>> Putting that aside, even if RelCompositeTrait is indispensable, why do we
>> bother optimizing the Values node? For Values with several tuples, it is not
>> worth optimizing; with many tuples, it may take more time to enumerate the
>> RelCollation than to just sort it. Specifically, for Values with 0 or 1 tuple
>> but many columns, it is definitely not worth the optimization, because the
>> sort removal rule and the empty rel removal rule should do the work.
>> 
>> 
>> Thanks ~
>> Haisheng Yuan
>> ------------------------------------------------------------------
>> From: Jacques Nadeau<jacq...@apache.org>
>> Date: 2019-04-15 07:36:51
>> To: <dev@calcite.apache.org>
>> Subject: Re: [DISCUSS] RelCompositeTrait
>> 
>> There are major challenges with asking for particular traits as well.
>> Imagine a desired aggregate on 7 columns. What does the requestor request
>> with regards to distribution? All seven columns? One column? Some
>> combination in between? The trait system in Calcite is very challenging to
>> work with because it is up to downstream users to try to figure out trait
>> propagation outside the core. So challenging, that I believe that many
>> people move to relying on RelMetadata operations since those can be pulled
>> across several operators at once.
>> 
>> It would be great if someone could spend the time to come up with a more
>> global design for these items and we avoid solving one-off problems.
>> Rationalizing when something should be a trait, how to avoid trait planning
>> cost explosion, how to propagate, when something should be handled via
>> RelMetadataQuery, when something should be managed via traits versus
>> materialized view alternatives, etc.
>> 
>> An example of overlapping functionality I'd start with is: should
>> multitraits for collation really exist or would exposing these as
>> materialized view alternatives be more appropriate? Why is it necessary to
>> have a 'shortcut' for this situation while other alternatives don't have
>> one?
>> 
>> 
>> 
>> On Mon, Apr 8, 2019 at 4:38 PM Julian Hyde <jh...@apache.org> wrote:
>> 
>>> It seemed reasonable when I introduced it, and seems very reasonable, that
>>> a relational expression (even in the relational model) can have multiple
>>> physical properties. Consider these questions that the planner might ask:
>>> 
>>> Example 1:
>>> 
>>> “Are you sorted on hiredate?”
>>> “Yes”
>>> “Are you sorted on empno?”
>>> “Yes”
>>> “Are you sorted on deptno?”
>>> “No”
>>> 
>>> Example 2:
>>> 
>>> “Can you fit into less than 100MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 10MB of memory?”
>>> “Yes”
>>> “Can you fit into less than 1MB of memory?”
>>> “No”
>>> 
>>> We manage traits like those in example 1 using RelCompositeTrait. We can’t
>>> handle traits like this in example 2, and so we have trained ourselves to
>>> not think of “can fit into memory X” as a trait at all.
>>> 
>>> Perhaps our mistake is to have an API “tell me all of your traits” rather
>>> than an API “do you have trait X?”. Asking a RelNode to enumerate its
>>> traits can be painful: the extreme case is an empty Values with 100
>>> columns; it satisfies any sort order, and there are 100! of these.
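
A minimal sketch of the "do you have trait X?" style of API, under stated assumptions: the interface `SortQueryable` and its method `isSortedOn` are hypothetical and do not exist in Calcite. The empty Values answers every question with "yes" in constant time, whereas listing even just the full-length column orders for 100 columns would mean 100! entries.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

public class TraitQuerySketch {
    /** Hypothetical predicate-style API: the planner asks about one sort order at a time. */
    interface SortQueryable {
        boolean isSortedOn(List<Integer> columns);
    }

    /** An empty Values has no rows, so it is trivially sorted on any column list. */
    static final SortQueryable EMPTY_VALUES = columns -> true;

    /** n!, the number of full-length column orders an enumerating API would have to list. */
    static BigInteger permutations(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        // Answering a single question needs no enumeration at all.
        System.out.println(EMPTY_VALUES.isSortedOn(Arrays.asList(42, 7, 99)));

        // Enumerating instead: 100! is a number with 158 decimal digits.
        System.out.println(permutations(100).toString().length() + " digits");
    }
}
```
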
>>> 
>>> Julian
>>> 
>>> 
>>> 
>>>> On Apr 8, 2019, at 3:51 PM, Stamatis Zampetakis <zabe...@gmail.com> wrote:
>>>> 
>>>> Hi Haisheng,
>>>> 
>>>> Thanks for raising awareness around this topic. I also think we should try
>>>> to find a solution.
>>>> 
>>>> Initially, the Volcano planner was designed to be able to cover multiple
>>>> models (and not only the relational). For non-relational models composite
>>>> traits may be indispensable. I don't know if there are people on this list
>>>> that are using the planner for other models but if there are it would be
>>>> nice to hear from them.
>>>> 
>>>> Focusing exclusively on the relational model, I think composite traits are
>>>> useful. One use-case that comes to my mind is data replication. It
>>>> makes perfect sense to partition (distribute) your table on two (or more)
>>>> columns to be able to efficiently execute queries using special partition
>>>> joins. A concrete use-case is RDF data, where many distributed systems store
>>>> the triples table partitioned by subject and object. I guess such use-cases
>>>> could possibly be modelled in other ways but composite traits are what comes
>>>> naturally to my mind.
>>>> 
>>>> Regarding multi-sorted tables, they are not that rare: for example, if you
>>>> import sorted data into a table with an auto-increment primary key.
>>>> 
>>>> I think all the trait-related issues can be solved if we prioritize them
>>>> correctly. Apart from Vladimir and Hongze, who have already spent quite some
>>>> time on these, the rest of us should also jump in and try to help.
>>>> 
>>>> Best,
>>>> Stamatis
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Sun, Apr 7, 2019 at 9:48 AM Haisheng Yuan <h.y...@alibaba-inc.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 
>>>>> I found there are some RelCompositeTrait related issues:
>>>>> https://issues.apache.org/jira/browse/CALCITE-2010
>>>>> https://issues.apache.org/jira/browse/CALCITE-2593
>>>>> https://issues.apache.org/jira/browse/CALCITE-2764
>>>>> 
>>>>> Multi-sorted tables are rare in practice, and multi-distributed tables
>>>>> don't exist either. A Values node with several tuples is not worth
>>>>> optimizing, and one with many tuples is not worth optimizing either, because
>>>>> the time it takes the optimizer to figure out the ordering may be longer than
>>>>> just sorting it at runtime.
>>>>> 
>>>>> In issue https://issues.apache.org/jira/browse/CALCITE-1990,
>>>>> Leo extended RelDistribution to inherit RelMultipleTrait, just like
>>>>> RelCollation does, to solve his problem in the example. But I don't think
>>>>> this is an appropriate way to represent the equivalence classes (in
>>>>> PostgreSQL's terms).
>>>>> 
>>>>> So why did we introduce RelCompositeTrait and RelMultipleTrait in the
>>>>> beginning? It seems to give us more pain than gain.
>>>>> 
>>>>> Thanks ~
>>>>> Haisheng Yuan
>>>>> 
>>> 
>>> 
>>
