Re: Exposing multiple values of a trait from the operator

Haisheng Yuan Tue, 25 May 2021 11:43:07 -0700

Negative, currently.

It is indeed complicated, but still acceptable IMO. If the community have 
consensus on this matter, we can create a JIRA to add the feature to trait and 
handle equivalent keys trait satisfaction in trait.satisfy() and column 
remapping in trait.apply(), hope that can ease some pain, if you happen to use 
Calcite's default distribution/collation implementation.


Regards,
Haisheng Yuan

On 2021/05/25 17:45:32, Vladimir Ozerov <[email protected]> wrote: 
> Hi Haisheng,
> 
> Thank you for the advice. This is exactly how I designed distribution at
> the moment (the approach 2 from my original email) - as a List<int[]>
> instead of just int[]. My main concern was the increased complexity of the
> trait propagation/derivation, as I have to manage these nested lists by
> hand. Nevertheless, it works well. So I hoped that there are better
> built-in approaches that I may use. If the answer is negative, I'll
> continue using the original approach, when multiple alternatives managed
> manually.
> 
> Regards,
> Vladimir.
> 
> вт, 25 мая 2021 г. в 20:30, Haisheng Yuan <[email protected]>:
> 
> > Hi Vladimir,
> >
> > Glad to see you raised the question.
> >
> > Here is the advice:
> > Do not use RelMultipleTrait/RelCompositeTrait, which is fundamentally
> > flawed and has many bugs. It can't work properly no matter for top-down or
> > bottom-up.
> >
> > Instead, we need to add equivalent keys bitmap as the property of physical
> > trait like RelCollation, RelDistribution.
> >
> > For example:
> > class RelDistributionImpl {
> >   // list of distribution keys
> >   private ImmutableIntList keys;
> >
> >    // list of equivalent bitset for each distribution key
> >   private ImmutableList<ImmutableBitSet> equivBitSets;
> > }
> >
> > In the trait satisfy and column remapping, we also need to take equivalent
> > keys into consideration. Some of the work need to be done in Calcite core
> > framework.
> >
> > Greenplum Orca optimizer has similar strategy:
> >
> > https://github.com/greenplum-db/gporca/blob/master/libgpopt/include/gpopt/base/CDistributionSpecHashed.h#L44
> >
> > Regards,
> > Haisheng Yuan
> >
> > On 2021/05/25 15:37:32, Vladimir Ozerov <[email protected]> wrote:
> > > Hi,
> > >
> > > Consider the distributed SQL engine that uses a distribution property to
> > > model exchanges. Consider the following physical tree. To do the
> > > distributed join, we co-locate tuples using the equijoin key. Now the
> > Join
> > > operator has two equivalent distributions - [a1] and [b1]. It is critical
> > > to expose both distributions so that the top Aggregate can take advantage
> > > of the co-location.
> > >
> > > Aggregate[group=b1]
> > >   DistributedJoin[a.a1=b.b1]   // SHARDED[a1], SHARDED[b1]
> > >     Input[a]                   // SHARDED[a1]
> > >     Input[b]                   // SHARDED[b1]
> > >
> > > A similar example for the Project:
> > > Aggregate[group=$1]
> > >   Project[$0=a, $1=a] // SHARDED[$0], SHARDED[$1]
> > >     Input             // SHARDED[a]
> > >
> > > The question is how to model this situation properly?
> > >
> > > First, it seems that RelMultipleTrait and RelCompositeTrait were designed
> > > to handle this situation. However, I couldn't make them work with the
> > > top-down optimizer. The reason is that when we register a RelNode with a
> > > composite trait in MEMO, VolcanoPlanner flattens the composite trait into
> > > the default trait value in RelSet.add -> RelTraitSet.simplify. That is,
> > the
> > > trait [SHARDED[a], SHARDED[b]] will be converted to [ANY] so that the
> > > original traits could not be derived in the PhysicalNode.derive methods.
> > >
> > > Second, we may try to model multiple sharding keys in a single trait. But
> > > this complicates the implementation of PhysicalNode.passThrough/derive
> > > significantly.
> > > SHARDED[a1, a2], SHARDED[b1, b2] -> SHARDED[[a1, a2], [b1, b2]]
> > >
> > > Third, we may expose multiple traits using metadata. RelMdDistribution
> > > would not work, because it exposes only a single distribution. But a
> > custom
> > > handler may potentially fix that. However, it will not be integrated with
> > > the top-down optimizer still, which makes the idea questionable.
> > >
> > > To summarize, it seems that currently there is no easy way to handle
> > > composite traits with a top-down optimizer. I wonder whether someone from
> > > the devlist already solved similar issues in Apache Calcite or other
> > > optimizers. If so, what was the approach or best practices? Intuitively,
> > it
> > > seems that RelMultipleTrait/RelCompositeTrait approach might be the way
> > to
> > > go. But why do we replace the original composite trait set with the
> > default
> > > value in the RelTraitSet.simplify routine?
> > >
> > > Regards,
> > > Vladimir.
> > >
> >
>

Re: Exposing multiple values of a trait from the operator

Reply via email to