Re: Re-working the metadata framework

Julian Hyde Thu, 21 Jan 2016 16:56:58 -0800

I can’t see any requirements that we can drop. Here are the requirements:


* We need multiple providers (e.g. built-in and Drill),
* It’s not essential but it’s very convenient to find providers by looking for 
methods that are like the implemented signature except that the first argument 
can be a sub-class
* We need to detect cycles and cache previously computed results
* We need to be able to flush the cache
* We need to allow kinds of metadata to be defined at run time (although it’s 
certainly more convenient if we can add an accessor method to RelMetadataQuery)
* We need to be able to create composite providers.
* We need to be able to add providers at run time (and indeed use different 
combinations of providers during the planning of a single query).
* We can’t assume that we know ahead of time the full set of RelNode 
sub-classes for which we will ask for metadata.
* We need to be able to unwind or see through HepRelVertex or RelSubset nodes 
that are introduced by the Hep and Volcano planning engines.
* For RelSubset we need to combine the constituent rels’ metadata (e.g. take 
the min of the MaxRowCount, or the union of the inferred predicates).

I’ve dropped the requirement that you create an instance of the metadata class 
by “binding” a RelNode and a RelMetadataQuery. Now for each metadata interface 
you define a handler interface that has an extra 2 parameters. So we no longer 
need to create temporary “bound” objects.

Given all those requirements we have a complex dispatch problem. Reflection 
solved it but it was difficult to understand AND inefficient. Code generation 
is difficult to understand but yields code that is easy to understand (if you 
turn on -Dcalcite.debug) and is efficient.

My dev branch is complete and ready for review. Take a look at 
https://issues.apache.org/jira/browse/CALCITE-604 for an example of the 
generated code; I think it clarifies things a lot.

Julian



> On Jan 18, 2016, at 9:31 PM, Jacques Nadeau <[email protected]> wrote:
> 
> I read your email.  My thought was that we should evaluate the real
> requirements. Maybe we need to shed some capabilities to achieve better
> performance. Can you break down the issue into a set of example patterns?
> In general, it seems like we may be trying to be too flexible. I understand
> the desire to maintain as much backwards compatibility as possible but
> think we may want to come up with a simpler metaphor. My goal is to come up
> with the right design rather than trying to optimize what might be the
> wrong design. Maybe you've done all the thinking and believe the existing
> metaphors are exactly right but I think we should get consensus around this
> the requirements before jumping to implementation approaches such as code
> generation.
> On Jan 18, 2016 5:40 PM, "Julian Hyde" <[email protected]> wrote:
> 
>> I describe in paragraph 3 why we CURRENTLY use reflection. In short: very
>> complex dispatch requirements. The rest of the message is how I plan to
>> phase it out. Plain java only has one dispatch mechanism (virtual methods)
>> so isn’t going to cut it.
>> 
>>> On Jan 18, 2016, at 5:23 PM, Jacques Nadeau <[email protected]> wrote:
>>> 
>>> Can you go into more detail to why reflection is needed? It seems like we
>>> could get away from reflection by sharing interfaces, etc.
>>> 
>>> On Mon, Jan 18, 2016 at 5:12 PM, Julian Hyde <[email protected]> wrote:
>>> 
>>>> In https://issues.apache.org/jira/browse/CALCITE-794 <
>>>> https://issues.apache.org/jira/browse/CALCITE-794> we added an extra
>>>> parameter to each metadata call so that we could detect cyclic metadata
>>>> calls, and potentially to cache results so that a given statistic is
>> never
>>>> computed more than once during a metadata call. But the overhead of
>> making
>>>> calls into the metadata framework is still very high. It shows up as a
>> big
>>>> fraction of the time spent optimizing complex queries. I am working on
>>>> https://issues.apache.org/jira/browse/CALCITE-604 <
>>>> https://issues.apache.org/jira/browse/CALCITE-604>, which aims to fix
>>>> that.
>>>> 
>>>> I am working on 604 while the release is closing, and I thought some of
>>>> you would be be interested to know where I am going.
>>>> 
>>>> We use reflection to make calls. This is necessary because the types of
>>>> metadata (e.g. selectivity, row count, unique keys, predicates) are
>>>> extensible, you can have multiple providers for each kind of metadata,
>> each
>>>> provider has different methods for various RelNode sub-types, and we
>> want
>>>> to be able to inherit handler methods (e.g. getUniqueKeys(Aggregate,
>>>> boolean) handles getUniqueKeys(LogicalAggregate, boolean) because there
>> is
>>>> no handler method for LogicalAggregate.
>>>> 
>>>> Initially I thought we’d use MethodHandle, which is a lot faster than
>>>> method invocation by reflection. MethodHandle.invoke has some
>> flexibility
>>>> based on the types of its arguments, but I realized we’d still have to
>>>> dispatch to multiple underlying providers (e.g. the built-in provider
>> and
>>>> the Hive provider). And we have other inefficiencies such as calling
>>>> UnboundMetadata.bind(RelNode, RelMetadataQuery) to create a short-lived
>>>> object every single call.
>>>> 
>>>> So, now I am looking at using Janino to generate a dispatcher. Consider
>>>> just one kind of metadata, UniqueKeys. We already have a “signature”
>>>> interface:
>>>> 
>>>> public interface UniqueKeys extends Metadata {
>>>> Set<ImmutableBitSet> getUniqueKeys(boolean ignoreNulls);
>>>> }
>>>> 
>>>> I have added a handler interface:
>>>> 
>>>> interface UniqueKeysHandler {
>>>> Set<ImmutableBitSet> getUniqueKeys(RelNode r, RelMetadataQuery mq,
>>>> boolean ignoreNulls);
>>>> }
>>>> 
>>>> Now, given a set of metadata providers and the set of all known RelNode
>>>> sub-type, I can use Janino to generate a handler at run time:
>>>> 
>>>> class UniqueKeysHandlerImpl implements UniqueKeysHandlerImpl {
>>>> final RelMdUniqueKeys provider0;
>>>> final HiveUniqueKeys provider1;
>>>> 
>>>> UniqueKeysHandlerImpl(RelMdUniqueKeys provider0, HiveUniqueKeys
>>>> provider1) {
>>>>   this.provider0 = provider0;
>>>>   this.provider1 = provider1;
>>>> }
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode r,
>>>>     RelMetadataQuery mq, boolean ignoreNulls) {
>>>>   switch (r.getClass().getName()) {
>>>>   case "org.apache.calcite.rel.logical.LogicalAggregate":
>>>>     return provider0.getUniqueKeys((Aggregate) r, mq, ignoreNulls);
>>>>   case "org.apache.calcite.rel.core.Aggregate":
>>>>     return provider0.getUniqueKeys(r, mq, ignoreNulls);
>>>>   case “
>>>> org.apache.hadoop.hive.ql.optimizer.calcite.reloperators.HiveAggregate":
>>>>     return provider1.getUniqueKeys(r, mq, ignoreNulls);
>>>>   default:
>>>>     throw NoHandler.INSTANCE;
>>>>   }
>>>> }
>>>> }
>>>> 
>>>> The entry point in RelMetadataQuery changes from
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>>>   boolean ignoreNulls) {
>>>> final BuiltInMetadata.UniqueKeys metadata =
>>>>     rel.metadata(BuiltInMetadata.UniqueKeys.class, this);
>>>> return metadata.getUniqueKeys(ignoreNulls);
>>>> }
>>>> 
>>>> to
>>>> 
>>>> public Set<ImmutableBitSet> getUniqueKeys(RelNode rel,
>>>>   boolean ignoreNulls) {
>>>> for (;;) {
>>>>   try {
>>>>     return uniqueKeysHandler.getUniqueKeys(rel, this, ignoreNulls);
>>>>   } catch (NoHandler e) {
>>>>     uniqueKeysHandler = metadataProvider.revise(rel.getClass(),
>>>>         BuiltInMetadata.UniqueKeys.Handler.class);
>>>>   }
>>>> }
>>>> }
>>>> 
>>>> The “NoHandler” exception occurs very rarely — only when a kind of
>> RelNode
>>>> is seen that hasn’t been seen before in this JVM instance — but gives
>> the
>>>> handler chance to regenerate itself.
>>>> 
>>>> The result is a very direct path from the caller (generally a
>> RelOptRule)
>>>> to the provider: two calls, and we don’t even need to box the arguments.
>>>> 
>>>> I don’t think there will be any API changes, but note that the metadata
>>>> interfaces (eg. UniqueKeys) and RelNode.metadata(Class<M> metadataClass,
>>>> RelMetadataQuery mq) are not used anymore.
>>>> 
>>>> Julian
>>>> 
>>>> 
>> 
>>

Re: Re-working the metadata framework

Reply via email to