Re: Re: Problem with converters and possibly rule matching

2019-11-03 Thread Haisheng Yuan
Hi Vladimir,

This can still be done through the top-down request approach.

The PhysicalFilter operator should request ANY distribution from its child operator, 
unless there is an outer reference in the filter condition, in which case 
PhysicalFilter should request SINGLETON or BROADCAST distribution. So in your 
case, PhysicalFilter requests ANY, and its required distribution will be enforced on 
the filter's output.
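The request logic above can be sketched as follows (a minimal illustration with made-up names, not Calcite's actual API):

```java
// Illustrative sketch: a physical filter passes the distribution decision
// through by requesting ANY, unless the condition has an outer reference,
// in which case the rows must be co-located and SINGLETON is requested.
enum Distribution { ANY, SINGLETON, BROADCAST, PARTITIONED }

class PhysicalFilterSketch {
    static Distribution requiredChildDistribution(boolean conditionHasOuterRef) {
        // An outer reference means the predicate cannot be evaluated
        // independently on each partition, so co-locate the rows first.
        return conditionHasOuterRef ? Distribution.SINGLETON : Distribution.ANY;
    }
}
```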

Regarding index usage, you should have a FilterTableScan2IndexGet logical 
transformation rule and an IndexGet2IndexScan physical implementation rule. 
Note that IndexGet is a logical operator and IndexScan is a physical operator; 
both are also used by SQL Server.
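A toy sketch of the two-rule pipeline described above, with operators modeled as plain strings (the real rules would be written against Calcite's rule API; all names here are illustrative):

```java
// Step 1 (logical transformation): rewrite Filter(TableScan) into a
// logical IndexGet when an index matches the filter predicate.
// Step 2 (physical implementation): implement IndexGet as IndexScan.
class IndexRulesSketch {
    static String filterTableScan2IndexGet(String op, boolean indexMatchesPredicate) {
        if (op.equals("Filter(TableScan)") && indexMatchesPredicate) {
            return "IndexGet";  // still a logical operator
        }
        return op;              // rule does not fire
    }

    static String indexGet2IndexScan(String op) {
        return op.equals("IndexGet") ? "IndexScan" : op;
    }
}
```

Chaining the two rules turns a matching `Filter(TableScan)` into a physical `IndexScan`, while non-matching shapes pass through unchanged.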

- Haisheng

--
From: Vladimir Ozerov
Date: 2019-11-01 17:30:26
To:
Subject: Re: Problem with converters and possibly rule matching

Hi Stamatis,

Thank you for your reply. I also thought that we may set the distribution
trait during logical planning because it is known in advance. And the
example I gave will work! :-) But unfortunately, it will work only because
the tree is very simple, and the Project is adjacent to the Scan. This is
how my reproducer will work in that case:
1) Root: enforce "SINGLETON" on Project
2) Project: check the logical Scan, infer the already resolved distribution,
then convert to [PhysicalProject <- PhysicalScan]
3) Resolve the Root enforcer, adding an Exchange if needed.

But this stops working as soon as a plan becomes more complex so that it is
impossible to infer the distribution from the child immediately. E.g.:
LogicalRoot [distribution=SINGLETON]
-> LogicalProject // We are here now and cannot produce the physical project
 -> LogicalFilter[distribution=?]
  -> LogicalScan[distribution=REPLICATED]

This is where your suggestion with cascading enforcement may kick in. But
now consider that instead of having a REPLICATED distribution, which
satisfies SINGLETON, we have a PARTITIONED distribution. It doesn't satisfy
SINGLETON, so we have to insert an exchange.

Before:
PhysicalRoot [SINGLETON]
-> PhysicalProject [SINGLETON]
 -> PhysicalFilter [SINGLETON]
  -> LogicalScan [PARTITIONED] // SINGLETON is enforced here

After:
PhysicalRoot [SINGLETON]
-> PhysicalProject [SINGLETON]
 -> PhysicalFilter [SINGLETON]
  -> SingletonExchange [SINGLETON]
   -> PhysicalScan [PARTITIONED]

But unfortunately, this plan is not optimal. Since we perform the Project and
Filter after the exchange, we now have to reshuffle the whole table. The
optimal plan would be to push the Project and Filter down past the exchange:
PhysicalRoot [SINGLETON]
-> SingletonExchange [SINGLETON]
 -> PhysicalProject [PARTITIONED]
  -> PhysicalFilter [PARTITIONED]
   -> PhysicalScan [PARTITIONED]

But in order to achieve that, I would now need to introduce new rules that
attempt to transpose dozens of different operators past the exchange, so most
likely the planning will never finish :-)
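Back-of-the-envelope arithmetic (all numbers assumed for illustration) shows why the pushed-down plan wins: the exchange in the first plan moves every byte of the table, while in the second it moves only the filtered, projected rows:

```java
// Sketch of the shuffle cost of an exchange: bytes moved = rows * row width.
// In the non-pushed-down plan the exchange sees the raw table; in the
// pushed-down plan it sees only the rows surviving the Filter, at the
// narrower width produced by the Project.
class ShuffleCostSketch {
    static long shuffledBytes(long rows, int rowBytes) {
        return rows * rowBytes;
    }
}
```

For example, with an assumed 1M-row table at 100 bytes/row, the first plan shuffles 100 MB; with an assumed 1% filter selectivity and 20-byte projected rows, the second plan shuffles only 200 KB.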

This is why the bottom-up approach seems more natural
for such cases. Another example is index support. A table may have a number
of access methods in addition to plain scan, and every such method may
produce a physical scan with different collations. It is not practical to
try to enforce something from the top. Instead, we'd better produce
alternatives from the bottom.
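The bottom-up generation of scan alternatives might look like this (an illustrative model, not Calcite's API): each access method yields a physical scan with its own collation, and every alternative is registered so the optimizer can choose by cost.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: enumerate physical scan alternatives for a table from its
// access methods. The plain scan has no collation; each index contributes
// a scan alternative carrying that index's collation.
class AccessPathSketch {
    static List<String> physicalScans(String table, List<String> indexCollations) {
        List<String> alternatives = new ArrayList<>();
        alternatives.add("TableScan(" + table + ", collation=NONE)");
        for (String c : indexCollations) {
            alternatives.add("IndexScan(" + table + ", collation=" + c + ")");
        }
        return alternatives;
    }
}
```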

Basically, Drill tries to mix two approaches. First, it tries to derive
traits from the child. If it fails and no transformations were created, it
just stops trait propagation. As a result, an exchange is created on top of
the operator where we stopped, which again leads to suboptimal plans. You
may observe this in action if you run my reproducer. "testPass" produces an
optimal plan with the help of abstract converters. "testDrill" produces a
suboptimal plan with an unnecessary exchange.

All in all, I cannot see how we can employ distribution metadata and index
metadata for optimal planning without using a pure bottom-up approach.

Regards,
Vladimir.

On Fri, Nov 1, 2019 at 11:42, Stamatis Zampetakis wrote:

> The discussion is interesting; thanks for the nice examples.
> I am replying in this thread since it has all the examples and the history
> of the conversation.
>
> *Part I: Distribution physical or logical property*
>
> The Volcano paper writes the following regarding properties:
>
> "Logical properties can be derived from the logical algebra expression and
> include schema, expected size, etc., while physical properties depend on
> algorithms, e.g., sort order, partitioning, etc."
>
> We could say that partitioning is not only a property which depends on an
> algorithm but a property which depends on the schema. When you have a table
> you know in advance that the particular table is replicated.
>
> Basically, this means that the starting "logical" plan in your case could
> be the following:
> RootLogicalRel[convention=LOGICAL, distribution=ANY]
> -> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
> -> MapScanLogicalRel[convention=LOGICAL, 

Re: Re: [DISCUSS] On-demand traitset request

2019-11-03 Thread Haisheng Yuan
Hi Jinfeng,

I think you might have missed the email about proposed API for physical 
operators I sent out previously in [1].

We don't need to request all the permutations, which is also impossible in 
practice; the search space would explode.

In the example in email [1], I already addressed your concern that passing down 
the parent's request into children may lead to a less optimal plan. Basically, the 
join operator can send 2 collation optimization requests: one is to pass the request 
down, the other is to ignore the parent's request.
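The two-request idea can be sketched as follows (assumed shape, not the actual API proposed in [1]):

```java
import java.util.Arrays;
import java.util.List;

// Sketch: a join emits two collation requests for the optimizer to cost.
// Option 1 passes the parent's requirement down to the inputs; option 2
// ignores it and uses the join keys' own collation, leaving the parent's
// requirement to be satisfied later by an enforcer.
class JoinRequestSketch {
    static List<String> collationRequests(String parentRequest, String joinKeyCollation) {
        return Arrays.asList(
            parentRequest,       // option 1: pass the parent's request down
            joinKeyCollation);   // option 2: ignore it; enforce above the join
    }
}
```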

Using AbstractConverter to enforce properties is inappropriate; it hands all 
the optimization work to physical operator providers, meaning there is 
almost no physical-level optimization mechanism in Calcite. SQL Server and 
Greenplum's optimizers, which are based on the Cascades framework, implement 
property enforcement in the core optimizer engine, not through 
AbstractConverter and rules; physical operators just need to implement those 
methods (or similar) I mentioned in email [1]. My goal is to abolish 
AbstractConverter completely.

[1] 
http://mail-archives.apache.org/mod_mbox/calcite-dev/201910.mbox/%3cd75b20f4-542a-4a73-897e-66ab426494c1.h.y...@alibaba-inc.com%3e

- Haisheng

--
From: Jinfeng Ni
Date: 2019-11-01 14:10:30
To:
Subject: Re: [DISCUSS] On-demand traitset request

Hi Xiening,

"Let's say R and S don't have sorting properties at all. In your
case, we would end up adding enforcers for the LHS and RHS to get
collation (a, b, c). Then we would need another enforcer to get
collation (b, c). This is a suboptimal plan, as we could have used (b,
c, a) for the join."

In such a case, for step 2, when MergeJoin requests a permutation match of
(a, b, c) on both its inputs, it does not necessarily end up with
collation (a, b, c) only. Since it requests a "permutation", MJ could ask for
all possible satisfying collations, which include (b, c, a). In other
words, the steps I described did not exclude such a plan.
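What a "permutation" request expands to can be illustrated directly: every ordering of the join keys is a satisfying collation, so (b, c, a) is among the candidates when a permutation match of (a, b, c) is requested. A small sketch:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: enumerate all orderings of the join keys. Each one is a
// collation that satisfies a "permutation match" request.
class PermutationSketch {
    static List<List<String>> permutations(List<String> keys) {
        List<List<String>> result = new ArrayList<>();
        permute(new ArrayList<>(keys), 0, result);
        return result;
    }

    // Classic swap-based permutation enumeration.
    private static void permute(List<String> keys, int i, List<List<String>> out) {
        if (i == keys.size()) {
            out.add(new ArrayList<>(keys));
            return;
        }
        for (int j = i; j < keys.size(); j++) {
            Collections.swap(keys, i, j);
            permute(keys, i + 1, out);
            Collections.swap(keys, i, j);
        }
    }
}
```

For three keys this yields 3! = 6 candidate collations, which is also why requesting all permutations of wide keys is impractical, as noted elsewhere in this thread.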

You may argue that it would increase the search space. However, by
limiting the search space without exploring all possible choices, we may
lose the chance of getting the 'optimal' plan we want. For instance, in the
above example, the idea of passing the "on demand" trait request (b, c)
from Agg to MJ is to avoid an unnecessary sort on (b, c). In cases where the
join condition has good filtering, such a sort of the join output could
be quite cheap. Yet in the plan enumeration, since we use the "on demand"
trait request from the parent to guide the actions of MJ, I'm not sure whether
we may restrict the choices we consider in the legs of the join, whose
cardinality could be larger and play a bigger role in the overall
cost.

In other words, by using "on demand" trait requests, we may restrict
the choice of plans, possibly in the operators with larger data
sizes.

In the current implementation of VolcanoPlanner, I feel the root issue
behind long planning times is not exploring all possible satisfying traits.
It is actually the unnecessary AbstractConverters added to the
equivalence class.


On Fri, Oct 18, 2019 at 10:39 PM Xiening Dai  wrote:
>
> Thanks for sharing. I like the way you model this problem, Jinfeng.
>
> There’s one minor issue with your example. Let's say R and S don't have 
> sorting properties at all. In your case, we would end up adding enforcers for 
> the LHS and RHS to get collation (a, b, c). Then we would need another enforcer 
> to get collation (b, c). This is a suboptimal plan, as we could have used (b, 
> c, a) for the join.
>
> I think in step #2, the join operator would need to take the agg trait 
> requirement into account. Then it would have two options -
>
> 1) require *exact/super* match of (b, c, a) or (c, b, a); this is to 
> guarantee the join output would deliver the collation agg needs.
> 2) require a permutation match of (a, b, c); in such a case, an enforcer might be 
> needed for aggregation.
>
> Eventually the cost model decides who is the winner.
>
> There’s a fundamental difference between your model and Haisheng’s proposal. 
> In Haisheng’s case, a rel node not only looks at its parent’s requirement, 
> but also tries to get the potential traits its input could deliver. It would 
> try to align them to eliminate unnecessary alternatives.
>
> In above example, assuming R is (b, c, a) and S is (a, b, c), to implement 
> option 1), we would generate two alternatives -
>
> MergeJoin (b, c, a)
>   TableScan R
>   Sort(b, c, a)
>     TableScan S
>
> MergeJoin(c, b, a)
>   Sort(c, b, a)
>     TableScan R
>   Sort(c, b, a)
>     TableScan S
>
> But if we look at the input traits and have the insight that R already 
> delivers (b, c, a), we could decide to require (b, c, a) only and avoid 
> generating the 2nd plan, which is definitely worse, thereby reducing the search 
> space.
>
>
> > On Oct 18, 2019, at 4:57 PM, Jinfeng Ni  wrote:
> >
> > A little bit of history. In Drill, when we first implemented
> > Distribution trait's definition, we allows both exact match 

[jira] [Created] (CALCITE-3472) SESSION_END returns same value as SESSION_START

2019-11-03 Thread Pablo Estrada (Jira)
Pablo Estrada created CALCITE-3472:
--

 Summary: SESSION_END returns same value as SESSION_START
 Key: CALCITE-3472
 URL: https://issues.apache.org/jira/browse/CALCITE-3472
 Project: Calcite
  Issue Type: Improvement
  Components: stream
Reporter: Pablo Estrada


We've found that the session_end function seems to be returning the same values 
as session_start. Furthermore, it seems that they may have the same 
implementation, though I'm not fully familiar with the code:

 

[https://github.com/apache/calcite/blob/d7946a94adfd2e788f5d324910944dd65dab11ee/core/src/main/java/org/apache/calcite/sql2rel/AuxiliaryConverter.java#L50-L68]

 

An example with Beam SQL:

 

0: BeamSQL> SELECT
. . . . . >  SESSION_START(scores_stream.event_time, INTERVAL '1' SECOND),
. . . . . >  SESSION_END(scores_stream.event_time, INTERVAL '1' SECOND),
. . . . . >  scores_stream.team,
. . . . . >  SUM(scores_stream.score),
. . . . . >  COUNT(*)
. . . . . > FROM
. . . . . >  scores_stream
. . . . . > GROUP BY scores_stream.team,
. . . . . >  SESSION(scores_stream.event_time, INTERVAL '1' SECOND) LIMIT 3;
+---------------------+---------------------+------+--------+--------+
| EXPR$0              | EXPR$1              | team | EXPR$3 | EXPR$4 |
+---------------------+---------------------+------+--------+--------+
| 2019-11-04 04:11:38 | 2019-11-04 04:11:38 | blue | 420    | 7      |
| 2019-11-04 04:11:38 | 2019-11-04 04:11:38 | red  | 960    | 18     |
| 2019-11-04 04:11:42 | 2019-11-04 04:11:42 | blue | 452    | 11     |
+---------------------+---------------------+------+--------+--------+
3 rows selected (9.197 seconds)
0: BeamSQL>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


Calcite-Master - Build # 1408 - Failure

2019-11-03 Thread Apache Jenkins Server
The Apache Jenkins build system has built Calcite-Master (build #1408)

Status: Failure

Check console output at https://builds.apache.org/job/Calcite-Master/1408/ to 
view the results.

Re: [ANNOUNCE] Danny Chan joins Calcite PMC

2019-11-03 Thread Julian Feinauer
Congratulations Danny! Very well deserved!

Julian

On 01.11.19, 20:49, "Muhammad Gelbana" wrote:

Congratulations!

Thanks,
Gelbana


On Fri, Nov 1, 2019 at 9:07 AM Stamatis Zampetakis 
wrote:

> Congratulations Danny!
>
> You are doing an amazing job. The project and the community is becoming
> better every day and your help is much appreciated.
>
> Keep up the momentum!
>
> Best,
> Stamatis
>
> On Thu, Oct 31, 2019 at 4:41 AM Kurt Young  wrote:
>
> > Congratulations Danny!
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Oct 31, 2019 at 11:18 AM Danny Chan 
> wrote:
> >
> > > Thank you so much colleagues, it’s my honor to work with you!
> > >
> > > I have always felt respected and the harmony of the community, hope to
> > > contribute more and I would give help as best as I can, thanks !
> > >
> > > Best,
> > > Danny Chan
> > > On Oct 31, 2019, at 5:22 AM +0800, Francis Chuang wrote:
> > > > I'm pleased to announce that Danny has accepted an invitation to
> > > > join the Calcite PMC. Danny has been a consistent and helpful
> > > > figure in the Calcite community for which we are very grateful. We
> > > > look forward to his continued contributions and support.
> > > >
> > > > Please join me in congratulating Danny!
> > > >
> > > > - Francis (on behalf of the Calcite PMC)
> > >
> >
>