Re: Re: Problem with converters and possibly rule matching
Hi Vladimir,

This can still be done through the top-down request approach. The PhysicalFilter operator should request ANY distribution from its child operator, unless there is an outer reference in the filter condition, in which case PhysicalFilter should request SINGLETON or BROADCAST distribution. So in your case, PhysicalFilter requests ANY, and the required distribution will be enforced on the filter's output.

Regarding index usage, you should have a FilterTableScan2IndexGet logical transformation rule and an IndexGet2IndexScan physical implementation rule. Note that IndexGet is a logical operator and IndexScan is a physical operator; these names are also used by SQL Server.

- Haisheng

--
From: Vladimir Ozerov
Date: 2019-11-01 17:30:26
To:
Subject: Re: Problem with converters and possibly rule matching

Hi Stamatis,

Thank you for your reply. I also thought that we may set the distribution trait during logical planning because it is known in advance. And the example I gave will work! :-) But unfortunately, it will work only because the tree is very simple, and the Project is adjacent to the Scan. This is how my reproducer would work in that case:
1) Root: enforce SINGLETON on Project
2) Project: check the logical Scan, infer the already resolved distribution, then convert to [PhysicalProject <- PhysicalScan]
3) Resolve the Root enforcer, adding an Exchange if needed.

But this stops working as soon as the plan becomes complex enough that it is impossible to infer the distribution from the child immediately. E.g.:

LogicalRoot [distribution=SINGLETON]
-> LogicalProject // We are here now and cannot produce the physical project
  -> LogicalFilter [distribution=?]
    -> LogicalScan [distribution=REPLICATED]

This is where your suggestion with cascading enforcement may kick in. But now consider that instead of having REPLICATED distribution, which satisfies SINGLETON, we have a PARTITIONED distribution. It doesn't satisfy SINGLETON, so we have to insert the exchange.
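The request logic Haisheng describes for PhysicalFilter can be sketched in a few lines. Everything here (the Distribution enum, the method name) is a hypothetical stand-in for illustration, not actual Calcite API:

```java
/** Toy model of the request-driven approach described above.
 *  Distribution and childRequest are hypothetical stand-ins,
 *  not actual Calcite API. */
public class FilterTraitRequest {

    enum Distribution { ANY, SINGLETON, BROADCAST, PARTITIONED, REPLICATED }

    /** A filter normally requests ANY from its child; only when the
     *  condition contains an outer reference must the data be brought
     *  to one place (SINGLETON) or replicated (BROADCAST). */
    static Distribution childRequest(boolean conditionHasOuterReference) {
        return conditionHasOuterReference
                ? Distribution.SINGLETON
                : Distribution.ANY;
    }

    public static void main(String[] args) {
        System.out.println("plain filter requests:      " + childRequest(false));
        System.out.println("correlated filter requests: " + childRequest(true));
    }
}
```

The point is that the filter itself imposes almost no requirement; the parent's requirement is enforced on the filter's output instead.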
Before:

PhysicalRoot [SINGLETON]
-> PhysicalProject [SINGLETON]
  -> PhysicalFilter [SINGLETON]
    -> LogicalScan [PARTITIONED] // SINGLETON is enforced here

After:

PhysicalRoot [SINGLETON]
-> PhysicalProject [SINGLETON]
  -> PhysicalFilter [SINGLETON]
    -> SingletonExchange [SINGLETON]
      -> PhysicalScan [PARTITIONED]

But unfortunately, this plan is not optimal. Since we perform the Project and Filter after the exchange, we now have to reshuffle the whole table. The optimal plan would be to push the Project and Filter down past the exchange:

PhysicalRoot [SINGLETON]
-> SingletonExchange [SINGLETON]
  -> PhysicalProject [PARTITIONED]
    -> PhysicalFilter [PARTITIONED]
      -> PhysicalScan [PARTITIONED]

But in order to achieve that, I would need to introduce new rules that attempt to transpose dozens of different operators past the exchange, so most likely the planning would never finish :-) This is why the bottom-up approach seems more natural for such cases.

Another example is index support. A table may have a number of access methods in addition to a plain scan, and every such method may produce a physical scan with a different collation. It is not practical to try to enforce something from the top. Instead, we'd better produce alternatives from the bottom.

Basically, Drill tries to mix the two approaches. First, it tries to derive traits from the child. If that fails and no transformations were created, it simply stops trait propagation. As a result, an exchange is created on top of the operator where propagation stopped, which again leads to suboptimal plans. You may observe this in action if you run my reproducer: "testPass" produces an optimal plan with the help of abstract converters, while "testDrill" produces a suboptimal plan with an unnecessary exchange.

All in all, I cannot see how we can employ distribution metadata and index metadata for optimal planning without using a pure bottom-up approach.

Regards,
Vladimir.

On Fri, Nov 1, 2019 at 11:42, Stamatis Zampetakis wrote:

> The discussion is interesting, thanks for the nice examples.
> I am replying in this thread since it has all the examples and the history
> of the conversation.
>
> *Part I: Distribution physical or logical property*
>
> The Volcano paper writes the following regarding properties:
>
> "Logical properties can be derived from the logical algebra expression and
> include schema, expected size, etc., while physical properties depend on
> algorithms, e.g., sort order, partitioning, etc."
>
> We could say that partitioning is not only a property which depends on an
> algorithm but a property which depends on the schema. When you have a table
> you know in advance that the particular table is replicated.
>
> Basically, this means that the starting "logical" plan in your case could
> be the following:
>
> RootLogicalRel[convention=LOGICAL, distribution=ANY]
> -> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
>   -> MapScanLogicalRel[convention=LOGICAL,
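Returning to Vladimir's exchange-placement example, a back-of-the-envelope comparison of rows moved by the two plans shows why pushing the Filter below the exchange matters. The row count and filter selectivity are made-up illustrative numbers:

```java
/** Back-of-the-envelope cost comparison for the two plans above:
 *  shuffling before vs. after Project/Filter. The row count and the
 *  filter selectivity are invented for illustration. */
public class ExchangePlacementCost {

    static long rowsShuffled(long tableRows, double filterSelectivity,
                             boolean filterBelowExchange) {
        // If the filter runs below the exchange, only surviving rows move
        // over the network; otherwise the whole table is reshuffled.
        return filterBelowExchange
                ? (long) (tableRows * filterSelectivity)
                : tableRows;
    }

    public static void main(String[] args) {
        long rows = 1_000_000L;
        double sel = 0.01; // assume 1% of rows pass the filter
        System.out.println("exchange below filter: "
                + rowsShuffled(rows, sel, false) + " rows moved");
        System.out.println("filter below exchange: "
                + rowsShuffled(rows, sel, true) + " rows moved");
    }
}
```

With a selective filter the difference is two orders of magnitude, which is why the transposed plan wins despite the extra rule applications needed to find it.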
Re: Re: [DISCUSS] On-demand traitset request
Hi Jinfeng,

I think you might have missed the email about the proposed API for physical operators I sent out previously in [1]. We don't need to request all the permutations, which is also impossible in practice; the search space would explode. In the example in email [1], I already addressed your concern that passing the parent's request down to children may lead to a less optimal plan. Basically, the join operator can send two collation optimization requests: one passes the parent's request down, the other ignores it.

Using AbstractConverter to enforce properties is inappropriate; it hands all the optimization work to physical operator providers, meaning there is almost no physical-level optimization mechanism in Calcite. SQL Server's and Greenplum's optimizers, which are based on the Cascades framework, implement property enforcement in the core optimizer engine, not through AbstractConverter and rules; physical operators just need to implement those methods (or similar ones) I mentioned in email [1]. My goal is to completely abolish AbstractConverter.

[1] http://mail-archives.apache.org/mod_mbox/calcite-dev/201910.mbox/%3cd75b20f4-542a-4a73-897e-66ab426494c1.h.y...@alibaba-inc.com%3e

- Haisheng

--
From: Jinfeng Ni
Date: 2019-11-01 14:10:30
To:
Subject: Re: [DISCUSS] On-demand traitset request

Hi Xiening,

"Let say if R and S doesn’t have sorting properties at all. In your case, we would end up adding enforcers for LHS and RHS to get collation (a, b, c). Then we would need another enforcer to get collation (b, c). This is a sub optimal plan as we could have use (b, c, a) for join."

In such a case, for step 2, when MergeJoin requests a permutation match of (a, b, c) on both of its inputs, it is not necessary to end up with collation (a, b, c) only. Since it requests a "permutation", MJ could ask for all possible satisfying collations, which include (b, c, a). In other words, the steps I described did not exclude such a plan. You may argue it would increase the search space.
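Haisheng's "two requests" idea above can be sketched as follows. The classes, method, and string-list model of collations are hypothetical stand-ins for illustration, not the API proposed in [1]:

```java
import java.util.List;

/** Sketch of the "two collation requests" idea: a merge join can either
 *  pass the parent's collation request down or ignore it and request its
 *  own join-key order. Collations are modelled as column-name lists;
 *  this is illustrative, not Calcite API. */
public class JoinCollationRequests {

    static List<List<String>> collationRequests(List<String> parentRequest,
                                                List<String> joinKeys) {
        return List.of(
                parentRequest, // option 1: honor the parent (avoids a sort above the join)
                joinKeys);     // option 2: ignore it, use the natural join-key order
    }

    public static void main(String[] args) {
        List<List<String>> reqs =
                collationRequests(List.of("b", "c", "a"), List.of("a", "b", "c"));
        System.out.println("alternatives generated: " + reqs);
    }
}
```

Only these two alternatives are enumerated per parent request, which is how the approach keeps the search space bounded; the cost model then picks the winner.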
However, by limiting the search space without exploring all possible choices, we may lose the chance of getting the 'optimal' plan we want. For instance, in the above example, the idea of passing an "on demand" trait request (b, c) from Agg to MJ is to avoid an unnecessary sort on (b, c). In cases where the join condition has good filtering, such a sort of the join output could be quite cheap. Yet in the plan enumeration, since we use the "on demand" trait request from the parent to guide the actions of MJ, I'm not sure whether we may restrict the choices we consider in the legs of the join, whose cardinality could be larger and play a bigger role in the overall cost. In other words, by using "on demand" trait requests, we may restrict the choice of plans, possibly in the operators with larger data sizes.

In the current implementation of VolcanoPlanner, I feel the root issue of long planning time is not exploring all possible satisfying traits. It is actually the unnecessary AbstractConverters added to the equivalence classes.

On Fri, Oct 18, 2019 at 10:39 PM Xiening Dai wrote:
>
> Thanks for the sharing. I like the way you model this problem, Jinfeng.
>
> There’s one minor issue with your example. Let say if R and S doesn’t have
> sorting properties at all. In your case, we would end up adding enforcers for
> LHS and RHS to get collation (a, b, c). Then we would need another enforcer
> to get collation (b, c). This is a sub optimal plan as we could have use (b,
> c, a) for join.
>
> I think in step #2, the join operator would need to take the agg trait
> requirement into account. Then it would have two options -
>
> 1) require *exact/super* match of (b, c, a) or (c, b, a); this is to
> guarantee the join output would deliver the collation agg needs.
> 2) require permutation match of (a, b, c); in such case, an enforcer might be
> needed for aggregation.
>
> Eventually the cost model decides who is the winner.
>
> There’s a fundamental difference between your model and Haisheng’s proposal.
> In Haisheng’s case, a rel node not only looks at its parent’s requirement,
> but also tries to get the potential traits its input could deliver. It would
> try to align them to eliminate unnecessary alternatives.
>
> In the above example, assuming R is (b, c, a) and S is (a, b, c), to implement
> option 1), we would generate two alternatives -
>
> MergeJoin (b, c, a)
>   TableScan R
>   Sort(b, c, a)
>     TableScan S
>
> MergeJoin(c, b, a)
>   Sort(c, b, a)
>     TableScan R
>   Sort(c, b, a)
>     TableScan S
>
> But if we look at the input traits and have the insight that R already
> delivers (b, c, a), we could decide to require (b, c, a) only and avoid
> generating the 2nd plan, which is definitely worse, and reduce the search
> space.
>
> > On Oct 18, 2019, at 4:57 PM, Jinfeng Ni wrote:
> >
> > A little bit of history. In Drill, when we first implemented
> > Distribution trait's definition, we allows both exact match
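The permutation-match idea debated in this thread can be illustrated with a small standalone sketch: enumerate every collation that satisfies a merge join on keys (a, b, c) and keep those that also start with the parent's requested prefix (b, c), so (b, c, a) is found without an extra sort. The helper names are made up for illustration; Calcite models collations with RelCollation rather than string lists:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of permutation matching of collations. Not Calcite API;
 *  collations are modelled as column-name lists. */
public class PermutationMatch {

    /** All orderings of the given columns. */
    static List<List<String>> permutations(List<String> cols) {
        List<List<String>> out = new ArrayList<>();
        if (cols.isEmpty()) {
            out.add(new ArrayList<>());
            return out;
        }
        for (String head : cols) {
            List<String> rest = new ArrayList<>(cols);
            rest.remove(head);
            for (List<String> tail : permutations(rest)) {
                List<String> perm = new ArrayList<>();
                perm.add(head);
                perm.addAll(tail);
                out.add(perm);
            }
        }
        return out;
    }

    /** Join-key permutations whose leading columns match the parent's request. */
    static List<List<String>> satisfying(List<String> joinKeys, List<String> prefix) {
        List<List<String>> out = new ArrayList<>();
        for (List<String> perm : permutations(joinKeys)) {
            if (perm.subList(0, prefix.size()).equals(prefix)) {
                out.add(perm);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Join keys (a, b, c); the aggregation above requires prefix (b, c).
        System.out.println(satisfying(List.of("a", "b", "c"), List.of("b", "c")));
        // → [[b, c, a]]
    }
}
```

Of the six permutations of (a, b, c), only (b, c, a) serves both the join and the aggregation, which is exactly the plan the thread argues should not be missed.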
[jira] [Created] (CALCITE-3472) SESSION_END returns same value as SESSION_START
Pablo Estrada created CALCITE-3472:
--
Summary: SESSION_END returns same value as SESSION_START
Key: CALCITE-3472
URL: https://issues.apache.org/jira/browse/CALCITE-3472
Project: Calcite
Issue Type: Improvement
Components: stream
Reporter: Pablo Estrada

We've found that the session_end function seems to be returning the same values as session_start. Furthermore, it seems that they may have the same implementation, though I'm not fully familiar with the code:
[https://github.com/apache/calcite/blob/d7946a94adfd2e788f5d324910944dd65dab11ee/core/src/main/java/org/apache/calcite/sql2rel/AuxiliaryConverter.java#L50-L68]

An example with Beam SQL:

0: BeamSQL> SELECT
. . . . . >   SESSION_START(scores_stream.event_time, INTERVAL '1' SECOND),
. . . . . >   SESSION_END(scores_stream.event_time, INTERVAL '1' SECOND),
. . . . . >   scores_stream.team,
. . . . . >   SUM(scores_stream.score),
. . . . . >   COUNT(*)
. . . . . > FROM
. . . . . >   scores_stream
. . . . . > GROUP BY scores_stream.team,
. . . . . >   SESSION(scores_stream.event_time, INTERVAL '1' SECOND)
. . . . . > LIMIT 3;

+---------------------+---------------------+------+--------+--------+
|       EXPR$0        |       EXPR$1        | team | EXPR$3 | EXPR$4 |
+---------------------+---------------------+------+--------+--------+
| 2019-11-04 04:11:38 | 2019-11-04 04:11:38 | blue | 420    | 7      |
| 2019-11-04 04:11:38 | 2019-11-04 04:11:38 | red  | 960    | 18     |
| 2019-11-04 04:11:42 | 2019-11-04 04:11:42 | blue | 452    | 11     |
+---------------------+---------------------+------+--------+--------+
3 rows selected (9.197 seconds)
0: BeamSQL>

--
This message was sent by Atlassian Jira (v8.3.4#803005)
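For reference, under the usual session-window semantics a session's end is the last event time plus the gap, so SESSION_END should normally differ from SESSION_START (as the EXPR$0/EXPR$1 columns above fail to show). The sketch below is a toy model of those semantics, not Calcite's implementation:

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of session-window semantics: events closer together than
 *  the gap belong to one session; the session starts at its first event
 *  and ends at its last event plus the gap. Not Calcite code. */
public class SessionWindowDemo {

    /** Returns {start, end} pairs for sessions over sorted event times. */
    static List<long[]> sessions(long[] eventMillis, long gapMillis) {
        List<long[]> out = new ArrayList<>();
        long start = eventMillis[0];
        long last = eventMillis[0];
        for (int i = 1; i < eventMillis.length; i++) {
            if (eventMillis[i] - last > gapMillis) { // gap exceeded: close session
                out.add(new long[] {start, last + gapMillis});
                start = eventMillis[i];
            }
            last = eventMillis[i];
        }
        out.add(new long[] {start, last + gapMillis});
        return out;
    }

    public static void main(String[] args) {
        // Events at t=0ms, 400ms, 3000ms with a 1-second gap: two sessions.
        for (long[] s : sessions(new long[] {0, 400, 3000}, 1000)) {
            System.out.println("session start=" + s[0] + " end=" + s[1]);
        }
    }
}
```

In this model the first session spans [0, 1400) and the second [3000, 4000); start and end of a session never coincide, which is what makes the identical columns in the query output look like a bug.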
Calcite-Master - Build # 1408 - Failure
The Apache Jenkins build system has built Calcite-Master (build #1408)

Status: Failure

Check console output at https://builds.apache.org/job/Calcite-Master/1408/ to view the results.
Re: [ANNOUNCE] Danny Chan joins Calcite PMC
Congratulations Danny! Very well deserved!

Julian

On 01.11.19, 20:49, "Muhammad Gelbana" wrote:

Congratulations!

Thanks,
Gelbana

On Fri, Nov 1, 2019 at 9:07 AM Stamatis Zampetakis wrote:
> Congratulations Danny!
>
> You are doing an amazing job. The project and the community are becoming
> better every day and your help is much appreciated.
>
> Keep up the momentum!
>
> Best,
> Stamatis
>
> On Thu, Oct 31, 2019 at 4:41 AM Kurt Young wrote:
> >
> > Congratulations Danny!
> >
> > Best,
> > Kurt
> >
> > On Thu, Oct 31, 2019 at 11:18 AM Danny Chan wrote:
> > >
> > > Thank you so much colleagues, it’s my honor to work with you!
> > >
> > > I have always felt respected and the harmony of the community, hope to
> > > contribute more and I would give help as best as I can, thanks!
> > >
> > > Best,
> > > Danny Chan
> > > On Oct 31, 2019, 5:22 AM +0800, Francis Chuang wrote:
> > > > I'm pleased to announce that Danny has accepted an invitation to
> > > > join the Calcite PMC. Danny has been a consistent and helpful
> > > > figure in the Calcite community for which we are very grateful. We
> > > > look forward to the continued contributions and support.
> > > >
> > > > Please join me in congratulating Danny!
> > > >
> > > > - Francis (on behalf of the Calcite PMC)