LGTM, I think by the time we have support for the outer joins, I might have had time to finish the filter tree index implementation too.
-Jesús

On 8/29/17, 3:11 AM, "Christian Beikov" <[email protected]> wrote:

>I'd like to stick to trying to figure out how to support outer joins for
>now and when I have an implementation for that, I'd look into the filter
>tree index if you haven't done it by then.
>
>
>Kind regards,
>------------------------------------------------------------------------
>*Christian Beikov*
>On 28.08.2017 at 20:01, Jesus Camacho Rodriguez wrote:
>> Christian,
>>
>> The implementation of the filter tree index is what I was referring to
>> indeed. In the initial implementation I focused on the rewriting coverage,
>> but now that the first part is finished, it is at the top of my list as
>> I think it is critical to make the whole query rewriting algorithm work
>> at scale. However, I have not started yet.
>>
>> The filter tree index will help to filter not only based on the tables used
>> by a given query, but also for queries that do not meet the equivalence
>> class conditions, filter conditions, etc. We could implement all the
>> preconditions mentioned in the paper, and we could add our own additional
>> ones. I also think that in a second version, we might need to add
>> some kind of ranking/limit as many views might meet the preconditions for
>> a given query.
>>
>> It seems you understood how it should work, so if you could help to
>> kick-start that work by maybe implementing a first version of the filter
>> tree index with a couple of basic conditions (table matching and EC
>> matching?), that would be great. I could review any of the contributions
>> you make.
>>
>> -Jesús
>>
>> On 8/28/17, 3:22 AM, "Christian Beikov" <[email protected]> wrote:
>>
>>> If the metadata was cached, that would be awesome, especially because
>>> that would also improve the performance regarding the metadata retrieval
>>> for the query currently being planned, although I am not sure how the
>>> caching would work since the RelNodes are mutable.
>>>
>>> Have you considered implementing the filter tree index explained in the
>>> paper? As far as I understood, the whole thing only works when a
>>> redundant table elimination is implemented. Is that the case? If so, or
>>> if it can be done easily, I'd propose we initialize all the lookup
>>> structures during registration and use them during planning. This will
>>> improve planning time drastically and essentially handle the scalability
>>> problem you mention.
>>>
>>> What other MV-related issues are on your personal todo list, Jesus? I
>>> read the paper now and think I can help you in one place or another if
>>> you want.
>>>
>>>
>>> Kind regards,
>>> ------------------------------------------------------------------------
>>> *Christian Beikov*
>>> On 28.08.2017 at 08:13, Jesus Camacho Rodriguez wrote:
>>>> Hive does not use the Calcite SQL parser, thus we follow a different path
>>>> and did not experience the problem on the Calcite end. However, FWIW we
>>>> avoided reparsing the SQL every time a query was being planned by
>>>> creating/managing our own cache too.
>>>>
>>>> The metadata providers implement some caching, thus I would expect that
>>>> once you avoid reparsing every MV, the retrieval time of predicates,
>>>> lineage, etc. would improve (at least after using the MV for the first
>>>> time). However, I agree that the information should be inferred when the
>>>> MV is loaded. In fact, maybe just making some calls to the metadata
>>>> providers while the MVs are being loaded would do the trick (Julian
>>>> should confirm this).
>>>>
>>>> Btw, you will probably find another scalability issue with the current
>>>> implementation of the rewriting as the number of MVs grows large, since
>>>> the pre-filtering implementation in place does not discard many of the
>>>> views that are not valid to rewrite a given query, and rewriting is
>>>> attempted with all of them.
>>>> This last bit is work that I would like to tackle shortly, but I have not
>>>> created the corresponding JIRA yet.
>>>>
>>>> -Jesús
>>>>
>>>> On 8/27/17, 10:43 PM, "Rajat Venkatesh" <[email protected]> wrote:
>>>>
>>>>> Thread safety and repeated parsing are a problem. We have experience with
>>>>> managing tens of materialized views. Repeated parsing takes more time than
>>>>> execution of the query itself. We also have a similar problem where
>>>>> concurrent queries (with a potentially different set of materialized
>>>>> views) may be planned at the same time. We solved it by maintaining a
>>>>> cache and carefully setting the cache in a thread local.
>>>>> Relevant code for inspiration:
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/prepare/Materializer.java
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/plan/QuarkMaterializeCluster.java
>>>>>
>>>>> On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey, I have been looking a bit into how materialized views perform
>>>>>> during the planning because of a very long test
>>>>>> run (MaterializationTest#testJoinMaterializationUKFK6) and the current
>>>>>> state is problematic.
>>>>>>
>>>>>> CalcitePrepareImpl#getMaterializations always reparses the SQL and down
>>>>>> the line, there is a lot of expensive work (e.g. predicate and lineage
>>>>>> determination) done during planning that could easily be pre-calculated
>>>>>> and cached during materialization creation.
>>>>>>
>>>>>> There is also a bit of a thread safety problem with the current
>>>>>> implementation. Unless there is a different safety mechanism that I
>>>>>> don't see, the sharing of the MaterializationService and thus also the
>>>>>> maps in MaterializationActor via a static instance between multiple
>>>>>> threads is problematic.
>>>>>>
>>>>>> Since I mentioned thread safety, how is Calcite supposed to be used in a
>>>>>> multi-threaded environment? Currently I use a connection pool that
>>>>>> initializes the schema on new connections, but that is not really nice.
>>>>>> I suppose caches are also bound to the connection? A thread-safe context
>>>>>> that can be shared between connections would be nice to avoid all that
>>>>>> repetitive work.
>>>>>>
>>>>>> Are these known issues which you have thought about how to fix or should
>>>>>> I log JIRAs for these and fix them to the best of my knowledge? I'd more
>>>>>> or less keep the service shared but would implement it using a
>>>>>> copy-on-write strategy since I'd expect schema changes to be rare after
>>>>>> startup.
>>>>>>
>>>>>> Regarding the repetitive work that partly happens during planning, I'd
>>>>>> suggest doing that during materialization registration instead, as is
>>>>>> already mentioned in CalcitePrepareImpl#populateMaterializations. Would
>>>>>> that be ok?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Kind regards,
>>>>>> ------------------------------------------------------------------------
>>>>>> *Christian Beikov*
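
For illustration, the copy-on-write idea proposed in the original message might look roughly like the sketch below. It is only a sketch under assumptions: CopyOnWriteRegistry, register and view are hypothetical names, not part of Calcite's MaterializationService or MaterializationActor, and the Function passed to register merely stands in for the expensive per-view work (parsing, predicate and lineage determination) that the thread suggests doing once at registration time.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Function;

/** Hypothetical registry; NOT Calcite API. Readers never lock. */
final class CopyOnWriteRegistry<K, V> {
  // Current immutable snapshot; replaced wholesale on every write.
  private final AtomicReference<Map<K, V>> snapshot =
      new AtomicReference<>(Collections.emptyMap());

  /**
   * Registers an entry. The (assumed expensive) initializer runs once here,
   * at registration time, rather than on every planning call.
   */
  V register(K key, Function<K, V> expensiveInit) {
    V value = expensiveInit.apply(key);
    while (true) {
      Map<K, V> current = snapshot.get();
      // Copy, modify the copy, then publish it atomically.
      Map<K, V> next = new HashMap<>(current);
      next.put(key, value);
      if (snapshot.compareAndSet(current, Collections.unmodifiableMap(next))) {
        return value;
      }
      // Another writer published first: retry against the newer snapshot.
    }
  }

  /** Returns the current snapshot; it is never mutated after publication. */
  Map<K, V> view() {
    return snapshot.get();
  }
}
```

A planner thread would call view() once per query and work against that snapshot, so a concurrent registration cannot change the set of views under it. The trade-off is that every registration copies the whole map, which seems acceptable only if schema changes are rare after startup, as assumed in the message.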
