Re: Materialization performance

Jesus Camacho Rodriguez Tue, 29 Aug 2017 08:26:18 -0700

LGTM. I think that by the time we have support for outer joins, I might have
had time to finish the filter tree index implementation too.


-Jesús



On 8/29/17, 3:11 AM, "Christian Beikov" <[email protected]> wrote:

>I'd like to stick to trying to figure out how to support outer joins for 
>now and when I have an implementation for that, I'd look into the filter 
>tree index if you haven't done it by then.
>
>
>Best regards,
>------------------------------------------------------------------------
>*Christian Beikov*
>On 28.08.2017 at 20:01, Jesus Camacho Rodriguez wrote:
>> Christian,
>>
>> The implementation of the filter tree index is what I was referring to
>> indeed. In the initial implementation I focused on the rewriting coverage,
>> but now that the first part is finished, it is at the top of my list as
>> I think it is critical to make the whole query rewriting algorithm work
>> at scale. However, I have not started yet.
>>
>> The filter tree index will help to filter out views not only based on the
>> tables used by a given query, but also when they do not meet the
>> equivalence class conditions, filter conditions, etc. We could implement
>> all the preconditions mentioned in the paper, and we could add our own
>> additional ones. I also think that in a second version, we might need to
>> add some kind of ranking/limit, as many views might meet the preconditions
>> for a given query.
>>
>> It seems you understood how it should work, so if you could help to
>> quickstart that work by maybe implementing a first version of the filter
>> tree index with a couple of basic conditions (table matching and EC
>> matching?), that would be great. I could review any of the contributions
>> you make.
>>
>> -Jesús
>>
>>
>>
>>
>>
>> On 8/28/17, 3:22 AM, "Christian Beikov" <[email protected]> wrote:
>>
>>> If the metadata were cached, that would be awesome, especially because
>>> that would also improve the performance of the metadata retrieval
>>> for the query currently being planned, although I am not sure how the
>>> caching would work since the RelNodes are mutable.
>>>
>>> Have you considered implementing the filter tree index explained in the
>>> paper? As far as I understood, the whole thing only works when a
>>> redundant table elimination is implemented. Is that the case? If so, or
>>> if it can be done easily, I'd propose we initialize all the lookup
>>> structures during registration and use them during planning. This will
>>> improve planning time drastically and essentially handle the scalability
>>> problem you mention.
>>>
>>> What other MV-related issues are on your personal to-do list, Jesus? I
>>> read the paper now and think I can help you in one place or another if
>>> you want.
>>>
>>>
>>> Best regards,
>>> ------------------------------------------------------------------------
>>> *Christian Beikov*
>>> On 28.08.2017 at 08:13, Jesus Camacho Rodriguez wrote:
>>>> Hive does not use the Calcite SQL parser, thus we follow a different path
>>>> and did not experience the problem on the Calcite end. However, FWIW we
>>>> avoided reparsing the SQL every time a query was being planned by
>>>> creating/managing our own cache too.
>>>>
>>>> The metadata providers implement some caching, thus I would expect that
>>>> once you avoid reparsing every MV, the retrieval time of predicates,
>>>> lineage, etc. would improve (at least after using the MV for the first
>>>> time). However, I agree that the information should be inferred when the
>>>> MV is loaded. In fact, maybe just making some calls to the metadata
>>>> providers while the MVs are being loaded would do the trick (Julian
>>>> should confirm this).
>>>>
>>>> Btw, you will probably find another scalability issue as the number of MVs
>>>> grows large with the current implementation of the rewriting, since the
>>>> pre-filtering implementation in place does not discard many of the views
>>>> that are not valid to rewrite a given query, and rewriting is attempted
>>>> with all of them.
>>>> This last bit is work that I would like to tackle shortly, but I have not
>>>> created the corresponding JIRA yet.
>>>>
>>>> -Jesús
>>>>    
>>>>
>>>>
>>>>
>>>> On 8/27/17, 10:43 PM, "Rajat Venkatesh" <[email protected]> wrote:
>>>>
>>>>> Thread safety and repeated parsing are problems. We have experience with
>>>>> managing tens of materialized views. Repeated parsing takes more time than
>>>>> execution of the query itself. We also have a similar problem where
>>>>> concurrent queries (potentially with a different set of materialized
>>>>> views) may be planned at the same time. We solved it by maintaining a
>>>>> cache and carefully setting the cache in a thread local.
>>>>> Relevant code for inspiration:
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/prepare/Materializer.java
>>>>> https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/plan/QuarkMaterializeCluster.java
>>>>>
>>>>>
>>>>>
>>>>> On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hey, I have been looking a bit into how materialized views perform
>>>>>> during planning because of a very long test run
>>>>>> (MaterializationTest#testJoinMaterializationUKFK6), and the current
>>>>>> state is problematic.
>>>>>>
>>>>>> CalcitePrepareImpl#getMaterializations always reparses the SQL, and down
>>>>>> the line there is a lot of expensive work (e.g. predicate and lineage
>>>>>> determination) done during planning that could easily be pre-calculated
>>>>>> and cached during materialization creation.
>>>>>>
>>>>>> There is also a bit of a thread safety problem with the current
>>>>>> implementation. Unless there is a different safety mechanism that I
>>>>>> don't see, the sharing of the MaterializationService and thus also the
>>>>>> maps in MaterializationActor via a static instance between multiple
>>>>>> threads is problematic.
>>>>>>
>>>>>> Since I mentioned thread safety, how is Calcite supposed to be used in a
>>>>>> multi-threaded environment? Currently I use a connection pool that
>>>>>> initializes the schema on new connections, but that is not really nice.
>>>>>> I suppose caches are also bound to the connection? A thread safe context
>>>>>> that can be shared between connections would be nice to avoid all that
>>>>>> repetitive work.
>>>>>>
>>>>>> Are these known issues that you have already thought about how to fix, or
>>>>>> should I log JIRAs for these and fix them to the best of my knowledge?
>>>>>> I'd more or less keep the service shared but would implement it using a
>>>>>> copy-on-write strategy, since I'd expect schema changes to be rare after
>>>>>> startup.
>>>>>>
>>>>>> Regarding the repetitive work that partly happens during planning, I'd
>>>>>> suggest doing that during materialization registration instead, as is
>>>>>> already mentioned in CalcitePrepareImpl#populateMaterializations. Would
>>>>>> that be ok?
>>>>>>
>>>>>> --
>>>>>>
>>>>>> Best regards,
>>>>>> ------------------------------------------------------------------------
>>>>>> *Christian Beikov*
>>>>>>
>
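
For readers following the thread, here is a rough Java sketch of two ideas
discussed above: keeping a shared registry safe with a copy-on-write strategy,
and doing the expensive per-view work once at registration so that planning
only runs a cheap pre-filter. Everything in it is hypothetical and is not
Calcite API: the class MaterializationRegistry, its methods, and the view and
table names are made up for illustration. The table-matching rule (a view is a
candidate only if every table it reads also appears in the query) is a
deliberately simplified stand-in; the preconditions in the paper and a real
filter tree index are richer than this.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical registry for illustration only; not part of Calcite. */
final class MaterializationRegistry {

  /** Immutable entry: per-view data is computed once, at registration. */
  static final class Entry {
    final String name;
    final Set<String> tables;

    Entry(String name, Set<String> tables) {
      this.name = name;
      // Defensive copy so later changes by the caller cannot leak in.
      this.tables = Collections.unmodifiableSet(new HashSet<>(tables));
    }
  }

  /** Current immutable snapshot of all registered views. */
  private final AtomicReference<List<Entry>> snapshot =
      new AtomicReference<>(Collections.<Entry>emptyList());

  /** Copy-on-write: build a new list and swap it in atomically. */
  void register(String name, Set<String> tables) {
    Entry entry = new Entry(name, tables);
    snapshot.updateAndGet(old -> {
      List<Entry> copy = new ArrayList<>(old);
      copy.add(entry);
      return Collections.unmodifiableList(copy);
    });
  }

  /** Readers take no lock; each call works on one consistent snapshot. */
  List<String> candidates(Set<String> queryTables) {
    List<String> result = new ArrayList<>();
    for (Entry e : snapshot.get()) {
      // Simplified precondition (assumption): the view reads only
      // tables that the query also reads.
      if (queryTables.containsAll(e.tables)) {
        result.add(e.name);
      }
    }
    return result;
  }
}
```

Writes pay for a full copy of the list, which seems acceptable under the
assumption stated in the thread that schema changes are rare after startup,
while concurrent planners can read the snapshot without synchronization.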
