Hive does not use the Calcite SQL parser, thus we follow a different path
and did not experience the problem on the Calcite end. However, FWIW we
avoided reparsing the SQL every time a query was being planned by
creating/managing our own cache too.
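
A minimal sketch of that kind of cache, assuming the parsed form can be keyed
by the view's SQL text. The class name, method names, and the parser function
are hypothetical and are not taken from Hive or Calcite:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

/** Hypothetical cache mapping a view's SQL text to its parsed/planned form. */
final class ParsedViewCache<P> {
  private final ConcurrentMap<String, P> cache = new ConcurrentHashMap<>();
  // Stand-in for the expensive parse/validate/convert step (an assumption).
  private final Function<String, P> parser;

  ParsedViewCache(Function<String, P> parser) {
    this.parser = parser;
  }

  /** Returns the cached form, invoking the parser only on a cache miss. */
  P get(String viewSql) {
    return cache.computeIfAbsent(viewSql, parser);
  }

  /** Drops an entry, e.g. when the view is redefined or dropped. */
  void invalidate(String viewSql) {
    cache.remove(viewSql);
  }
}
```

ConcurrentHashMap#computeIfAbsent applies the function atomically per key, so
concurrent planners asking for the same view should not parse it twice.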

The metadata providers implement some caching, thus I would expect that once
you avoid reparsing every MV, the retrieval time of predicates, lineage, etc.
would improve (at least after using the MV for the first time). However,
I agree that the information should be inferred when the MV is loaded.
In fact, maybe just making some calls to the metadata providers while the MVs
are being loaded would do the trick (Julian should confirm this).

Btw, probably you will find another scalability issue as the number of MVs
grows large with the current implementation of the rewriting, since the
pre-filtering implementation in place does not discard many of the views that
are not valid to rewrite a given query, and rewriting is attempted with all
of them.
This last bit is work that I would like to tackle shortly, but I have not
created the corresponding JIRA yet.
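
To illustrate what a stricter pre-filter could look like, here is a rough
sketch that keeps only the views sharing at least one table with the query.
This is an assumed, deliberately conservative heuristic, not the filter
currently in Calcite, and the class and method names are hypothetical:

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical pre-filter: an inverted index from table name to view names. */
final class ViewPrefilter {
  private final Map<String, Set<String>> viewsByTable = new HashMap<>();

  /** Records which tables a materialized view reads from. */
  void register(String viewName, Collection<String> tablesUsed) {
    for (String table : tablesUsed) {
      viewsByTable.computeIfAbsent(table, t -> new TreeSet<>()).add(viewName);
    }
  }

  /** Returns the views that share at least one table with the query. */
  Set<String> candidates(Collection<String> queryTables) {
    Set<String> result = new TreeSet<>();
    for (String table : queryTables) {
      result.addAll(viewsByTable.getOrDefault(table, Collections.emptySet()));
    }
    return result;
  }
}
```

With an index like this the lookup cost depends on the number of tables in
the query rather than on the total number of MVs. The class is not thread
safe as written, so registration would have to finish before planning starts.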

-Jesús
 



On 8/27/17, 10:43 PM, "Rajat Venkatesh" <rvenkat...@qubole.com> wrote:

>Thread safety and repeated parsing are a problem. We have experience with
>managing tens of materialized views. Repeated parsing takes more time than
>execution of the query itself. We also have a similar problem where
>concurrent queries (with a different set of materialized views potentially)
>may be planned at the same time. We solved it through maintaining a cache
>and carefully setting the cache in a thread local.
>Relevant code for inspiration:
>https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/prepare/Materializer.java
>https://github.com/qubole/quark/blob/master/optimizer/src/main/java/org/apache/calcite/plan/QuarkMaterializeCluster.java
>
>
>
>On Sun, Aug 27, 2017 at 6:50 PM Christian Beikov <christian.bei...@gmail.com>
>wrote:
>
>> Hey, I have been looking a bit into how materialized views perform
>> during the planning because of a very long test
>> run (MaterializationTest#testJoinMaterializationUKFK6) and the current
>> state is problematic.
>>
>> CalcitePrepareImpl#getMaterializations always reparses the SQL and down
>> the line, there is a lot of expensive work (e.g. predicate and lineage
>> determination) done during planning that could easily be pre-calculated
>> and cached during materialization creation.
>>
>> There is also a bit of a thread safety problem with the current
>> implementation. Unless there is a different safety mechanism that I
>> don't see, the sharing of the MaterializationService and thus also the
>> maps in MaterializationActor via a static instance between multiple
>> threads is problematic.
>>
>> Since I mentioned thread safety, how is Calcite supposed to be used in a
>> multi-threaded environment? Currently I use a connection pool that
>> initializes the schema on new connections, but that is not really nice.
>> I suppose caches are also bound to the connection? A thread safe context
>> that can be shared between connections would be nice to avoid all that
>> repetitive work.
>>
>> Are these known issues which you have thought about how to fix or should
>> I log JIRAs for these and fix them to the best of my knowledge? I'd more
>> or less keep the service shared but would implement it using a copy on
>> write strategy since I'd expect seldom schema changes after startup.
>>
>> Regarding the repetitive work that partly happens during planning, I'd
>> suggest doing that during materialization registration instead, as is
>> already mentioned in CalcitePrepareImpl#populateMaterializations. Would
>> that be ok?
>>
>> --
>>
>> Mit freundlichen Grüßen,
>> ------------------------------------------------------------------------
>> *Christian Beikov*
>>
