[
https://issues.apache.org/jira/browse/CALCITE-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704984#comment-16704984
]
Stamatis Zampetakis commented on CALCITE-2703:
----------------------------------------------
[~vladimirsitnikov] Calcite's implementation of PreparedStatement ends up
calling EnumerableInterpretable#toBindable through
CalcitePrepareImpl.CalcitePreparingStmt#implement. All the code generation
magic happens there, more precisely inside the
EnumerableInterpretable#getBindable method.
[~julianhyde] I understand your concern, but I still believe it is worth the
effort. I agree that the test query patterns are not like production patterns,
but this does not change the fact that the same query patterns appear multiple
times. Production workloads tend to repeat the same queries many times; I would
dare to say that production workloads do not introduce new queries very often.
I am well aware that a production system may have its performance bottleneck
elsewhere, but this does not mean that code compilation/class loading comes for
free. The fact that the test suite needs *2h* *instead of 1h* because of it
certainly raises a warning flag.
To better showcase the advantages/disadvantages of using a cache, I did some
micro-benchmarks (code included in the PR) using JMH. The full result report
from my local machine can be found
[here|https://docs.google.com/spreadsheets/d/1yVIjan8Nw-aCQOmvYLgj1PdlkrjeblNHihvfhnc5yf4/edit?usp=sharing].
For convenience, I provide below a small extract with the average response
times per operation:
||Benchmark||(cacheSize)||(queries)||Avg time||Units||
|CodeGenerationBenchmark.getBindableNoCache|N/A|1|11289368.49|ns/op|
| | |10|11826950.78|ns/op|
| | |100|12633361.83|ns/op|
| | |1000|13941583.80|ns/op|
|CodeGenerationBenchmark.getBindableWithCacheAndCandidateDetector|10|1|4864.99|ns/op|
| | |10|5916.71|ns/op|
| | |100|11749056.09|ns/op|
| | |1000|12479969.01|ns/op|
| |100|1|4102.78|ns/op|
| | |10|5377.52|ns/op|
| | |100|9507081.47|ns/op|
| | |1000|11093308.52|ns/op|
| |1000|1|5297.09|ns/op|
| | |10|7506.15|ns/op|
| | |100|10994651.69|ns/op|
| | |1000|15484723.50|ns/op|
+Note+: when the query workload above contains 100 or more queries, most
operations lead to cache misses (since the measurement time is set to 1 sec).
*Bindables without cache*
The existing code has a few drawbacks.
_1. Compilation/Class loading overhead_: I would like to point out that the
current implementation (getBindableNoCache) takes on average 12ms (with a hot
VM this can go lower, but still), which I consider a lot. For simple, highly
selective queries this can easily exceed the time needed to actually execute
the query. This is a general problem of compiled vs. interpreted programs.
Someone could argue that a few ms is not really a big deal, which brings me to
the second drawback.
_2. Metaspace pollution/Class unloading overhead:_ No cache means that a new
class is loaded for every query, consuming both heap and Metaspace memory.
This translates to higher memory requirements and increased garbage collector
activity, affecting performance negatively. Although the cost of loading
classes roughly appears in the previous benchmark, the cost of unloading
(performed by the GC) is more difficult to measure in a few benchmark
iterations.
_3. Increased JIT activity:_ Since new classes are generated for every query,
the JIT compiler has to step in to optimise the new bytecode and make the
queries run faster. Part of the high response times reported above is
attributable to the JIT. JIT compilation runs in separate threads, each one
occupying more than 50% of the available cores' processing capacity. In
lightly threaded applications this can easily pass unnoticed, since running
threads can always find an available core, but as the number of threads grows
the JIT threads have to compete with application threads, affecting the
overall performance of the system.
*Bindables with cache*
Adding a cache layer also has a few drawbacks.
_1. Cache access overhead_: An additional cache layer means additional
overhead in every query for accessing the cache. In the case that we never hit
the cache, we are paying an extra cost for nothing. In the previous benchmark
this penalty is negligible (i.e., ~2ms extra in the worst case), and hopefully
such scenarios should be rather rare.
_2. Heap space overhead_: The approach of using the Java code as the key for
the cache is not memory efficient either, but this can be mitigated by
choosing a relatively small cache size (e.g., 100 or 1000 entries).
_3. Code complexity_: An additional cache layer adds complexity in terms of
implementation and code maintenance.
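To make the idea more concrete, here is a minimal, self-contained sketch of such a cache layer. It maps the generated Java source (the cache key) to the compiled artifact and evicts least-recently-used entries once the configured size is exceeded. The PR uses a Guava cache; this stand-alone version uses an access-ordered LinkedHashMap instead, and the class/method names (BindableCache, maxSize, the compiler function) are hypothetical, for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed cache: key = generated Java source,
// value = the compiled object (a Bindable in Calcite). An access-ordered
// LinkedHashMap with removeEldestEntry gives simple LRU eviction.
class BindableCache<V> {
  private final Map<String, V> cache;

  BindableCache(final int maxSize) {
    // accessOrder=true turns the LinkedHashMap into an LRU structure.
    this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxSize;  // evict once the bound is exceeded
      }
    };
  }

  // Compile (and load a class) only on a cache miss; hits skip the
  // expensive code generation/class loading path entirely.
  synchronized V get(String javaCode, Function<String, V> compiler) {
    return cache.computeIfAbsent(javaCode, compiler);
  }

  synchronized int size() {
    return cache.size();
  }
}
```

With such a structure, the compilation cost is paid once per distinct generated source while the memory footprint stays bounded by maxSize, which matches the heap-space mitigation mentioned in drawback 2 above.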
*Conclusion*
Overall, I am rather positive about the idea of the cache for the various
reasons outlined above, so I will be happy to discuss it further. Let me know
what you think!
> Reduce code generation and class loading overhead when executing queries in
> the EnumerableConvention
> ----------------------------------------------------------------------------------------------------
>
> Key: CALCITE-2703
> URL: https://issues.apache.org/jira/browse/CALCITE-2703
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.17.0
> Reporter: Stamatis Zampetakis
> Assignee: Julian Hyde
> Priority: Major
> Labels: performance
> Fix For: 1.18.0
>
>
> The queries using Calcite's EnumerableConvention always end up generating new
> Java classes at runtime (using Janino) that are then instantiated using
> reflection. This combination of class generation and class loading introduces
> a big overhead in query response time.
> In a quick profiling of our company's internal test suite, consisting of 4000
> tests with roughly 430000 SQL queries passing through Calcite, we observed
> that a big amount of time is spent on code generation and class loading,
> making the EnumerableInterpretable#toBindable method a performance
> bottleneck.
> Among the 430000 SQL queries there are many duplicates, which lead to the
> generation of exactly the same Java code. Introducing a small cache at the
> level of the EnumerableInterpretable class could avoid generating
> and loading the same code over and over again.
> A simple implementation based on Guava improved the overall execution time of
> the aforementioned test suite by more than 50%.
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)