[
https://issues.apache.org/jira/browse/CALCITE-2703?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16704984#comment-16704984
]
Stamatis Zampetakis commented on CALCITE-2703:
----------------------------------------------
[~vladimirsitnikov] Calcite's implementation of PreparedStatement ends up
calling EnumerableInterpretable#toBindable through
CalcitePrepareImpl.CalcitePreparingStmt#implement. All the code generation
magic happens there, more precisely inside the
EnumerableInterpretable#getBindable method.
[~julianhyde] I understand your concern, but I still believe it is worth the
effort. I agree that the test query patterns are not like production patterns,
but this does not change the fact that the same query patterns appear multiple
times. Production workloads tend to repeat the same queries many times; I would
dare to say that production workloads do not introduce new queries very often.
I am well aware that a production system may have its performance bottleneck
elsewhere, but this does not mean that code compilation/class loading comes for
free. The fact that the test suite needs *2h* *instead of 1h* because of it
certainly raises a warning flag.
To better showcase the advantages/disadvantages of using a cache, I did some
micro-benchmarks (code included in the PR) using JMH. The full result report
from my local machine can be found
[here|https://docs.google.com/spreadsheets/d/1yVIjan8Nw-aCQOmvYLgj1PdlkrjeblNHihvfhnc5yf4/edit?usp=sharing].
For convenience, I provide below a small extract with the average response
times per operation:
||Benchmark||(cacheSize)||(queries)||Avg time||Units||
|CodeGenerationBenchmark.getBindableNoCache|N/A|1|11289368.49|ns/op|
| | |10|11826950.78|ns/op|
| | |100|12633361.83|ns/op|
| | |1000|13941583.80|ns/op|
|CodeGenerationBenchmark.getBindableWithCacheAndCandidateDetector|10|1|4864.99|ns/op|
| | |10|5916.71|ns/op|
| | |100|11749056.09|ns/op|
| | |1000|12479969.01|ns/op|
| |100|1|4102.78|ns/op|
| | |10|5377.52|ns/op|
| | |100|9507081.47|ns/op|
| | |1000|11093308.52|ns/op|
| |1000|1|5297.09|ns/op|
| | |10|7506.15|ns/op|
| | |100|10994651.69|ns/op|
| | |1000|15484723.50|ns/op|
+Note+: when the query workload above contains 100 or more queries, most
operations lead to cache misses (since the measurement time is set to 1 sec).
*Bindables without cache*
The existing code has a few drawbacks.
_1. Compilation/Class loading overhead_: I would like to point out that the
current implementation (getBindableNoCache) takes on average 12ms (with a hot
VM this can go lower, but still), which I consider a lot. For simple, highly
selective queries this can easily exceed the time needed to actually execute
the query. This is a general problem of compiled vs. interpreted programs.
Someone could argue that a few ms is not really a big deal, which brings me to
the second drawback.
_2. Metaspace pollution/Class unloading overhead:_ No cache means that a new
class is loaded for every query, consuming both heap and Metaspace memory.
This translates to higher memory requirements and increased garbage collector
activity, affecting performance negatively. Although the cost of loading
classes roughly appears in the previous benchmark, the cost of unloading
(performed by the GC) is more difficult to measure in a few benchmark
iterations.
_3. Increased JIT activity:_ Since new classes are generated for every query,
the JIT compiler has to step in to optimise the new bytecode and make the
queries run faster. Part of the high response times reported above is
attributable to the JIT. JIT compilation runs in separate threads, each one
occupying more than 50% of the available cores' processing capacity. In
lightly threaded applications this can easily pass unnoticed, since running
threads can always find an available core, but as the number of threads grows
the JIT threads have to compete with application threads, affecting the
overall performance of the system.
*Bindables with cache*
Adding a cache layer also has a few drawbacks.
_1. Cache access overhead_: An additional cache layer means additional
overhead in every query for accessing the cache. In the case that we never hit
the cache, we are paying an extra cost for nothing. In the previous benchmark
this penalty is negligible (i.e., ~2ms extra in the worst case), and hopefully
such scenarios should be rather rare.
_2. Heap space overhead_: The approach of using the Java code as the key for
the cache is not memory efficient either, but this can be mitigated by
choosing a relatively small cache size (e.g., 100 or 1000 entries).
_3. Code complexity_: An additional cache layer adds complexity in terms of
implementation and code maintenance.
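To make the idea more concrete, here is a minimal, self-contained sketch of such a cache layer. It maps the generated Java source (the cache key) to the compiled artifact and evicts least-recently-used entries once the configured size is exceeded. The PR uses a Guava cache; this stand-alone version uses an access-ordered LinkedHashMap instead, and the class/method names (BindableCache, maxSize, the compiler function) are hypothetical, for illustration only:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch of the proposed cache: key = generated Java source,
// value = the compiled object (a Bindable in Calcite). An access-ordered
// LinkedHashMap with removeEldestEntry gives simple LRU eviction.
class BindableCache<V> {
  private final Map<String, V> cache;

  BindableCache(final int maxSize) {
    // accessOrder=true turns the LinkedHashMap into an LRU structure.
    this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxSize;  // evict once the bound is exceeded
      }
    };
  }

  // Compile (and load a class) only on a cache miss; hits skip the
  // expensive code generation/class loading path entirely.
  synchronized V get(String javaCode, Function<String, V> compiler) {
    return cache.computeIfAbsent(javaCode, compiler);
  }

  synchronized int size() {
    return cache.size();
  }
}
```

With such a structure, the compilation cost is paid once per distinct generated source while the memory footprint stays bounded by maxSize, which matches the heap-space mitigation mentioned in drawback 2 above.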
*Conclusion*
Overall, I am rather positive about the idea of the cache for the various
reasons outlined above, so I will be happy to discuss it further. Let me know
what you think!
> Reduce code generation and class loading overhead when executing queries in
> the EnumerableConvention
> ----------------------------------------------------------------------------------------------------
>
> Key: CALCITE-2703
> URL: https://issues.apache.org/jira/browse/CALCITE-2703
> Project: Calcite
> Issue Type: Improvement
> Components: core
> Affects Versions: 1.17.0
> Reporter: Stamatis Zampetakis
> Assignee: Julian Hyde
> Priority: Major
> Labels: performance
> Fix For: 1.18.0
>
>
> The queries using Calcite's EnumerableConvention always end up generating new
> Java classes at runtime (using Janino) that are then instantiated using
> reflection. This combination of class generation and class loading introduces
> a big overhead in query response time.
> In a quick profiling of our company's internal test suite, consisting of 4000
> tests with roughly 430000 SQL queries passing through Calcite, we observed
> that a big amount of time is spent on code generation and class loading,
> making the EnumerableInterpretable#toBindable method a performance
> bottleneck.
> Among the 430000 SQL queries there are many duplicates, which lead to the
> generation of exactly the same Java code. Introducing a small cache at the
> level of the EnumerableInterpretable class could avoid generating
> and loading the same code over and over again.
> A simple implementation based on Guava improved the overall execution time of
> the aforementioned test suite by more than 50%.
>
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)