[https://issues.apache.org/jira/browse/CALCITE-3873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086507#comment-17086507]

neoremind commented on CALCITE-3873:
------------------------------------

The following conclusion is copied from the GitHub PR.

I have added a new benchmark test case as a baseline that disables Guava caching. The result is shown below (code is in [my branch|https://github.com/neoremind/calcite/blob/refvisit_benchmark/ubenchmark/src/jmh/java/org/apache/calcite/benchmarks/ReflectiveVisitDispatcherTest2.java]).

{{Benchmark                                                           (guavaCacheMaxSize)  Mode  Cnt    Score    Error  Units
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                       0  avgt    5  116.446 ±  8.957  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingGuavaCache                     128  avgt    5  133.678 ±  8.251  ns/op
ReflectiveVisitDispatcherTest102.testGlobalCachingUsingHashMap                          0  avgt    5   73.213 ±  7.614  ns/op
ReflectiveVisitDispatcherTest102.testInstanceCachingWithReflection                      0  avgt    5  227.130 ±  9.033  ns/op}}

The column headers are the operations performed by each solution.
|| ||lookup method||get by key from map||put key value to map||time cost||
|{{testGlobalCachingUsingGuavaCache with guavaCacheMaxSize = 0}} (cache disabled, method is looked up on every call)|Y| | |116.446 ns/op|
|{{testGlobalCachingUsingGuavaCache with guavaCacheMaxSize = 128}} (global cache with Guava cache)| |Y| |133.678 ns/op|
|{{testGlobalCachingUsingHashMap}} (global cache with JDK ConcurrentHashMap)| |Y| |73.213 ns/op|
|{{testInstanceCachingWithReflection}} (original implementation that caches per instance)|Y|Y|Y|227.130 ns/op|

The conclusions we can draw:
 # Guava cache overhead is a little too big compared to JDK ConcurrentHashMap. But JDK ConcurrentHashMap is not size-limited, so in the previous discussion we did not choose it.
 # Guava cache overhead is even bigger than looking up the method on each call. This is what surprised me.
 # The original implementation is the slowest one because it has to get from the map, look up the method, then put the method into the map; it combines all the operations. The solution works well within the same instance, but every Calcite invocation has to sacrifice performance to look up the method again, and that is what the PR tries to improve.

The current master version does improve performance and balances memory usage, but it is slower than 1) caching with a JDK ConcurrentHashMap (which is not size-limited), and 2) looking up the method directly on each call without any map operations.
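For readers who want to see what the global-cache approach looks like, here is a minimal sketch of the ConcurrentHashMap variant benchmarked as {{testGlobalCachingUsingHashMap}}. The class and method names are hypothetical, not Calcite's actual implementation.

```java
import java.lang.reflect.Method;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch of a process-wide (global) method cache shared by all dispatcher
// instances and threads. Names are illustrative only.
public class GlobalMethodCache {
  // Unbounded map: this is why ConcurrentHashMap was rejected in the earlier
  // discussion despite being the fastest variant in the benchmark.
  private static final ConcurrentMap<List<Object>, Method> CACHE =
      new ConcurrentHashMap<>();

  public static Method lookupVisitMethod(
      Class<?> visitorClass, Class<?> visiteeClass, String methodName) {
    List<Object> key = Arrays.asList(visitorClass, visiteeClass, methodName);
    // computeIfAbsent: a single map get on the hot path; the reflective
    // lookup runs only once per key for the lifetime of the process.
    return CACHE.computeIfAbsent(key, k -> {
      try {
        return visitorClass.getMethod(methodName, visiteeClass);
      } catch (NoSuchMethodException e) {
        throw new IllegalStateException(e);
      }
    });
  }
}
```

Callers would then invoke the cached {{Method}} via {{method.invoke(visitor, visitee)}}; subsequent lookups for the same key hit the shared map instead of reflection.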

Also, I have tested {{MethodHandle}} and {{LambdaMetafactory}}.

The result is as below.

{{Benchmark                                                                                Mode  Cnt      Score       Error  Units
ReflectiveVisitDispatcherTest.testGlobalCaching                                          avgt    3    137.270 ±    32.275  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactory                       avgt    3  92459.660 ± 56777.698  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithLambdaFactoryThreadLocalInitialize  avgt    3    199.687 ±    24.755  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandle                        avgt    3   2421.131 ±   633.756  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithMethodHandleThreadLocalInitialize   avgt    3    218.616 ±    41.754  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflection                          avgt    3    224.603 ±    22.770  ns/op
ReflectiveVisitDispatcherTest.testInstanceCachingWithReflectionThreadLocalInitialize     avgt    3    218.406 ±     9.620  ns/op}}

From the result, it seems the pre-work is time consuming for both {{MethodHandle}} and {{LambdaMetafactory}} (which is built upon MethodHandle). To eliminate the initialization effect, I used a ThreadLocal to initialize once; the result verifies that, in terms of time cost, LambdaMetafactory < Reflection <= MethodHandle, but the gap is very small.
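For reference, the one-time setup that dominates {{testInstanceCachingWithLambdaFactory}} looks roughly like the sketch below. It targets {{String#length}} rather than any Calcite visit method, so it is a standalone illustration of the technique, not the benchmark code itself.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.LambdaMetafactory;
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.util.function.Function;

public class LambdaFactorySketch {
  public static Function<String, Integer> lengthFunction() throws Throwable {
    MethodHandles.Lookup lookup = MethodHandles.lookup();
    // Expensive pre-work: resolving the handle and spinning the lambda class.
    // This is the cost the ThreadLocal-initialized variants amortize away.
    MethodHandle impl = lookup.findVirtual(
        String.class, "length", MethodType.methodType(int.class));
    CallSite site = LambdaMetafactory.metafactory(
        lookup,
        "apply",                                             // interface method to implement
        MethodType.methodType(Function.class),               // factory: () -> Function
        MethodType.methodType(Object.class, Object.class),   // erased Function.apply signature
        impl,                                                // the method to call
        MethodType.methodType(Integer.class, String.class)); // instantiated signature
    @SuppressWarnings("unchecked")
    Function<String, Integer> fn =
        (Function<String, Integer>) site.getTarget().invoke();
    return fn;
  }
}
```

Once built, calling {{fn.apply(s)}} costs about the same as a direct call, which is consistent with the ThreadLocal-initialized LambdaMetafactory variant being the fastest of the instance-caching benchmarks above.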

With that, it seems a good option to stick with the current version. I hope the analysis and work can help people who are interested in this.

Many thanks to Vladimir, Haisheng, Danny, Chunwei, and Stamatis for taking the time to kindly review my code :)

 

> Use global caching for ReflectiveVisitDispatcher implementation
> ---------------------------------------------------------------
>
>                 Key: CALCITE-3873
>                 URL: https://issues.apache.org/jira/browse/CALCITE-3873
>             Project: Calcite
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 1.22.0
>            Reporter: neoremind
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: ReflectVisitorDispatcherTest.java, jmh_result.txt, 
> pic1.svg, pic2.svg
>
>          Time Spent: 10h
>  Remaining Estimate: 0h
>
> Out of curiosity, I used a flame graph to profile a simple query. The code 
> snippet looks like below.
> {code:java}
>     String sql = "select empno, gender, name from EMPS where name = 'John'";
>     Connection connection = null;
>     Statement statement = null;
>     try {
>       Properties info = new Properties();
>       info.put("model", jsonPath("smart"));
>       connection = DriverManager.getConnection("jdbc:calcite:", info);      
>       String x = null;
>       for (int i = 0; i < 50000; i++) {
>         statement = connection.createStatement();
>         final ResultSet resultSet =
>             statement.executeQuery(
>                 sql);
>         while (resultSet.next()) {
>           x = resultSet.getInt(1)
>               + resultSet.getString(2)
>               + resultSet.getString(3);
>         }      
>       }
>     } catch (SQLException e) {
>       e.printStackTrace();
>     } finally {
>       close(connection, statement);
>     }
> {code}
>  
> I attach the generated flame graph [^pic1.svg]
> {code:java}
> 3% on sql2rel
> 9% on query optimizing,
> 62% of the time is spent on code gen and implementation,
> 20% on result set iterating and checking,
> … 
> {code}
> Hope this graph is informative. Since I only started learning Calcite 
> recently, I cannot tell where to start tuning, but one tiny point in the 
> graph caught my attention: there are many reflection invocations in 
> _Prepare#trimUnusedFields_. So I spent some time trying to mitigate that 
> small overhead.
> I optimized _ReflectiveVisitDispatcher_ by introducing a global _Guava_ cache 
> with limited size to cache methods, and I also added full unit tests for 
> _ReflectUtil_.
> I counted the references to the methods _ReflectUtil#createMethodDispatcher_ 
> and _ReflectUtil#createDispatcher_ (see below). There are 68 possible 
> invocations in total, so the cache size is bounded; by caching all the 
> methods for the lifecycle of the process, we can eliminate the overhead of 
> reflective method lookup.
> {code:java}
> org.apache.calcite.rel.rel2sql.RelToSqlConverter: 18 possible invocations.
> org.apache.calcite.sql2rel.RelDecorrelator: 15 possible invocations.
> org.apache.calcite.sql2rel.RelFieldTrimmer: 11 possible invocations.
> org.apache.calcite.sql2rel.RelStructuredTypeFlattener.RewriteRelVisitor: 22 possible invocations.
> org.apache.calcite.interpreter.Interpreter.CompilerImpl: 2 possible invocations.
> {code}
> Before introducing the global caching, the cache was shared per 
> _ReflectiveVisitDispatcher_ instance; now different 
> _ReflectiveVisitDispatcher_ instances in different threads are able to reuse 
> the cached methods.
> See [^pic2.svg]: after tuning, _trimUnusedFields_ only takes 0.64% of the 
> sampling time, compared with 1.38% previously. I think this will help in a 
> lot more places.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
