[GitHub] spark issue #19082: [SPARK-21870][SQL] Split aggregation code into small fun...

kiszk Tue, 10 Oct 2017 01:25:00 -0700

Github user kiszk commented on the issue:

    https://github.com/apache/spark/pull/19082
  
    Let me summarize recent interesting PRs for code generation that deal with 
the JVM bytecode size limit for JIT compilation. These PRs encourage JIT 
compilation of more methods, since most JIT compilers give up compiling a 
method above a certain size (e.g. 8000 bytes of bytecode in the HotSpot 
compiler). The PRs fall into two categories.
    1. Limit the total JVM bytecode size of the generated Java method (#18810, 
#19083).
    2. Generate Java methods of smaller size (#18931, #19082).
    
    I think that the two categories are complementary. I like both efforts.
    
    In category 1, the idea is to disable whole-stage codegen for a large 
method (i.e. more than 8000 bytes of JVM bytecode) that will not be 
JIT-compiled.  
    #18810 tries to **estimate whether the JVM bytecode size** is less than 
8000 by using the number of lines in a method. The line threshold is 2667. If 
the estimated bytecode size is more than 8000, whole-stage codegen is 
disabled. This threshold worked well for most programs. However, as @maropu 
summarized 
[here](https://github.com/apache/spark/pull/19082#issuecomment-335336076), it 
did not work for some programs (e.g. 
[q66](https://github.com/apache/spark/pull/18810#issuecomment-323620029)).  
    Then, #19083 **checks the actual JVM bytecode size** by using the compiled 
JVM bytecode. This PR can precisely identify the cases where JIT compilation 
would not be performed.  
    Category 1 cannot encourage JIT compilation of the whole-stage method; it 
only avoids running a method that would not be compiled.
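    
    The two checks in category 1 can be contrasted with a minimal sketch. This 
is not the code from #18810 or #19083; the class, method, and constant names 
are hypothetical, and the exact comparisons (`<=` versus `<`) are assumptions. 
Only the numbers 8000 and 2667 come from the summary above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: neither the names nor the comparisons are taken from Spark.
class CodegenSizeCheck {
    // Bytecode size above which the summary says HotSpot will not JIT-compile a method.
    static final int HUGE_METHOD_LIMIT = 8000;
    // Line-count threshold quoted for #18810.
    static final int MAX_LINES = 2667;

    // Estimate-style check: uses the number of source lines as a proxy for bytecode size.
    static boolean fitsByLineCount(String methodSource) {
        int lines = methodSource.split("\n", -1).length;
        return lines <= MAX_LINES;
    }

    // Exact-style check: takes the per-method bytecode sizes measured after compilation.
    static boolean fitsByBytecodeSize(List<Integer> methodBytecodeSizes) {
        for (int size : methodBytecodeSizes) {
            if (size > HUGE_METHOD_LIMIT) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A short method passes the line-count proxy regardless of how dense its lines are.
        System.out.println(fitsByLineCount("int a = 1;\nint b = 2;"));
        // One oversized method is enough to reject the generated class.
        System.out.println(fitsByBytecodeSize(Arrays.asList(1200, 9500)));
    }
}
```

    The line count is only a proxy, so a method with few but dense lines can 
pass the first check and still exceed the limit, which is the kind of gap the 
second check closes.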
    
    In category 2, code generation in each part tries to produce smaller 
methods (i.e. under 8000 bytes of JVM bytecode per method) so that JIT 
compilation applies to more methods, or to avoid JVM bytecode generation 
failure (beyond 64KB per method). This will not 
dis  
    One of these activities is to use `CodeGenerator.splitExpressions()`.  
    #18931 splits a set of `consume()` functions in a physical plan [into 
multiple 
methods](https://github.com/apache/spark/pull/18931#issuecomment-325907224) 
instead of embedding them into one method (e.g. `processNext()`).  
    #19082 splits operations in aggregation into multiple methods instead of 
embedding into one method (e.g. `agg_doAggregateWithoutKey()`).  
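
    The general idea of splitting generated code into helper methods can be 
sketched as follows. This is an illustration only, not Spark's 
`splitExpressions()` implementation: the class `SplitSketch`, its methods, and 
the character-based threshold `maxChars` are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting generated statements into small helper methods.
class SplitSketch {
    // Groups statements into blocks so that no block grows far beyond maxChars.
    static List<String> splitIntoBlocks(List<String> statements, int maxChars) {
        List<String> blocks = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String stmt : statements) {
            // Start a new block when adding this statement would cross the threshold.
            if (current.length() > 0 && current.length() + stmt.length() > maxChars) {
                blocks.add(current.toString());
                current.setLength(0);
            }
            current.append("  ").append(stmt).append("\n");
        }
        if (current.length() > 0) {
            blocks.add(current.toString());
        }
        return blocks;
    }

    // Emits one helper method per block plus a small driver that calls them in order.
    static String generate(String funcName, List<String> statements, int maxChars) {
        List<String> blocks = splitIntoBlocks(statements, maxChars);
        StringBuilder helpers = new StringBuilder();
        StringBuilder calls = new StringBuilder();
        for (int i = 0; i < blocks.size(); i++) {
            String helperName = funcName + "_" + i;
            helpers.append("private void ").append(helperName).append("() {\n")
                   .append(blocks.get(i)).append("}\n");
            calls.append("  ").append(helperName).append("();\n");
        }
        return helpers + "private void " + funcName + "() {\n" + calls + "}\n";
    }

    public static void main(String[] args) {
        List<String> stmts = new ArrayList<>();
        stmts.add("a = 1;");
        stmts.add("b = 2;");
        stmts.add("c = 3;");
        System.out.println(generate("agg", stmts, 20));
    }
}
```

    The sketch assumes the statements share no local variables. A real 
generator also has to carry state across the helpers, for example through 
fields or method arguments.
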
    Even if these PRs create smaller methods, *the JIT compiler can make the 
compilation unit larger* by applying method inlining. A larger compilation 
unit lets the compiler apply more optimizations. For example, 
in HotSpot, a method whose JVM bytecode size is up to [325 (frequently 
executed)](http://isuru-perera.blogspot.jp/2014/12/java-jit-compilation-inlining-jitwatch.html)
 or [35 
(normal)](http://www.oracle.com/technetwork/java/vmoptions-jsp-140102.html) 
will be inlined. Thus, I think that we will rarely see a performance regression.  
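
    The two inlining limits can be written down as a small check. This is a 
simplified sketch: the class and method names are hypothetical, the constants 
mirror the values quoted above (which correspond, as far as I know, to the 
HotSpot flags `-XX:MaxInlineSize` and `-XX:FreqInlineSize`), and real inlining 
decisions depend on more than bytecode size.

```java
// Hypothetical sketch: models only the size limits, not HotSpot's full inlining policy.
class InlineSketch {
    // Size limit quoted above for a normally executed method.
    static final int MAX_INLINE_SIZE = 35;
    // Size limit quoted above for a frequently executed method.
    static final int FREQ_INLINE_SIZE = 325;

    // Returns true when a callee of the given bytecode size is within the applicable limit.
    static boolean withinInlineLimit(int bytecodeSize, boolean frequentlyExecuted) {
        int limit = frequentlyExecuted ? FREQ_INLINE_SIZE : MAX_INLINE_SIZE;
        return bytecodeSize <= limit;
    }

    public static void main(String[] args) {
        // A 200-byte helper is within the limit only if it is frequently executed.
        System.out.println(withinInlineLimit(200, true));
        System.out.println(withinInlineLimit(200, false));
    }
}
```

    Under this model, helper methods produced by splitting are inlined back 
into their caller only when they are small or frequently executed.
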
    Category 2 tries to encourage JIT compilation of the whole-stage method 
by making its methods smaller.
    
    @gatorsmile, @viirya, @maropu, @rednaxelafx what do you think? Do you have 
any comments or questions?



