[GitHub] spark pull request #16648: [SPARK-18016][SQL][CATALYST] Code Generation: Con...

bdrillard Thu, 19 Jan 2017 08:12:17 -0800

GitHub user bdrillard opened a pull request:

    https://github.com/apache/spark/pull/16648


    [SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit

    [class_splitting] increasing stack size for Catalyst tests
    
    ## What changes were proposed in this pull request?
    
    Supports code generation for large structures that would previously
    trigger a Constant Pool limit exception, as noted in
    [SPARK-18016](SPARK-18016). With this fix, when the volume of generated
    code for a class would exceed 1600k bytes, a new private nested class is
    declared, and any new functions that would have been inlined to the outer
    class through an `addNewFunction` call are inlined to the new nested class
    instead. `addNewFunction` now also returns the name of the registered
    function (class-qualified if it was inlined to a nested class), so that the
    caller can invoke the function even when it lives in a different class.
    Additional nested classes are generated each time the threshold is reached
    again. These nested classes are declared and instantiated at the bottom of
    the generated outer class.
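    
    The splitting rule can be illustrated with a toy model. This is only a
    sketch under stated assumptions, not the patch's actual implementation:
    the `SplittingContext` class, its `Target` helper, the
    `nestedClassInstanceN` naming, and the use of string length as a stand-in
    for code volume are all hypothetical. Only the idea that `addNewFunction`
    returns a possibly qualified name comes from the description above.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;

    /** Toy model of the splitting rule; NOT Spark's real code generation context. */
    final class SplittingContext {
        /** One destination for functions: the outer class or a nested class. */
        private static final class Target {
            final String instanceName; // empty string means the outer class
            final List<String> functions = new ArrayList<>();
            int codeSize = 0;          // assumption: measured in characters

            Target(String instanceName) {
                this.instanceName = instanceName;
            }
        }

        private final int threshold;
        private final List<Target> targets = new ArrayList<>();

        SplittingContext(int threshold) {
            this.threshold = threshold;
            targets.add(new Target("")); // functions go to the outer class first
        }

        /**
         * Registers a function and returns the name a caller should use:
         * the bare name for the outer class, or an instance-qualified name
         * when the function was placed in a nested class.
         */
        String addNewFunction(String funcName, String funcCode) {
            Target current = targets.get(targets.size() - 1);
            if (current.codeSize + funcCode.length() > threshold) {
                // Adding this function would exceed the threshold,
                // so start a new nested class and place it there.
                current = new Target("nestedClassInstance" + targets.size());
                targets.add(current);
            }
            current.functions.add(funcCode);
            current.codeSize += funcCode.length();
            return current.instanceName.isEmpty()
                ? funcName
                : current.instanceName + "." + funcName;
        }

        /** Number of nested classes opened so far. */
        int nestedClassCount() {
            return targets.size() - 1;
        }
    }
    ```
    
    In this model a caller never needs to know where a function ended up: it
    splices the returned name into the call site it generates.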
    
    Private nested classes have access to the outer class's global state, but
    their functions and local state do not count towards the outer class's
    Constant Pool. They can also be instantiated inside the same outer class,
    with no need to declare additional classes and handle the dependency
    injection. For these reasons they seem to be a good candidate to solve
    this particular issue.
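    
    The property relied on here is plain Java behavior: a private inner
    class can read and write the private fields of its enclosing instance,
    yet it is compiled to its own class file with its own constant pool. A
    minimal sketch follows; `OuterProjection` and `apply_0` are hypothetical
    names loosely modeled on the generated code, not taken from it.
    
    ```java
    /** Minimal illustration of an inner class sharing the outer class's state. */
    final class OuterProjection {
        // Global "mutable" state is declared on the outer class only.
        private int argValue = 0;

        // The nested class is instantiated inside the outer class itself,
        // so nothing has to be passed in or wired up from outside.
        private final NestedClass nestedClassInstance = new NestedClass();

        int apply(int input) {
            // The outer class delegates work to the nested instance.
            nestedClassInstance.apply_0(input);
            return argValue;
        }

        // A non-static inner class keeps an implicit reference to the
        // enclosing instance, which is what gives it access to argValue.
        private class NestedClass {
            private void apply_0(int input) {
                argValue = input * 2;
            }
        }
    }
    ```
    
    Calling `new OuterProjection().apply(21)` goes through the nested
    instance and reads the result back from the outer field.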
    
    One key quality of this patch is that the common path for code generation
    remains unaffected. The 1600k threshold necessary to split off a nested
    class should only be exceeded in scenarios where the schema is extremely
    large. Generated code for most use cases will still be inlined entirely to
    the single outer class.
    
    This patch splits code (only code registered through the `addNewFunction`
    call) among the outer class and nested classes, as shown below:
    
    ```
        /* 6 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
        /* 7 */      // Global "mutable" state
        /* 8 */      private Object[] references;
        /* 9 */      private int argValue;
         ...
     /* 18863 */     private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter578;
     /* 18864 */     // Code inlined to the outer class
     /* 18865 */     public SpecificUnsafeProjection(Object[] references) {
     /* 18866 */         this.references = references;
     /* 18867 */         nestedClassInstance.init_0();
     /* 18868 */         nestedClassInstance.init_1();
         ...              
     /* 70344 */     public UnsafeRow apply(InternalRow i) {
     /* 70345 */         nestedClassInstance5.apply589_0(i);
         ...
     /* 70398 */         return result;
     /* 70399 */     }
     /* 70400 */     // Instantiation of nested classes
     /* 70401 */     private NestedClass5 nestedClassInstance5 = new NestedClass5();
     /* 70402 */     private NestedClass4 nestedClassInstance4 = new NestedClass4();
         ...
     /* 70406 */     private NestedClass nestedClassInstance = new NestedClass();
     /* 70407 */     // Declaration of a nested class
     /* 70408 */     private class NestedClass5 {
     /* 70409 */         // Code inlined to a nested class
     /* 70410 */         private void apply519_0(InternalRow fieldName1105) {
         ...
    /* 340829 */     } // end of last nested class
    /* 348030 */ } // end of the outer class
    ```
    
    ## How was this patch tested?
    
    Added a new test to `DataFrameComplexTypeSuite` that converts a large
    structure to a dataset. Ran full regression tests across every module.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/bdrillard/spark class_splitting

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16648.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16648
    
----
commit 85e81ed42a8f109f6e66167adadd10b096bbd678
Author: ALeksander Eskilson <[email protected]>
Date:   2017-01-11T19:33:26Z

    adding initial class splitting
    
    [class_splitting] increasing stack size for Catalyst tests

----


