GitHub user mgaido91 opened a pull request:

    https://github.com/apache/spark/pull/19480

    [SPARK-22226] splitExpression can create too many method calls in the outer 
class

    ## What changes were proposed in this pull request?
    
    SPARK-18016 introduced {{NestedClass}} to avoid that the many methods 
generated by {{splitExpressions}} contribute to the outer class' constant pool, 
making it growing too much. Unfortunately, despite their definition is stored 
in the {{NestedClass}}, they all are invoked in the outer class and for each 
method invocation, there are two entries added to the constant pool: a 
{{Methodref}} and a {{Utf8}} entry (you can easily check this compiling a 
simple sample class with {{janinoc}} and looking at its Constant Pool). This 
limits the scalability of the solution with very large methods which are split 
in a lot of small ones. This means that currently we are generating classes 
like this one:
    
    ```
    class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
    ...
      public UnsafeRow apply(InternalRow i) {
         rowWriter.zeroOutNullBytes();
         apply_0(i);
         apply_1(i);
    ...
        nestedClassInstance.apply_862(i);
        nestedClassInstance.apply_863(i);
    ...
      }
    ...
      private class NestedClass {
        private void apply_862(InternalRow i) { ... }
        private void apply_863(InternalRow i) { ... }
    ...
      }
    }
    ```
    
    This PR reduce the Constant Pool size of the outer class by adding a new 
method to each nested class: in this method we invoke all the small methods 
generated by {{splitExpression}} in that nested class. In this way, in the 
outer class there is only one method invocation per nested class, reducing by 
orders of magnitude the entries in its constant pool because of method 
invocations. This means that after the patch the generated code becomes:
    
    ```
    class SpecificUnsafeProjection extends 
org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
    ...
      public UnsafeRow apply(InternalRow i) {
         rowWriter.zeroOutNullBytes();
         apply_0(i);
         apply_1(i);
         ...
         nestedClassInstance.apply(i);
         nestedClassInstance1.apply(i);
         ...
      }
    ...
      private class NestedClass {
        private void apply_862(InternalRow i) { ... }
        private void apply_863(InternalRow i) { ... }
    ...
        private void apply(InternalRow i) {
          apply_862(i);
          apply_863(i);
          ...
        }
      }
    }
    ```
    
    ## How was this patch tested?
    
    Added UT and existing UTs


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22226

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19480.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19480
    
----
commit c2cc295fd85a7a0e42debc954311ff74f5b52962
Author: Marco Gaido <mga...@hortonworks.com>
Date:   2017-10-06T15:33:08Z

    add a method for each inner class and use it in the superclass

commit d3a5b872e5446e1205a91498d977af6e6259e58b
Author: Marco Gaido <mga...@hortonworks.com>
Date:   2017-10-09T14:21:45Z

    Adding UT and modifying class size limit

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to