GitHub user mgaido91 opened a pull request: https://github.com/apache/spark/pull/19480
[SPARK-22226] splitExpression can create too many method calls in the outer class ## What changes were proposed in this pull request? SPARK-18016 introduced {{NestedClass}} to avoid that the many methods generated by {{splitExpressions}} contribute to the outer class' constant pool, making it growing too much. Unfortunately, despite their definition is stored in the {{NestedClass}}, they all are invoked in the outer class and for each method invocation, there are two entries added to the constant pool: a {{Methodref}} and a {{Utf8}} entry (you can easily check this compiling a simple sample class with {{janinoc}} and looking at its Constant Pool). This limits the scalability of the solution with very large methods which are split in a lot of small ones. This means that currently we are generating classes like this one: ``` class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... public UnsafeRow apply(InternalRow i) { rowWriter.zeroOutNullBytes(); apply_0(i); apply_1(i); ... nestedClassInstance.apply_862(i); nestedClassInstance.apply_863(i); ... } ... private class NestedClass { private void apply_862(InternalRow i) { ... } private void apply_863(InternalRow i) { ... } ... } } ``` This PR reduce the Constant Pool size of the outer class by adding a new method to each nested class: in this method we invoke all the small methods generated by {{splitExpression}} in that nested class. In this way, in the outer class there is only one method invocation per nested class, reducing by orders of magnitude the entries in its constant pool because of method invocations. This means that after the patch the generated code becomes: ``` class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection { ... public UnsafeRow apply(InternalRow i) { rowWriter.zeroOutNullBytes(); apply_0(i); apply_1(i); ... nestedClassInstance.apply(i); nestedClassInstance1.apply(i); ... } ... private class NestedClass { private void apply_862(InternalRow i) { ... } private void apply_863(InternalRow i) { ... } ... private void apply(InternalRow i) { apply_862(i); apply_863(i); ... } } } ``` ## How was this patch tested? Added UT and existing UTs You can merge this pull request into a Git repository by running: $ git pull https://github.com/mgaido91/spark SPARK-22226 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/19480.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #19480 ---- commit c2cc295fd85a7a0e42debc954311ff74f5b52962 Author: Marco Gaido <mga...@hortonworks.com> Date: 2017-10-06T15:33:08Z add a method for each inner class and use it in the superclass commit d3a5b872e5446e1205a91498d977af6e6259e58b Author: Marco Gaido <mga...@hortonworks.com> Date: 2017-10-09T14:21:45Z Adding UT and modifying class size limit ---- --- --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org