GitHub user mgaido91 opened a pull request:
https://github.com/apache/spark/pull/19480
[SPARK-22226] splitExpression can create too many method calls in the outer
class
## What changes were proposed in this pull request?
SPARK-18016 introduced {{NestedClass}} so that the many methods generated by
{{splitExpressions}} would not contribute to the outer class' constant pool
and make it grow too large. Unfortunately, although their definitions are
stored in the {{NestedClass}}, they are all invoked from the outer class, and
each method invocation adds two entries to the constant pool: a {{Methodref}}
and a {{Utf8}} entry (you can easily check this by compiling a simple sample
class with {{janinoc}} and looking at its constant pool). This limits the
scalability of the solution for very large methods that are split into many
small ones. Currently we generate classes like this one:
```
class SpecificUnsafeProjection extends
    org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
  ...
  public UnsafeRow apply(InternalRow i) {
    rowWriter.zeroOutNullBytes();
    apply_0(i);
    apply_1(i);
    ...
    nestedClassInstance.apply_862(i);
    nestedClassInstance.apply_863(i);
    ...
  }
  ...
  private class NestedClass {
    private void apply_862(InternalRow i) { ... }
    private void apply_863(InternalRow i) { ... }
    ...
  }
}
```
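One way to observe the constant pool growth described above, without reading a full disassembly, is to look at the {{constant_pool_count}} field in the class file header. The following is a minimal sketch, not part of this PR: the class name {{ConstantPoolCount}} and its methods are made up for illustration. It relies only on the class file layout (a 4-byte magic number, two 2-byte version fields, then the 2-byte {{constant_pool_count}}). Since that field is an unsigned 16-bit value, it cannot exceed 65535.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative helper (hypothetical, not Spark code): reads the
// constant_pool_count field from the header of a compiled class file.
public class ConstantPoolCount {

    // Returns constant_pool_count, which is the number of pool entries plus one.
    public static int constantPoolCount(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort();                 // minor_version, ignored
        buf.getShort();                 // major_version, ignored
        return buf.getShort() & 0xFFFF; // constant_pool_count as unsigned 16-bit
    }

    public static void main(String[] args) throws IOException {
        // Assumption: this class is compiled into the default package, so its
        // own bytecode is available on the classpath under this resource name.
        try (InputStream is =
                 ConstantPoolCount.class.getResourceAsStream("ConstantPoolCount.class")) {
            if (is == null) {
                throw new IOException("class file not found on classpath");
            }
            // InputStream.readAllBytes requires Java 9 or later.
            System.out.println(constantPoolCount(is.readAllBytes()));
        }
    }
}
```

Compiling a sample class with and without a batch of extra method invocations and comparing the two printed values gives a quick view of how much each invocation costs.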
This PR reduces the constant pool size of the outer class by adding a new
method to each nested class; this method invokes all the small methods
generated by {{splitExpressions}} in that nested class. As a result, the outer
class contains only one method invocation per nested class, which reduces the
constant pool entries caused by method invocations by orders of magnitude.
After the patch, the generated code becomes:
```
class SpecificUnsafeProjection extends
    org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
  ...
  public UnsafeRow apply(InternalRow i) {
    rowWriter.zeroOutNullBytes();
    apply_0(i);
    apply_1(i);
    ...
    nestedClassInstance.apply(i);
    nestedClassInstance1.apply(i);
    ...
  }
  ...
  private class NestedClass {
    private void apply_862(InternalRow i) { ... }
    private void apply_863(InternalRow i) { ... }
    ...
    private void apply(InternalRow i) {
      apply_862(i);
      apply_863(i);
      ...
    }
  }
}
```
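The grouping idea can be sketched in isolation as a small string-based generator. This is only an illustration under stated assumptions, not the code in this PR: {{NestedCallGrouping}}, {{aggregatingMethod}} and {{outerCalls}} are hypothetical names, and the method name listed for the second nested class is invented. The point it shows is that the outer class emits one call per nested class instance, while each nested class gets one extra method that calls its own split methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the grouping technique; not Spark's code generator.
public class NestedCallGrouping {

    // Source of the aggregating method added to one nested class:
    // it calls every split method defined in that class.
    public static String aggregatingMethod(String name, List<String> splitMethods) {
        StringBuilder sb = new StringBuilder();
        sb.append("private void ").append(name).append("(InternalRow i) {\n");
        for (String method : splitMethods) {
            sb.append("  ").append(method).append("(i);\n");
        }
        sb.append("}\n");
        return sb.toString();
    }

    // Calls emitted in the outer class: exactly one per nested class instance,
    // regardless of how many split methods that nested class holds.
    public static List<String> outerCalls(Map<String, List<String>> methodsByInstance,
                                          String name) {
        List<String> calls = new ArrayList<>();
        for (String instance : methodsByInstance.keySet()) {
            calls.add(instance + "." + name + "(i);");
        }
        return calls;
    }

    public static void main(String[] args) {
        // Insertion-ordered map from nested class instance to its split methods.
        Map<String, List<String>> byInstance = new LinkedHashMap<>();
        byInstance.put("nestedClassInstance", Arrays.asList("apply_862", "apply_863"));
        // "apply_example" is an invented method name used only for this sketch.
        byInstance.put("nestedClassInstance1", Arrays.asList("apply_example"));

        for (String call : outerCalls(byInstance, "apply")) {
            System.out.println(call);
        }
        System.out.print(aggregatingMethod("apply", byInstance.get("nestedClassInstance")));
    }
}
```

With this shape, the number of invocations referenced from the outer class depends on the number of nested classes rather than on the number of split methods.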
## How was this patch tested?
Added a new UT; the change is also covered by existing UTs.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/mgaido91/spark SPARK-22226
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19480.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19480
----
commit c2cc295fd85a7a0e42debc954311ff74f5b52962
Author: Marco Gaido <[email protected]>
Date: 2017-10-06T15:33:08Z
add a method for each inner class and use it in the superclass
commit d3a5b872e5446e1205a91498d977af6e6259e58b
Author: Marco Gaido <[email protected]>
Date: 2017-10-09T14:21:45Z
Adding UT and modifying class size limit
----