GitHub user bdrillard opened a pull request:
https://github.com/apache/spark/pull/16648
[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit
[class_splitting] increasing stack size for Catalyst tests
## What changes were proposed in this pull request?
Supports code generation for large structures that would previously trigger
a Constant Pool limit exception, as noted in
[SPARK-18016](https://issues.apache.org/jira/browse/SPARK-18016). With
this fix, when the volume of generated code for a class would exceed 1600k
bytes, a new private nested class is declared, and any new functions that would
have been inlined to the outer class with an `addNewFunction` call are inlined
to the new nested class instead. `addNewFunction` now returns the
name of the registered function (class-qualified if it was inlined to a
nested class), so that the caller can invoke the function even when it was inlined
to a different class. Additional nested classes are generated each time the
threshold is reached again. These nested classes are instantiated and declared at
the bottom of the generated outer class.
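As a rough illustration of the splitting rule described above, here is a minimal sketch in plain Java. It is not the code in this patch: `SplittingRegistry`, `ClassUnit`, the `nestedClassInstance<N>` naming, and the use of `String.length()` as a stand-in for the code size in bytes are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of size-based class splitting.
// This is NOT Spark's CodegenContext; all names here are illustrative.
class SplittingRegistry {

    // One class in the generated output: the outer class or a nested class.
    private static final class ClassUnit {
        final String instanceName;              // null for the outer class
        final List<String> functions = new ArrayList<>();
        int codeSize = 0;                       // code registered so far, in chars

        ClassUnit(String instanceName) {
            this.instanceName = instanceName;
        }
    }

    private final int threshold;
    private final List<ClassUnit> units = new ArrayList<>();

    SplittingRegistry(int threshold) {
        this.threshold = threshold;
        units.add(new ClassUnit(null));         // functions go to the outer class first
    }

    // Registers a function and returns the name a caller must use to invoke it:
    // the bare name for the outer class, or "<instance>.<name>" for a nested class.
    String addNewFunction(String funcName, String funcCode) {
        ClassUnit current = units.get(units.size() - 1);
        // Start a new nested class when this function would push the current
        // class past the threshold (String.length() approximates the byte size).
        if (current.codeSize > 0 && current.codeSize + funcCode.length() > threshold) {
            current = new ClassUnit("nestedClassInstance" + units.size());
            units.add(current);
        }
        current.functions.add(funcCode);
        current.codeSize += funcCode.length();
        return current.instanceName == null
            ? funcName
            : current.instanceName + "." + funcName;
    }

    // Number of nested classes created so far.
    int nestedClassCount() {
        return units.size() - 1;
    }
}
```

With a threshold of 10, registering two 8-character function bodies under the names `init_0` and `apply_0` returns `init_0` for the first call and `nestedClassInstance1.apply_0` for the second, which is the qualified name the caller would then emit.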
Private nested classes have access to the outer class's global state, but
their functions and local state do not count towards the outer class's
Constant Pool. They can also be instantiated inside the same outer class,
without the need to declare additional classes or handle dependency
injection. For these reasons they seem to be a good candidate to solve this
particular issue.
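The property this relies on can be shown with a small stand-alone Java example. The class and member names below are hypothetical stand-ins, not output of the code generator. The Java compiler emits an inner class as its own class file, and each class file has its own constant pool, so the nested class's methods do not add entries to the outer class's pool.

```java
// Hypothetical stand-in for a generated class; names are illustrative only.
class OuterProjection {
    // Global "mutable" state owned by the outer class
    private int argValue;

    // Instantiation of the nested class as a field of the outer instance
    private final NestedClass nestedClassInstance = new NestedClass();

    // Code kept in the outer class calls through the nested instance
    int apply(int input) {
        nestedClassInstance.apply_0(input);
        return argValue;
    }

    // A private inner (non-static) class can read and write the
    // private fields of the enclosing instance directly.
    private class NestedClass {
        private void apply_0(int input) {
            argValue = input * 2;
        }
    }
}
```

Calling `new OuterProjection().apply(21)` returns 42: the nested method wrote to the outer field without any reference to the outer object being passed in explicitly.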
One key quality of this patch is that the common path for code generation
remains unaffected. The 1600k threshold necessary to split off a nested class
should only be exceeded in scenarios where the schema is extremely large.
Generated code for most use cases will still be inlined entirely to the single
outer class.
This patch splits code (only code registered through the `addNewFunction`
call) among the outer class and nested classes like below:
```
/* 6 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 7 */   // Global "mutable" state
/* 8 */   private Object[] references;
/* 9 */   private int argValue;
...
/* 18863 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter578;
/* 18864 */   // Code inlined to the outer class
/* 18865 */   public SpecificUnsafeProjection(Object[] references) {
/* 18866 */     this.references = references;
/* 18867 */     nestedClassInstance.init_0();
/* 18868 */     nestedClassInstance.init_1();
...
/* 70344 */   public UnsafeRow apply(InternalRow i) {
/* 70345 */     nestedClassInstance5.apply589_0(i);
...
/* 70398 */     return result;
/* 70399 */   }
/* 70400 */   // Instantiation of nested classes
/* 70401 */   private NestedClass5 nestedClassInstance5 = new NestedClass5();
/* 70402 */   private NestedClass4 nestedClassInstance4 = new NestedClass4();
...
/* 70406 */   private NestedClass nestedClassInstance = new NestedClass();
/* 70407 */   // Declaration of a nested class
/* 70408 */   private class NestedClass5 {
/* 70409 */     // Code inlined to a nested class
/* 70410 */     private void apply519_0(InternalRow fieldName1105) {
...
/* 340829 */   } // end of last nested class
/* 348030 */ } // end of the outer class
```
## How was this patch tested?
Added a new test to `DataFrameComplexTypeSuite` that converts a large
structure to a dataset. Ran the full regression tests across every module.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bdrillard/spark class_splitting
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16648.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16648
----
commit 85e81ed42a8f109f6e66167adadd10b096bbd678
Author: ALeksander Eskilson <[email protected]>
Date: 2017-01-11T19:33:26Z
adding initial class splitting
[class_splitting] increasing stack size for Catalyst tests
----