GitHub user bdrillard opened a pull request:
https://github.com/apache/spark/pull/19518
[SPARK-18016][SQL][CATALYST] Code Generation: Constant Pool Limit - State
Compaction
## What changes were proposed in this pull request?
This PR is the part two followup to #18075, meant to address
[SPARK-18016](https://github.com/apache/spark/pull/SPARK-18016), Constant Pool
limit exceptions. Part 1 implemented `NestedClass` code splitting, in which
excess code was split off into nested private sub-classes of the `OuterClass`.
In Part 2 we address excess mutable state, in which the number of inlined
variables declared at the top of the `OuterClass` can also exceed the constant
pool limit.
Here, we modify the `addMutableState` function in the `CodeGenerator` to
check whether the declared state can be compacted into an array and
initialized in a loop, rather than inlined and initialized with its own line
of code. We identify four types of state that can be compacted:
* Primitive state (ints, booleans, etc)
* Object state of like-type without any initial assignment
* Object state of like-type initialized to `null`
* Object state of like-type initialized to the type's base (no-argument)
constructor
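As a rough illustration of the four cases above, the check could look something like the following. This is only a sketch, not the code in this PR: the class `CompactionCheck`, the method `canCompact`, and the string-matching approach are all hypothetical, and it takes the list at face value by treating every primitive as compactable regardless of its initialization code.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, for illustration only; not part of Spark.
public class CompactionCheck {
    private static final Set<String> PRIMITIVES = new HashSet<>(Arrays.asList(
        "boolean", "byte", "short", "char", "int", "long", "float", "double"));

    /**
     * Returns true if state of the given type, name, and initialization code
     * falls into one of the four compactable categories listed above.
     */
    public static boolean canCompact(String javaType, String varName, String initCode) {
        // Case 1: primitive state (assumption: init code is ignored here).
        if (PRIMITIVES.contains(javaType)) {
            return true;
        }
        // Normalize by stripping all whitespace so "x = null;" matches "x=null;".
        String init = initCode == null ? "" : initCode.replaceAll("\\s+", "");
        // Case 2: object state without any initial assignment.
        if (init.isEmpty()) {
            return true;
        }
        // Case 3: object state initialized to null.
        if (init.equals(varName + "=null;")) {
            return true;
        }
        // Case 4: object state initialized with the type's no-argument constructor.
        return init.equals(varName + "=new" + javaType + "();");
    }
}
```

Under these assumptions, state such as `result = new UnsafeRow(3);` is rejected because its constructor takes an argument, so it would stay inlined.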
With mutable state compaction, at the top of the class we generate array
declarations like:
```
private Object[] references;
private UnsafeRow result;
private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder holder;
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter rowWriter;
...
private boolean[] mutableStateArray1 = new boolean[12507];
private InternalRow[] mutableStateArray4 = new InternalRow[5268];
private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[] mutableStateArray5 =
  new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeArrayWriter[7663];
private java.lang.String[] mutableStateArray2 = new java.lang.String[12477];
private int[] mutableStateArray = new int[42509];
private java.lang.Object[] mutableStateArray6 = new java.lang.Object[30];
private boolean[] mutableStateArray3 = new boolean[10536];
```
and these arrays are initialized in loops as:
```
private void init_3485() {
  for (int i = 0; i < mutableStateArray3.length; i++) {
    mutableStateArray3[i] = false;
  }
}
```
For compacted mutable state, `addMutableState` returns an array accessor
value, which is then referenced in the subsequent generated code.
**Note**: some state cannot be easily compacted (at least not without
deeper changes to code generation), because some state variable names are
assumed to exist at the global level during code generation (see
`CatalystToExternalMap` in `Objects` as an example). For this state, we provide
an `inline` hint to the function call, which indicates that the state should be
inlined to the `OuterClass`. Still, the state we can easily compact is enough to
reduce the Constant Pool to a tractable size for the wide/deeply nested schemas
I was able to test against.
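The accessor and `inline` behavior described above might be pictured roughly as follows. This is a simplified sketch rather than the actual `CodeGenerator` change: the class `MutableStateSketch`, its method signatures, and the array-naming scheme are assumptions made for illustration, and initialization code is left out entirely.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of state registration; not Spark's real API.
public class MutableStateSketch {
    // Java type -> name of the array that holds compacted state of that type.
    private final Map<String, String> arrayNames = new LinkedHashMap<>();
    // Java type -> number of slots handed out so far.
    private final Map<String, Integer> slotCounts = new LinkedHashMap<>();
    // Declarations for state that was explicitly kept inline.
    private final List<String> inlinedDeclarations = new ArrayList<>();

    /**
     * Registers one piece of state and returns the expression that later
     * generated code should use to refer to it.
     */
    public String addMutableState(String javaType, String varName, boolean inline) {
        if (inline) {
            // Inline hint: keep the variable as a named field of the outer class.
            inlinedDeclarations.add("private " + javaType + " " + varName + ";");
            return varName;
        }
        String arrayName = arrayNames.get(javaType);
        if (arrayName == null) {
            int n = arrayNames.size();
            arrayName = (n == 0) ? "mutableStateArray" : "mutableStateArray" + n;
            arrayNames.put(javaType, arrayName);
        }
        // merge returns the updated count, so the new slot index is count - 1.
        int index = slotCounts.merge(javaType, 1, Integer::sum) - 1;
        return arrayName + "[" + index + "]";
    }

    /** Emits the field declarations that would sit at the top of the class. */
    public String declarations() {
        StringBuilder sb = new StringBuilder();
        for (String d : inlinedDeclarations) {
            sb.append(d).append('\n');
        }
        for (Map.Entry<String, String> e : arrayNames.entrySet()) {
            String type = e.getKey();
            sb.append("private ").append(type).append("[] ").append(e.getValue())
              .append(" = new ").append(type).append('[')
              .append(slotCounts.get(type)).append("];\n");
        }
        return sb.toString();
    }
}
```

In this sketch, callers must use the returned string (for example `mutableStateArray[1]`) instead of the name they passed in, which is why state whose name is assumed elsewhere in the generated code needs the `inline` hint.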
## How was this patch tested?
Tested against several complex schema types. Also added a test case that
generates 40,000 string columns and creates the `UnsafeProjection`.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/bdrillard/spark state_compaction
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19518.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19518
----
commit 081bc5de6ee55e00ff58c4abddc347f77c29d4aa
Author: ALeksander Eskilson <[email protected]>
Date: 2017-10-17T14:06:12Z
adding state compaction
commit e7046c3d3bb528f18b3183d81e8bc26720a8baf7
Author: ALeksander Eskilson <[email protected]>
Date: 2017-10-17T16:54:54Z
adding inline changes
----