GitHub user rednaxelafx opened a pull request:
https://github.com/apache/spark/pull/18255
[SPARK-21038][SQL] Reduce redundant generated init code in Catalyst codegen
## What changes were proposed in this pull request?
In Java, instance fields are guaranteed to be first initialized to their
corresponding default values (zero values) before the constructor is invoked.
Thus, explicitly code to initialize fields to their zero values is redundant
and should be avoided.
It's usually harmless to have such code in hand-written constructors, but
in the case of mechanically generating code, such code could contribute to a
significant portion of the code size and cause issues.
This ticket is a step in reducing the likelihood of hitting the 64KB
bytecode method size limit in the Java Class files. This PR uses simple
heuristics to filter out redundant code of initializing mutable state to their
default (zero) values in `CodegenContext.addMutableState`, by matching string
patterns. Basically, if the `initCode` in the added mutable state is like:
```
[this.]${variableName} = ${defaultValue};
```
(where `[this.]` indicates it's optional)
Then it'll be replaced by an empty string instead.
This pattern will catch the most common cases. An example of drawn from
production is:
```
/* 262651 */ value24709 = -1L;
/* 262652 */ isNull24710 = false;
/* 262653 */ value24710 = -1L;
/* 262654 */ isNull24718 = false;
/* 262655 */ value24718 = false;
/* 262656 */ isNull24719 = false;
/* 262657 */ value24719 = -1L;
/* 262658 */ isNull24720 = false;
/* 262659 */ value24720 = -1L;
/* 262660 */ isNull24728 = false;
/* 262661 */ value24728 = false;
/* 262662 */ isNull24729 = false;
/* 262663 */ value24729 = -1L;
/* 262664 */ isNull24730 = false;
/* 262665 */ value24730 = -1L;
/* 262666 */ isNull24738 = false;
...
```
where all of the `isNullNNN = false;` initialization code is redundant.
Alternatives are:
* Manually go through all the places where redundant initialization code is
specified in `addMutableState()` call sites, and change them to empty string.
That's tedious and involves changing a lot of code. But if we do it this way,
we could considering giving `initCode` an default value of `""` in the
declaration of `addMutableState()`, which could be a nice improvement too.
* Instead of using a simple regular-expression-based heuristic, use an
actual parse tree to verify the redundant code pattern. That would make it very
precise and be able to catch all cases, but it'll involve a bigger change,
potentially changing a bit of infrastructure in the codegen, e.g. codegen with
parse tree (AST) rather than strings. I'd like to take baby steps first and
move to more heavyweight options later.
## How was this patch tested?
Ran SQL and Catalyst unit tests. Also added some unit tests for the new
filtering heuristic.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/rednaxelafx/apache-spark
codegen-reduce-redundant-init
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18255.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18255
----
commit de9ce5b19719a629ceccc48389e856a4c1959174
Author: Kris Mok <[email protected]>
Date: 2017-06-08T17:46:15Z
Normalize the initCode in CodegenContext's mutableState by filtering out
redundant
default initialization code with a simple heuristic.
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]