Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/16648
@cloud-fan Good question, and I think we can resolve it by using different
values for `N` in the
[testcase](https://github.com/apache/spark/pull/18075/files#diff-a14107cf4a4c41671bba24a82f6042d9R87)
I have in the other PR (which will translate to a number of string columns
deeper in the test). At `N = 4000`, we have a threshold where the amount of
local state + global state would trigger a `JaninoRuntimeException` on the
constant pool. #18075 can fix that issue at `N = 4000` by beginning to inline
functions to nested classes, thus reducing the amount of items counting towards
the constant pool, but we note that #18075 does nothing to address global
state.
We should also note that #18075 does slightly more than putting just member
variables into nested classes. While it is true that a significant degree of
local state alone that would get inlined to the Outer Class gets inlined
instead to nested classes instead with #18075, the patch leads to even more
reductions in the size of the constant pool, since there are additional items
that get inlined to nested classes that also count towards the limit (e.g.
field references, method references, variable types, method types, etc, see
[Java class file, The Constant
Pool](https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool)).
The second feature (included here, but not in #18075), is precisely as you
describe: it takes simply declared fields that would be inlined globally and
compacts them into like-typed and like-declared arrays.
However, if we set `N = 8000` (even assuming the patch in #18075), we can
trigger yet another `JaninoRuntimeException`, this time because the amount of
global state (plus any local state that was inlined to the Outer Class and not
any single subclass) is sufficiently great to cause the exception. However, if
we include mutable state compaction and class splitting, like using the method
included in this PR #16648, we can set `N` to a value greater than 10,000 (I
had success for the test still at `N = 12000`, but at 16,000 my machine began
to thrash).
Conversely, if we only include mutable state compaction at `N = 8000`, and
exclude class splitting, there are instances where we actually end up with very
little global state, but the amount of local state and functions inlined to the
Outer Class is still sufficient to exceed the constant pool limit. This can
occur if we have a great number of primitive columns, like `N = 8000` integer
columns.
Looking at both #18075 and this pull-request, I think the takeaway is that
even if all we do for the moment is split excess code among nested classes, we
can still make a significant gain in the number of columns a Dataset can hold,
which gives #18075 merit on its own. If we want to increase that limit even
more though, we'll have to address proliferation of global state as well,
perhaps by opening a follow-up PR that focuses on it more closely, maybe using
the compaction strategy I've attempted here in #16648, or by exploring another
method.
Thoughts on this?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]