Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/16648
  
    @cloud-fan Good question, and I think we can resolve it by using different 
values for `N` in the 
[testcase](https://github.com/apache/spark/pull/18075/files#diff-a14107cf4a4c41671bba24a82f6042d9R87)
 I have in the other PR (which will translate to a number of string columns 
deeper in the test). At `N = 4000`, we have a threshold where the amount of 
local state + global state would trigger a `JaninoRuntimeException` on the 
constant pool. #18075 can fix that issue at `N = 4000` by beginning to inline 
functions to nested classes, thus reducing the amount of items counting towards 
the constant pool, but we note that #18075 does nothing to address global 
state. 
    
    We should also note that #18075 does slightly more than putting just member 
variables into nested classes. While it is true that a significant degree of 
local state alone that would get inlined to the Outer Class gets inlined 
instead to nested classes instead with #18075, the patch leads to even more 
reductions in the size of the constant pool, since there are additional items 
that get inlined to nested classes that also count towards the limit (e.g. 
field references, method references, variable types, method types, etc, see 
[Java class file, The Constant 
Pool](https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool)). 
    
    The second feature (included here, but not in #18075), is precisely as you 
describe: it takes simply declared fields that would be inlined globally and 
compacts them into like-typed and like-declared arrays.
    
    However, if we set `N = 8000` (even assuming the patch in #18075), we can 
trigger yet another `JaninoRuntimeException`, this time because the amount of 
global state (plus any local state that was inlined to the Outer Class and not 
any single subclass) is sufficiently great to cause the exception. However, if 
we include mutable state compaction and class splitting, like using the method 
included in this PR #16648, we can set `N` to a value greater than 10,000 (I 
had success for the test still at `N = 12000`, but at 16,000 my machine began 
to thrash). 
    
    Conversely, if we only include mutable state compaction at `N = 8000`, and 
exclude class splitting, there are instances where we actually end up with very 
little global state, but the amount of local state and functions inlined to the 
Outer Class is still sufficient to exceed the constant pool limit. This can 
occur if we have a great number of primitive columns, like `N = 8000` integer 
columns. 
    
    Looking at both #18075 and this pull-request, I think the takeaway is that 
even if all we do for the moment is split excess code among nested classes, we 
can still make a significant gain in the number of columns a Dataset can hold, 
which gives #18075 merit on its own. If we want to increase that limit even 
more though, we'll have to address proliferation of global state as well, 
perhaps by opening a follow-up PR that focuses on it more closely, maybe using 
the compaction strategy I've attempted here in #16648, or by exploring another 
method.
    
    Thoughts on this?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to