Github user bdrillard commented on the issue:

    https://github.com/apache/spark/pull/19518
  
@kiszk You are correct that the current implementation compacts all mutable state (that is, all state that does not have to be _explicitly_ inlined).
    
To your last question, I attempted some analysis of the JVM bytecode for array-compacted versus inlined state, initialized either through method calls or in loops. I've posted the experiment and its results here: https://github.com/bdrillard/bytecode-poc
    
    If Spark has its own benchmarking tools, I'd be happy to use those to 
compare Catalyst-generated classes further.
    
To the general question of _when_ we compact state, I think some kind of threshold still makes sense. It would be best to ensure that the typical code path (for typical Dataset schemas) remains unaffected by these changes (as was the aim when generating nested classes in #18075).
    
I've found that setting a global threshold for when to compact mutable state can be hard. Some state _has_ to be inlined (state that uses parameterized constructors and can't easily be initialized in loops, like the `BufferHolder` and `UnsafeRowWriter`). I've hit situations where, due to the code generator's flow, we began by inlining an amount of state that _could have been_ compacted, then started compacting state once a set threshold was reached, but then began inlining state again that _could not be_ compacted, forcing us over the constant pool limit.
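    To make that distinction concrete, here is a minimal Java sketch (purely illustrative, not actual Catalyst output; all class, field, and constructor-argument names are made up) of inlined versus array-compacted state, and of the constructor-parameterized state that resists loop initialization:

    ```java
    // Illustrative sketch only, not generated Catalyst code.
    public class GeneratedClassSketch {
      // Inlined state: one field (and one constant pool entry) per variable.
      private boolean isNull_0;
      private boolean isNull_1;
      private int value_0;
      private int value_1;

      // Compacted state: many logically distinct variables share one array
      // field, so the number of declared fields stays constant.
      private final boolean[] mutableStateBooleans = new boolean[2];
      private final int[] mutableStateInts = new int[2];

      // State built through parameterized constructors resists loop
      // initialization because each instance needs its own arguments, so it
      // still has to be inlined (BufferHolder/UnsafeRowWriter are the real
      // cases; the arguments below are only illustrative):
      //   private UnsafeRowWriter rowWriter_0 = new UnsafeRowWriter(holder_0, 3);
      //   private UnsafeRowWriter rowWriter_1 = new UnsafeRowWriter(holder_1, 5);
    }
    ```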
    
It's difficult to tell whether a given piece of state will be referenced frequently or infrequently. For example, we do know that some pieces of primitive mutable state, like the global booleans used in conditional checks, are initialized globally, assigned once in one method, and then referenced only once in a separate caller method. These are excellent candidates for compaction, since they proliferate very quickly and are, in a sense, "only used once" (declared, initialized, re-assigned in one method, accessed in another, and never used again).
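    As a rough illustration of that pattern (again a hypothetical sketch, not generated Catalyst code), such a flag can move from a dedicated field into a slot of a shared array without changing how its two call sites look:

    ```java
    // Illustrative sketch of the "assigned once, read once" boolean pattern.
    public class BooleanStateSketch {
      // Inlined form: a dedicated field per flag.
      private boolean isNull_42;

      // Compacted form: the same flag occupies one slot of a shared array.
      private final boolean[] mutableStateBooleans = new boolean[128];

      private int applyInlined(int input) {
        isNull_42 = (input < 0);            // assigned once here
        return evalInlined(input);
      }

      private int evalInlined(int input) {
        return isNull_42 ? 0 : input * 2;   // read once here, never used again
      }

      private int applyCompacted(int input) {
        mutableStateBooleans[42] = (input < 0);
        return evalCompacted(input);
      }

      private int evalCompacted(int input) {
        return mutableStateBooleans[42] ? 0 : input * 2;
      }
    }
    ```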
    
Other pieces of state, like row objects and JavaBean objects, will be accessed a number of times proportional to how many fields they have, which isn't necessarily easy information to retrieve during code generation (we'd have to reflect over the object, or inspect its initialization code, to know how many fields it has). But these items are probably still good candidates for compaction in general, because of how many instances of a given type there can be.
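    A small sketch of why the access count scales with the field count (hypothetical code, not Catalyst output; the bean type is a stand-in for a user-supplied class):

    ```java
    // Illustrative sketch: each field of a bean produces at least one access
    // of the bean-typed state in the generated code.
    public class BeanAccessSketch {
      // Stand-in for a user-supplied JavaBean with three fields.
      public static class ExampleBean {
        private int a;
        private int b;
        private int c;
        public int getA() { return a; }
        public int getB() { return b; }
        public int getC() { return c; }
      }

      // Compacted form: the instance lives in a shared Object[] slot and is
      // cast once per method that uses it.
      private final Object[] mutableStateObjects = new Object[16];

      public BeanAccessSketch() {
        mutableStateObjects[0] = new ExampleBean();
      }

      private int sumFields() {
        ExampleBean bean = (ExampleBean) mutableStateObjects[0];
        // Three fields, three reads through the bean reference.
        return bean.getA() + bean.getB() + bean.getC();
      }
    }
    ```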
    
I'm inclined to use a threshold keyed on the name/type of the state, rather than a single global threshold. Since `freshName` always increases monotonically from 1 for a given variable prefix, we would know when the threshold for state of that type had been reached and could begin compacting that type of state independently of, and concurrently with, the other types of state. Such a scheme would keep the usual flow of code generation as it is now, with no state compaction for typical operations, and apply state compaction only in the more extreme cases that threaten to blow the constant pool limit.
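    As a rough sketch of that bookkeeping (a hypothetical helper, not Spark's actual `CodegenContext`; the threshold value is made up), the per-prefix decision could look something like:

    ```java
    // Illustrative sketch of a per-prefix compaction threshold.
    import java.util.HashMap;
    import java.util.Map;

    public class PerPrefixThresholdSketch {
      private static final int COMPACTION_THRESHOLD = 100;  // illustrative value
      private final Map<String, Integer> countsByPrefix = new HashMap<>();

      // Returns true once enough state with this prefix has been declared
      // that new state of the same prefix should be array-compacted rather
      // than declared as its own field.
      public boolean shouldCompact(String prefix) {
        int count = countsByPrefix.merge(prefix, 1, Integer::sum);
        return count > COMPACTION_THRESHOLD;
      }
    }
    ```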

