Github user bdrillard commented on the issue:
https://github.com/apache/spark/pull/19518
@kiszk You are correct that the current implementation compacts all mutable
state (where the state does not have to be _explicitly_ inlined).
To your last question, I attempted some analysis of the JVM bytecode for
array-based versus inlined state, initialized either through method calls or in loops.
I've posted the experiment and results here: https://github.com/bdrillard/bytecode-poc
If Spark has its own benchmarking tools, I'd be happy to use those to
compare Catalyst-generated classes further.
To the general question of _when_ we compact state, I think some kind of
threshold still makes sense. It would be best to ensure that the typical
code path (for typical Dataset schemas) remains unaffected by the changes (as
was the aim when generating nested classes in #18075).
I've found that setting a single global threshold for when to compact mutable
state can be hard. Some state _has_ to be inlined (state that uses
parameterized constructors and can't easily be initialized in loops, like
the `BufferHolder` and `UnsafeRowWriter`). I've hit situations where, due to
the flow of the code generator, we began by inlining a large amount of state
that _could have been_ compacted, then started compacting state once a set
threshold was reached, but then went back to inlining state that _could not
be_ compacted, pushing us over the constant pool limit.
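To make that ordering issue concrete, here is a small, hypothetical Scala
sketch (not Spark's actual `CodegenContext`, and the counts are made up) that
simulates a single global threshold over a mix of compactable and must-inline
state:

```scala
// Hypothetical sketch: count how many individual fields a single global
// threshold would still emit when compactable and must-inline state interleave.
object GlobalThresholdSketch {
  case class State(prefix: String, mustInline: Boolean)

  def inlinedFieldCount(states: Seq[State], globalThreshold: Int): Int = {
    var total = 0    // all mutable state registered so far
    var inlined = 0  // state emitted as individual class fields
    states.foreach { s =>
      val compacted = !s.mustInline && total >= globalThreshold
      if (!compacted) inlined += 1
      total += 1
    }
    inlined
  }

  def main(args: Array[String]): Unit = {
    // 500 compactable values are registered before 300 writer-style states
    // that must be inlined: the first 200 compactable ones land as fields
    // before the threshold fires, and all 300 must-inline ones pile on after.
    val states = Seq.fill(500)(State("value", mustInline = false)) ++
      Seq.fill(300)(State("rowWriter", mustInline = true))
    println(inlinedFieldCount(states, globalThreshold = 200)) // prints 500
  }
}
```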
It's difficult to tell when a certain piece of state will be referenced
frequently or infrequently. For example, we do know some pieces of primitive
mutable state, like global booleans that are part of conditional checks, are
initialized globally, assigned once in one method, and then referenced only
once in a separate caller method. These are excellent candidates for
compaction, since they proliferate very quickly and are, in a sense, "only used
once" (declared, initialized, re-assigned in a method, accessed in another
method, never used again).
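As a rough illustration (hypothetical field names, not Spark's actual generated
code), compacting N such booleans trades N field declarations, each with its own
constant pool entries, for a single array field with indexed accesses:

```scala
// Hypothetical illustration of the two layouts for N boolean flags in a
// generated class: one field per flag vs. a single compacted boolean array.
object BooleanStateSketch {
  def inlined(n: Int): String =
    (0 until n).map(i => s"private boolean isNull_$i = false;").mkString("\n")

  def compacted(n: Int): String =
    s"private boolean[] booleanStates = new boolean[$n];"

  def main(args: Array[String]): Unit = {
    println(inlined(3))   // three field declarations, three sets of pool entries
    println(compacted(3)) // one field declaration; accesses become booleanStates[i]
  }
}
```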
Other pieces of state, like row objects and JavaBean objects, will be
accessed a number of times proportional to how many fields they have, which isn't
necessarily easy information to retrieve during code generation (we'd have to use
reflection or inspect the initialization code to know how many fields such an
object has). But these items are probably still good candidates for compaction
in general, simply because of how many instances of a given type there can be.
I'm inclined to use a threshold keyed on the name/type of the state, rather
than a global threshold. Since `freshName` is always monotonically increasing
from 1 for a given variable prefix, we would know when the threshold for state of
that type had been reached, and could begin compacting that type of state
independently of (and concurrently with) the other types of state. Such a scheme
would let us ensure the usual flow of code generation remains as it is now, with
no state compaction for typical operations, and state compaction only in the
more extreme cases that threaten to blow the constant pool limit.
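A minimal sketch of the per-prefix idea (hypothetical API and threshold value,
not a patch against `CodegenContext`): because fresh names for a given prefix
are numbered monotonically, a per-prefix counter can switch just that prefix
over to array-compacted storage once it crosses a threshold, leaving other
prefixes, and all typical plans, on the inlined path:

```scala
import scala.collection.mutable

// Hypothetical per-prefix compaction: each prefix keeps its own counter, so
// compaction kicks in independently for whichever kind of state proliferates.
class PerPrefixStateSketch(threshold: Int = 100) {
  private val counts = mutable.Map.empty[String, Int].withDefaultValue(0)

  /** Returns the code used to reference the new state slot: an individual
   *  field name below the threshold, an array slot once the prefix exceeds it. */
  def addMutableState(prefix: String): String = {
    val i = counts(prefix)
    counts(prefix) = i + 1
    if (i < threshold) s"${prefix}_$i"       // inlined field; typical plans unchanged
    else s"${prefix}Array[${i - threshold}]" // compacted slot for wide plans
  }
}

object PerPrefixStateSketchExample extends App {
  val ctx = new PerPrefixStateSketch(threshold = 2)
  println(ctx.addMutableState("isNull")) // isNull_0
  println(ctx.addMutableState("isNull")) // isNull_1
  println(ctx.addMutableState("isNull")) // isNullArray[0]
  println(ctx.addMutableState("value"))  // value_0 (independent counter)
}
```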