Github user bdrillard commented on a diff in the pull request:
https://github.com/apache/spark/pull/16648#discussion_r105502401
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/MonotonicallyIncreasingID.scala
---
@@ -67,14 +67,15 @@ case class MonotonicallyIncreasingID() extends LeafExpression with Nondeterminis
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
     val countTerm = ctx.freshName("count")
     val partitionMaskTerm = ctx.freshName("partitionMask")
-    ctx.addMutableState(ctx.JAVA_LONG, countTerm, "")
-    ctx.addMutableState(ctx.JAVA_LONG, partitionMaskTerm, "")
-    ctx.addPartitionInitializationStatement(s"$countTerm = 0L;")
-    ctx.addPartitionInitializationStatement(s"$partitionMaskTerm = ((long) partitionIndex) << 33;")
+    val countTermAccessor = ctx.addMutableState(ctx.JAVA_LONG, countTerm, "")
+    val partitionMaskTermAccessor = ctx.addMutableState(ctx.JAVA_LONG, partitionMaskTerm, "")
+    ctx.addPartitionInitializationStatement(s"$countTermAccessor = 0L;")
+    ctx.addPartitionInitializationStatement(
+      s"$partitionMaskTermAccessor = ((long) partitionIndex) << 33;")
     ev.copy(code = s"""
-      final ${ctx.javaType(dataType)} ${ev.value} = $partitionMaskTerm + $countTerm;
-      $countTerm++;""", isNull = "false")
+      final ${ctx.javaType(dataType)} ${ev.value} = $partitionMaskTermAccessor + $countTermAccessor;
+      $countTermAccessor++;""", isNull = "false")
--- End diff --
Having
[`addMutableState`](https://github.com/apache/spark/pull/16648/files#diff-8bcc5aea39c73d4bf38aef6f6951d42cR181)
return an accessor string is an important part of addressing how mutable
state can contribute to Constant Pool errors. Code that creates mutable state
usually takes for granted that the symbol used to declare the state will be
inlined as a private member variable of the class. However, for sufficiently
complicated schemas, mutable state and its initialization alone can breach
the Constant Pool limit. The strategy I settled on was to have mutable state
potentially be compacted into arrays of like type and initialization. This
way, we reduce the number of references that count toward the Constant Pool
limit. Of course, if the mutable state is stored in an array rather than in a
private variable named after the symbol, we need to return the accessor for
that index in the compacted mutable state array, hence the 'Accessor'
suffixes. I had also tried a class-based approach, in which excess mutable
state would become static members of nested classes, but the initialization
functions for that state could still exceed the Constant Pool limit.
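
To make the accessor idea concrete, here is a minimal sketch of a context whose `addMutableState` returns the expression callers should use instead of the raw symbol. It is not the code in this PR: `SketchCodegenContext`, `inlineLimit`, and the `mutableStateArray_<type>` naming are hypothetical, and the "inline the first N, then spill into an array" policy is an assumption chosen only to show both outcomes.

```scala
import scala.collection.mutable

// Hypothetical stand-in for CodegenContext; not Spark's actual implementation.
class SketchCodegenContext(inlineLimit: Int) {
  // (javaType, name) pairs that are declared as individual private fields.
  private val inlined = mutable.ArrayBuffer.empty[(String, String)]
  // javaType -> number of slots handed out so far in that type's array.
  private val arraySlots = mutable.LinkedHashMap.empty[String, Int]

  /** Registers a piece of state and returns the accessor callers must use. */
  def addMutableState(javaType: String, name: String): String = {
    if (inlined.size < inlineLimit) {
      // Assumed policy: the first `inlineLimit` states stay plain fields.
      inlined += ((javaType, name))
      name
    } else {
      // Everything after that is compacted into one array per Java type.
      val index = arraySlots.getOrElse(javaType, 0)
      arraySlots(javaType) = index + 1
      s"mutableStateArray_${javaType}[${index}]"
    }
  }

  /** Java field declarations the generated class would contain. */
  def declarations: Seq[String] = {
    val fields = inlined.map { case (t, n) => s"private ${t} ${n};" }.toSeq
    val arrays = arraySlots.map { case (t, n) =>
      s"private ${t}[] mutableStateArray_${t} = new ${t}[${n}];"
    }.toSeq
    fields ++ arrays
  }
}

object AccessorDemo {
  val ctx = new SketchCodegenContext(inlineLimit = 1)
  // Mirrors the diff: keep the returned accessor, not the original symbol.
  val countAccessor: String = ctx.addMutableState("long", "count_0")
  val maskAccessor: String = ctx.addMutableState("long", "partitionMask_0")
  val code: String =
    s"final long value = $maskAccessor + $countAccessor; $countAccessor++;"
}
```

In this sketch `countAccessor` is still the plain name `count_0`, while `maskAccessor` is `mutableStateArray_long[0]`, which is why generated code has to interpolate the returned string rather than the symbol it asked for.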
This PR boils down to two core components for working around the
(hard-and-fast) Constant Pool limit:
* split excess code among classes
* compact excess mutable state into arrays
I should mention that not *all* mutable state is compacted into arrays: only
primitives and collections of simply-assigned objects (null-assigned, or with
no assignment) are. Even so, this array compaction strategy reduces
references enough that even complex schemas, for which we would potentially
generate more than 2^16 pieces of state, can still be converted to datasets.
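
As a rough illustration of why compaction helps, the sketch below counts field declarations with and without it. Everything here is an assumption made for the example: `StateSpec`, `isCompactable`, and `fieldCount` are hypothetical names, the eligibility rule is only my paraphrase of the one described above, and field declarations are used as a crude proxy for constant pool entries.

```scala
// Hypothetical model of one piece of generated mutable state.
// `initValue` is the assigned value: "" (no assignment), "null", "0L", ...
final case class StateSpec(javaType: String, name: String, initValue: String)

object CompactionSketch {
  private val primitives =
    Set("boolean", "byte", "short", "char", "int", "long", "float", "double")

  // Assumed eligibility rule: primitives, or objects that are either
  // unassigned or assigned null.
  def isCompactable(s: StateSpec): Boolean =
    primitives.contains(s.javaType) || s.initValue.isEmpty || s.initValue == "null"

  // Field declarations needed when eligible state is grouped into one
  // array per (type, initialization) pair and the rest stays inlined.
  def fieldCount(states: Seq[StateSpec]): Int = {
    val (compactable, inlined) = states.partition(isCompactable)
    val arrays = compactable.groupBy(s => (s.javaType, s.initValue)).size
    inlined.size + arrays
  }
}

object CompactionDemo {
  // 70,000 long counters plus one object with a non-trivial initializer.
  val states: Seq[StateSpec] =
    (0 until 70000).map(i => StateSpec("long", s"count_$i", "0L")) :+
      StateSpec("Foo", "foo_0", "new Foo()")

  val limit: Int = 1 << 16                                   // the 2^16 figure above
  val withoutCompaction: Int = states.size                   // one field per state
  val withCompaction: Int = CompactionSketch.fieldCount(states)
}
```

With these inputs, one field per state would need 70,001 declarations, which is over 2^16, whereas the compacted layout needs only two: one `long[]` and one inlined `Foo`.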