JoshRosen opened a new pull request, #46908:
URL: https://github.com/apache/spark/pull/46908
### What changes were proposed in this pull request?
This PR implements multiple performance optimizations related to
`CurrentOrigin.withOrigin`.
#### Avoiding `withOrigin` in cases where PartialFunctions are undefined
`CurrentOrigin.withOrigin` can be expensive because it currently performs a
ThreadLocal `get` and two ThreadLocal `set` calls on every call, plus the
overhead of constructing the lambda to pass to `withOrigin` itself.
Many parts of Catalyst have a pattern where a tree transformation method
will contain code like
```scala
val afterRule = CurrentOrigin.withOrigin {
rule.applyOrElse(this, identity[BaseType])
}
```
where `rule` is a PartialFunction. With this code, we must pay the cost of
`withOrigin` even in cases where the `orElse` branch (which returns
`identity[BaseType](this)` a.k.a. `this`) is hit because the PartialFunction
does not apply (e.g. because we have a transformation which is only defined for
`Alias` but is invoked over an entire expression tree).
If we instead rewrite this as
```scala
val afterRule = if (rule.isDefinedAt(this)) {
CurrentOrigin.withOrigin(origin) {
rule.apply(this)
}
} else {
this
}
```
we can avoid the `withOrigin` overhead in cases where the rule /
PartialFunction is not defined.
This is essentially free because the [default implementation of
PartialFunction.applyOrElse](https://github.com/scala/scala/blob/08060e8148972fbec3aff3b56246852adf49605c/src/library/scala/PartialFunction.scala#L213-L214)
is
```scala
def applyOrElse[A1 <: A, B1 >: B](x: A1, default: A1 => B1): B1 =
if (isDefinedAt(x)) apply(x) else default(x)
```
I applied this style of optimization in `TreeNode`, `QueryPlan`, and
`AnalysisHelper`.
#### Apply similar skipping optimization in `ColumnResolutionHelper`:
In `ColumnResolutionHelper` there were two methods that did something like
```scala
CurrentOrigin.withOrigin(origin) {
if (earlyExitCondition) return
// rest of code
}
```
which I modified to
```scala
if (earlyExitCondition) return
CurrentOrigin.withOrigin(origin) {
// rest of code
}
```
to avoid the `withOrigin` overhead in the early-exit case.
#### Small optimization to `withOrigin` itself:
`withOrigin` may often be setting a new origin that is exactly the same
object as the current origin. We can detect this case with a cheap reference
equality check and then skip a ThreadLocal set.
### Why are the changes needed?
Improves Catalyst performance, both in production and in unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I believe that should be covered by existing unit tests.
With these changes, I observed a ~10%+ end-to-end runtime improvement on
certain large test suites in our fork of Apache Spark.
### Was this patch authored or co-authored using generative AI tooling?
No.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]