[PR] [SPARK-37448][SQL] Multiple performance optimizations related to CurrentOrigin.withOrigin [spark]

via GitHub Thu, 06 Jun 2024 16:55:40 -0700


JoshRosen opened a new pull request, #46908:
URL: https://github.com/apache/spark/pull/46908


   ### What changes were proposed in this pull request?
   
   This PR implements multiple performance optimizations related to 
`CurrentOrigin.withOrigin`.
   
   #### Avoiding `withOrigin` in cases where PartialFunctions are undefined
   
   `CurrentOrigin.withOrigin` can be expensive because it currently performs a 
ThreadLocal `get` and two ThreadLocal `set` calls on every call, plus the 
overhead of constructing the lambda to pass to `withOrigin` itself.
   
   Many parts of Catalyst have a pattern where a tree transformation method 
will contain code like
   
   ```scala
   val afterRule = CurrentOrigin.withOrigin {
     rule.applyOrElse(this, identity[BaseType])
   }
   ```
   
   where `rule` is a PartialFunction. With this code, we must pay the cost of 
`withOrigin` even in cases where the `orElse` branch (which returns 
`identity[BaseType](this)` a.k.a. `this`) is hit because the PartialFunction 
does not apply (e.g. because we have a transformation which is only defined for 
`Alias` but is invoked over an entire expression tree).
   
   If we instead rewrite this as
   
   ```scala
   val afterRule = if (rule.isDefinedAt(this)) {
     CurrentOrigin.withOrigin(origin) {
       rule.apply(this)
     }
   } else {
     this
   }
   ```
   
   we can avoid the `withOrigin` overhead in cases where the rule / 
PartialFunction is not defined.
   
   This is essentially free because the [default implementation of 
PartialFunction.applyOrElse](https://github.com/scala/scala/blob/08060e8148972fbec3aff3b56246852adf49605c/src/library/scala/PartialFunction.scala#L213-L214)
 is
   
   ```scala
   def applyOrElse[A1 <: A, B1 >: B](x: A1, default: A1 => B1): B1 =
       if (isDefinedAt(x)) apply(x) else default(x)
   ```
   
   I applied this style of optimization in `TreeNode`, `QueryPlan`, and 
`AnalysisHelper`.
   
   #### Apply similar skipping optimization in `ColumnResolutionHelper`:
   
   In `ColumnResolutionHelper` there were two methods that did something like
   
   ```scala
   CurrentOrigin.withOrigin(origin) {
     if (earlyExitCondition) return
     // rest of code
   }
   ```
   
   which I modified to
   
   ```scala
   if (earlyExitCondition) return
   CurrentOrigin.withOrigin(origin) {
     // rest of code
   }
   ```
   
   to avoid the `withOrigin` overhead in the early-exit case.
   
   #### Small optimization to `withOrigin` itself:
   
   `withOrigin` may often be setting a new origin that is exactly the same 
object as the current origin. We can detect this case with a cheap reference 
equality check and then skip a ThreadLocal set.
   
   ### Why are the changes needed?
   
   Improves Catalyst performance, both in production and in unit tests.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   
   ### How was this patch tested?
   
   I believe that should be covered by existing unit tests.
   
   With these changes, I observed a ~10%+ end-to-end runtime improvement on 
certain large test suites in our fork of Apache Spark.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [SPARK-37448][SQL] Multiple performance optimizations related to CurrentOrigin.withOrigin [spark]

Reply via email to