This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 4b37eb8169c9 [SPARK-48678][CORE] Performance optimizations for SparkConf.get(ConfigEntry)
4b37eb8169c9 is described below
commit 4b37eb8169c948319dfe400516690081c5219db5
Author: Josh Rosen <[email protected]>
AuthorDate: Sun Jun 23 10:58:41 2024 +0900
[SPARK-48678][CORE] Performance optimizations for SparkConf.get(ConfigEntry)
### What changes were proposed in this pull request?
This PR proposes two micro-optimizations for `SparkConf.get(ConfigEntry)`:
1. Avoid costly `Regex.replaceAllIn` for variable substitution: if the
config value does not contain the substring `${` then it cannot possibly
contain any variables, so we can completely skip the regex evaluation in such
cases.
2. Avoid Scala collections operations, including `List.flatten` and
`AbstractIterable.mkString`, for the common case where a configuration does not
define a prepended configuration key.
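As a standalone illustration of the first optimization (outside Spark, with a simplified stand-in for `ConfigReader.REF_RE` and a plain `Map` instead of a config provider), the idea is just to guard the regex call with a cheap substring check:

```scala
import scala.util.matching.Regex

object SubstitutionSketch {
  // Simplified stand-in for ConfigReader.REF_RE: matches ${name} or ${prefix:name}.
  private val RefRe: Regex = """\$\{(?:(\w+?):)?(\S+?)\}""".r

  def substitute(input: String, vars: Map[String, String]): String = {
    // Cheap check first: a string that does not contain "${" cannot contain a
    // variable reference, so the regex machinery is skipped in the common case.
    if (input != null && input.contains("${")) {
      RefRe.replaceAllIn(input,
        m => Regex.quoteReplacement(vars.getOrElse(m.group(2), m.matched)))
    } else {
      input
    }
  }
}
```

For a typical config value like `"true"`, the `contains` check short-circuits and `replaceAllIn` is never invoked.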
### Why are the changes needed?
Improve performance.
This is primarily motivated by unit testing and benchmarking scenarios, but it will also slightly benefit production queries.
Spark tries to avoid excessive configuration reading in hot paths (e.g. via changes like https://github.com/apache/spark/pull/46979). If such read paths are accidentally introduced, though, this PR's optimizations will greatly reduce the associated performance penalty.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Correctness should be covered by existing unit tests.
To measure performance, I did some manual benchmarking by running
```
val conf = new SparkConf()
conf.set("spark.network.crypto.enabled", "true")
```
followed by
```
conf.get(NETWORK_CRYPTO_ENABLED)
```
10 million times in a loop.
On my laptop, the optimized code achieves ~7.5x higher throughput than the original.
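The benchmark harness itself was not included in the PR; a rough standalone sketch of this kind of timing loop (with a hypothetical `readConfig` stand-in for `conf.get(NETWORK_CRYPTO_ENABLED)`, since a real `SparkConf` is not available outside Spark) might look like:

```scala
object ConfigReadBench {
  // Hypothetical stand-in for conf.get(NETWORK_CRYPTO_ENABLED): in the real
  // benchmark this was a SparkConf lookup of a boolean config entry.
  private val settings = Map("spark.network.crypto.enabled" -> "true")
  def readConfig(): Boolean = settings("spark.network.crypto.enabled").toBoolean

  def main(args: Array[String]): Unit = {
    val iterations = 10000000 // 10 million reads, as in the PR description
    // Warm-up pass so the JIT compiles the hot loop before measuring.
    var i = 0
    while (i < iterations) { readConfig(); i += 1 }
    // Measured pass.
    val start = System.nanoTime()
    i = 0
    while (i < iterations) { readConfig(); i += 1 }
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$iterations reads in $elapsedMs ms")
  }
}
```

A simple wall-clock loop like this is noisy; a harness such as JMH would give more rigorous numbers, but for a ~7.5x difference a manual loop is usually sufficient to see the effect.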
We can also compare the before-and-after flamegraphs from a `while(true)`
configuration reading loop, showing a clear difference in hotspots before and
after this change:
**Before**: [flamegraph screenshot]

**After**: [flamegraph screenshot]
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Github Copilot
Closes #47049 from JoshRosen/SPARK-48678-sparkconf-perf-optimizations.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.../org/apache/spark/internal/config/ConfigEntry.scala | 16 ++++++++--------
.../org/apache/spark/internal/config/ConfigReader.scala | 4 +++-
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala b/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
index a295ef06a637..17d3329e6b49 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
@@ -89,14 +89,14 @@ private[spark] abstract class ConfigEntry[T] (
   def defaultValueString: String

   protected def readString(reader: ConfigReader): Option[String] = {
-    val values = Seq(
-      prependedKey.flatMap(reader.get(_)),
-      alternatives.foldLeft(reader.get(key))((res, nextKey) => res.orElse(reader.get(nextKey)))
-    ).flatten
-    if (values.nonEmpty) {
-      Some(values.mkString(prependSeparator))
-    } else {
-      None
+    // SPARK-48678: performance optimization: this code could be expressed more succinctly
+    // using flatten and mkString, but doing so adds lots of Scala collections perf. overhead.
+    val maybePrependedValue: Option[String] = prependedKey.flatMap(reader.get)
+    val maybeValue: Option[String] = alternatives
+      .foldLeft(reader.get(key))((res, nextKey) => res.orElse(reader.get(nextKey)))
+    (maybePrependedValue, maybeValue) match {
+      case (Some(prependedValue), Some(value)) => Some(s"$prependedValue$prependSeparator$value")
+      case _ => maybeValue.orElse(maybePrependedValue)
     }
   }
diff --git a/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala b/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
index c1ab22150d02..8824d196489a 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
@@ -84,7 +84,9 @@ private[spark] class ConfigReader(conf: ConfigProvider) {
   def substitute(input: String): String = substitute(input, Set())

   private def substitute(input: String, usedRefs: Set[String]): String = {
-    if (input != null) {
+    // SPARK-48678: performance optimization: skip the costly regex processing
+    // if the string cannot possibly contain a variable reference:
+    if (input != null && input.contains("${")) {
       ConfigReader.REF_RE.replaceAllIn(input, { m =>
         val prefix = m.group(1)
         val name = m.group(2)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]