This is an automated email from the ASF dual-hosted git repository.
gurwls223 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new 4b37eb8169c9 [SPARK-48678][CORE] Performance optimizations for SparkConf.get(ConfigEntry)
4b37eb8169c9 is described below
commit 4b37eb8169c948319dfe400516690081c5219db5
Author: Josh Rosen <[email protected]>
AuthorDate: Sun Jun 23 10:58:41 2024 +0900
[SPARK-48678][CORE] Performance optimizations for SparkConf.get(ConfigEntry)
### What changes were proposed in this pull request?
This PR proposes two micro-optimizations for `SparkConf.get(ConfigEntry)`:
1. Avoid costly `Regex.replaceAllIn` for variable substitution: if the
config value does not contain the substring `${` then it cannot possibly
contain any variables, so we can completely skip the regex evaluation in such
cases.
2. Avoid Scala collections operations, including `List.flatten` and
`AbstractIterable.mkString`, for the common case where a configuration does not
define a prepended configuration key.
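As a standalone illustration of the first optimization (outside Spark, with a simplified stand-in for `ConfigReader.REF_RE` and a plain `Map` instead of a config provider), the idea is just to guard the regex call with a cheap substring check:

```scala
import scala.util.matching.Regex

object SubstitutionSketch {
  // Simplified stand-in for ConfigReader.REF_RE: matches ${name} or ${prefix:name}.
  private val RefRe: Regex = """\$\{(?:(\w+?):)?(\S+?)\}""".r

  def substitute(input: String, vars: Map[String, String]): String = {
    // Cheap check first: a string that does not contain "${" cannot contain a
    // variable reference, so the regex machinery is skipped in the common case.
    if (input != null && input.contains("${")) {
      RefRe.replaceAllIn(input,
        m => Regex.quoteReplacement(vars.getOrElse(m.group(2), m.matched)))
    } else {
      input
    }
  }
}
```

For a typical config value like `"true"`, the `contains` check short-circuits and `replaceAllIn` is never invoked.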
### Why are the changes needed?
Improve performance.
This is primarily motivated by unit testing and benchmarking scenarios, but it will also slightly benefit production queries.
Spark tries to avoid excessive configuration reading in hot paths (e.g. via changes like https://github.com/apache/spark/pull/46979). If such read paths are accidentally introduced, though, this PR's optimizations will greatly reduce the associated performance penalty.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Correctness should be covered by existing unit tests.
To measure performance, I did some manual benchmarking by running
```
val conf = new SparkConf()
conf.set("spark.network.crypto.enabled", "true")
```
followed by
```
conf.get(NETWORK_CRYPTO_ENABLED)
```
10 million times in a loop.
On my laptop, the optimized code achieves ~7.5x higher throughput than the original.
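The benchmark harness itself was not included in the PR; a rough standalone sketch of this kind of timing loop (with a hypothetical `readConfig` stand-in for `conf.get(NETWORK_CRYPTO_ENABLED)`, since a real `SparkConf` is not available outside Spark) might look like:

```scala
object ConfigReadBench {
  // Hypothetical stand-in for conf.get(NETWORK_CRYPTO_ENABLED): in the real
  // benchmark this was a SparkConf lookup of a boolean config entry.
  private val settings = Map("spark.network.crypto.enabled" -> "true")
  def readConfig(): Boolean = settings("spark.network.crypto.enabled").toBoolean

  def main(args: Array[String]): Unit = {
    val iterations = 10000000 // 10 million reads, as in the PR description
    // Warm-up pass so the JIT compiles the hot loop before measuring.
    var i = 0
    while (i < iterations) { readConfig(); i += 1 }
    // Measured pass.
    val start = System.nanoTime()
    i = 0
    while (i < iterations) { readConfig(); i += 1 }
    val elapsedMs = (System.nanoTime() - start) / 1000000
    println(s"$iterations reads in $elapsedMs ms")
  }
}
```

A simple wall-clock loop like this is noisy; a harness such as JMH would give more rigorous numbers, but for a ~7.5x difference a manual loop is usually sufficient to see the effect.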
We can also compare the before-and-after flamegraphs from a `while(true)`
configuration reading loop, showing a clear difference in hotspots before and
after this change:
**Before**: [flamegraph screenshot]

**After**: [flamegraph screenshot]
### Was this patch authored or co-authored using generative AI tooling?
Generated-by: Github Copilot
Closes #47049 from JoshRosen/SPARK-48678-sparkconf-perf-optimizations.
Authored-by: Josh Rosen <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
---
.../org/apache/spark/internal/config/ConfigEntry.scala | 16 ++++++++--------
.../org/apache/spark/internal/config/ConfigReader.scala | 4 +++-
2 files changed, 11 insertions(+), 9 deletions(-)
diff --git a/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala b/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
index a295ef06a637..17d3329e6b49 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala
@@ -89,14 +89,14 @@ private[spark] abstract class ConfigEntry[T] (
   def defaultValueString: String

   protected def readString(reader: ConfigReader): Option[String] = {
-    val values = Seq(
-      prependedKey.flatMap(reader.get(_)),
-      alternatives.foldLeft(reader.get(key))((res, nextKey) => res.orElse(reader.get(nextKey)))
-    ).flatten
-    if (values.nonEmpty) {
-      Some(values.mkString(prependSeparator))
-    } else {
-      None
+    // SPARK-48678: performance optimization: this code could be expressed more succinctly
+    // using flatten and mkString, but doing so adds lots of Scala collections perf. overhead.
+    val maybePrependedValue: Option[String] = prependedKey.flatMap(reader.get)
+    val maybeValue: Option[String] = alternatives
+      .foldLeft(reader.get(key))((res, nextKey) => res.orElse(reader.get(nextKey)))
+    (maybePrependedValue, maybeValue) match {
+      case (Some(prependedValue), Some(value)) => Some(s"$prependedValue$prependSeparator$value")
+      case _ => maybeValue.orElse(maybePrependedValue)
     }
   }
diff --git a/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala b/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
index c1ab22150d02..8824d196489a 100644
--- a/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
+++ b/core/src/main/scala/org/apache/spark/internal/config/ConfigReader.scala
@@ -84,7 +84,9 @@ private[spark] class ConfigReader(conf: ConfigProvider) {
   def substitute(input: String): String = substitute(input, Set())

   private def substitute(input: String, usedRefs: Set[String]): String = {
-    if (input != null) {
+    // SPARK-48678: performance optimization: skip the costly regex processing
+    // if the string cannot possibly contain a variable reference:
+    if (input != null && input.contains("${")) {
       ConfigReader.REF_RE.replaceAllIn(input, { m =>
         val prefix = m.group(1)
         val name = m.group(2)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]