(spark) branch master updated: [SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace

wenchen Wed, 03 Jun 2026 20:55:53 -0700

This is an automated email from the ASF dual-hosted git repository.

cloud-fan pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git



The following commit(s) were added to refs/heads/master by this push:
     new d35df2e6e05d [SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace
d35df2e6e05d is described below

commit d35df2e6e05df3c5963eb8eeab4ac4a69732fe53
Author: pchintar <[email protected]>
AuthorDate: Thu Jun 4 11:55:32 2026 +0800

    [SPARK-57190][SQL] Fix API inconsistency for 4-argument regexp_replace
    
    ### What changes were proposed in this pull request?
    
    Spark SQL already supports the 4-argument form of `regexp_replace`:
    
    ```sql
    regexp_replace(str, regexp, replacement, position)
    ```
    
    However, the corresponding Scala, PySpark, and Spark Connect APIs currently expose only the 3-argument variants.
    
    This PR exposes the existing 4-argument functionality through:
    
    * Scala API (`functions.regexp_replace`)
    * PySpark API (`functions.regexp_replace`)
    * Spark Connect API (`functions.regexp_replace`)
    
    and adds corresponding Scala, PySpark, and Connect test coverage.
    
    ### Why are the changes needed?
    
    The underlying SQL functionality already exists and is available through SQL expressions, but it is not accessible through the public Scala, PySpark, and Spark Connect APIs.
    
    This creates an inconsistency between SQL and programmatic interfaces. Exposing the optional `position` argument aligns the public APIs with existing SQL functionality.
    
    ### Does this PR introduce *any* user-facing change?
    
    Yes.
    
    Users can now specify the optional `position` argument through the Scala, PySpark, and Spark Connect APIs.
    
    Before:
    
    ```scala
    regexp_replace(col("s"), "(\\d+)", "--")
    ```
    
    After:
    
    ```scala
    regexp_replace(col("s"), "(\\d+)", "--", 5)
    ```
    
    Similarly, PySpark users can now call:
    
    ```python
    F.regexp_replace("s", r"(\d+)", "--", 5)
    ```
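
To make the effect of `position` concrete, here is a rough pure-Python emulation of the 1-based semantics described above. It is only a sketch and not Spark code: `emulate_regexp_replace` is a hypothetical helper, it uses Python's `re` module rather than Java regex, and it slices the string instead of restricting a matcher region, so anchors and lookbehind may behave differently. The handling of non-positive and out-of-range positions is an assumption made for illustration.

```python
import re


def emulate_regexp_replace(s, pattern, replacement, position=1):
    """Hypothetical helper: replace matches of `pattern` in `s`,
    only considering text from the 1-based `position` onward."""
    if position < 1:
        # Assumption for this sketch: reject non-positive positions.
        raise ValueError("position must be >= 1")
    start = position - 1
    if start > 0 and start >= len(s):
        # Assumption: a position past the end leaves the input unchanged.
        return s
    # Keep the prefix untouched and replace only within the suffix.
    return s[:start] + re.sub(pattern, replacement, s[start:])


print(emulate_regexp_replace("100-200", r"(\d+)", "--"))     # -----
print(emulate_regexp_replace("100-200", r"(\d+)", "--", 5))  # 100---
```

With the input "100-200", position 5 leaves the prefix "100-" untouched and replaces only "200", which matches the result shown in the new PySpark doctest.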
    
    ### How was this patch tested?
    
    Added test coverage in:
    
    * `StringFunctionsSuite`
    * `FunctionsTests.test_regexp_replace`
    * `SparkConnectFunctionTests.test_string_functions_multi_args`
    
    Verified with:
    
    ```bash
    ./build/sbt "sql-api/compile"
    
    ./build/sbt "sql/Test/compile"
    
    ./build/sbt \
      "sql/testOnly org.apache.spark.sql.StringFunctionsSuite -- -z \"regex_replace / regex_extract\""
    
    python3.11 -m pytest \
      python/pyspark/sql/tests/test_functions.py::FunctionsTests::test_regexp_replace -v
    
    python3.11 -m pytest \
      python/pyspark/sql/tests/connect/test_connect_function.py::SparkConnectFunctionTests::test_string_functions_multi_args -v
    ```
    
    ### Was this patch authored or co-authored using generative AI tooling?
    
    Developed with assistance from GPT-5.
    
    Closes #56240 from pchintar/regexp_replace.
    
    Authored-by: pchintar <[email protected]>
    Signed-off-by: Wenchen Fan <[email protected]>
---
 python/pyspark/sql/connect/functions/builtin.py    | 21 ++++++++++++--
 python/pyspark/sql/functions/builtin.py            | 32 ++++++++++++++++++++--
 .../sql/tests/connect/test_connect_function.py     |  4 +++
 python/pyspark/sql/tests/test_functions.py         |  4 +++
 .../scala/org/apache/spark/sql/functions.scala     | 20 ++++++++++++++
 .../apache/spark/sql/StringFunctionsSuite.scala    |  7 +++--
 6 files changed, 82 insertions(+), 6 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py b/python/pyspark/sql/connect/functions/builtin.py
index fd88faef1047..c1109df1c41a 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -2815,9 +2815,26 @@ regexp_extract_all.__doc__ = pysparkfuncs.regexp_extract_all.__doc__
 
 
 def regexp_replace(
-    string: "ColumnOrName", pattern: Union[str, Column], replacement: Union[str, Column]
+    string: "ColumnOrName",
+    pattern: Union[str, Column],
+    replacement: Union[str, Column],
+    position: Optional[Union[int, Column]] = None,
 ) -> Column:
-    return _invoke_function_over_columns("regexp_replace", string, lit(pattern), lit(replacement))
+    if position is None:
+        return _invoke_function_over_columns(
+            "regexp_replace",
+            string,
+            lit(pattern),
+            lit(replacement),
+        )
+    else:
+        return _invoke_function_over_columns(
+            "regexp_replace",
+            string,
+            lit(pattern),
+            lit(replacement),
+            lit(position),
+        )
 
 
 regexp_replace.__doc__ = pysparkfuncs.regexp_replace.__doc__
diff --git a/python/pyspark/sql/functions/builtin.py b/python/pyspark/sql/functions/builtin.py
index 1bf2754282df..841d422f2026 100644
--- a/python/pyspark/sql/functions/builtin.py
+++ b/python/pyspark/sql/functions/builtin.py
@@ -16350,7 +16350,10 @@ def regexp_extract_all(
 
 @_try_remote_functions
 def regexp_replace(
-    string: "ColumnOrName", pattern: Union[str, Column], replacement: Union[str, Column]
+    string: "ColumnOrName",
+    pattern: Union[str, Column],
+    replacement: Union[str, Column],
+    position: Optional[Union[int, Column]] = None,
 ) -> Column:
     r"""Replace all substrings of the specified string value that match regexp with replacement.
 
@@ -16358,6 +16361,8 @@ def regexp_replace(
 
     .. versionchanged:: 3.4.0
         Supports Spark Connect.
+    .. versionchanged:: 4.3.0
+        Supports the `position` parameter.
 
     Parameters
     ----------
@@ -16367,6 +16372,8 @@ def regexp_replace(
         column object or str containing the regexp pattern
     replacement : :class:`~pyspark.sql.Column` or str
         column object or str containing the replacement
+    position : :class:`~pyspark.sql.Column` or int, optional
+        position to start replacement. The first position is 1.
 
     Returns
     -------
@@ -16404,8 +16411,29 @@ def regexp_replace(
     +-------+-------+-----------+--------------------------------------------+
     |100-200|  (\d+)|         --|                                       -----|
     +-------+-------+-----------+--------------------------------------------+
+
+    Example 3: Replaces substrings starting from the specified position.
+    For the input string "100-200", position 5 starts replacement after "100-".
+
+    >>> df.select(sf.regexp_replace("str", r"(\d+)", "--", 5).alias("d")).show()
+    +------+
+    |     d|
+    +------+
+    |100---|
+    +------+
     """
-    return _invoke_function_over_columns("regexp_replace", string, lit(pattern), lit(replacement))
+    if position is None:
+        return _invoke_function_over_columns(
+            "regexp_replace", string, lit(pattern), lit(replacement)
+        )
+    else:
+        return _invoke_function_over_columns(
+            "regexp_replace",
+            string,
+            lit(pattern),
+            lit(replacement),
+            lit(position),
+        )
 
 
 @_try_remote_functions
diff --git a/python/pyspark/sql/tests/connect/test_connect_function.py b/python/pyspark/sql/tests/connect/test_connect_function.py
index a2344064bb6f..07ebe0b3b8a0 100644
--- a/python/pyspark/sql/tests/connect/test_connect_function.py
+++ b/python/pyspark/sql/tests/connect/test_connect_function.py
@@ -2215,6 +2215,10 @@ class SparkConnectFunctionTests(ReusedMixedTestCase, PandasOnSparkTestUtils):
             cdf.select(CF.regexp_replace(cdf.b, "(a+)(b)?(c)", "--")).toPandas(),
             sdf.select(SF.regexp_replace(sdf.b, "(a+)(b)?(c)", "--")).toPandas(),
         )
+        self.assert_eq(
+            cdf.select(CF.regexp_replace(cdf.b, "(a+)(b)?(c)", "--", 2)).toPandas(),
+            sdf.select(SF.regexp_replace(sdf.b, "(a+)(b)?(c)", "--", 2)).toPandas(),
+        )
         self.assert_eq(
             cdf.select(CF.translate(cdf.b, "abc", "xyz")).toPandas(),
             sdf.select(SF.translate(sdf.b, "abc", "xyz")).toPandas(),
diff --git a/python/pyspark/sql/tests/test_functions.py b/python/pyspark/sql/tests/test_functions.py
index ceba0f03ae25..b26163a667c2 100644
--- a/python/pyspark/sql/tests/test_functions.py
+++ b/python/pyspark/sql/tests/test_functions.py
@@ -3278,11 +3278,15 @@ class FunctionsTestsMixin:
         df = self.spark.createDataFrame(
             [("100-200", r"(\d+)", "--")], ["str", "pattern", "replacement"]
         )
+
         self.assertTrue(
             all(
                 df.select(
                     F.regexp_replace("str", r"(\d+)", "--") == "-----",
                     F.regexp_replace("str", F.col("pattern"), F.col("replacement")) == "-----",
+                    F.regexp_replace("str", r"(\d+)", "--", 5) == "100---",
+                    F.regexp_replace("str", F.col("pattern"), F.col("replacement"), F.lit(5))
+                    == "100---",
                 ).first()
             )
         )
diff --git a/sql/api/src/main/scala/org/apache/spark/sql/functions.scala b/sql/api/src/main/scala/org/apache/spark/sql/functions.scala
index 7afa1cba46eb..7afeedb4439c 100644
--- a/sql/api/src/main/scala/org/apache/spark/sql/functions.scala
+++ b/sql/api/src/main/scala/org/apache/spark/sql/functions.scala
@@ -5392,6 +5392,16 @@ object functions {
   def regexp_replace(e: Column, pattern: String, replacement: String): Column =
     regexp_replace(e, lit(pattern), lit(replacement))
 
+  /**
+   * Replace all substrings of the specified string value that match regexp with rep, starting at
+   * the specified position `pos`.
+   *
+   * @group string_funcs
+   * @since 4.3.0
+   */
+  def regexp_replace(e: Column, pattern: String, replacement: String, pos: Int): Column =
+    regexp_replace(e, lit(pattern), lit(replacement), lit(pos))
+
   /**
    * Replace all substrings of the specified string value that match regexp with rep.
    *
@@ -5401,6 +5411,16 @@ object functions {
   def regexp_replace(e: Column, pattern: Column, replacement: Column): Column =
     Column.fn("regexp_replace", e, pattern, replacement)
 
+  /**
+   * Replace all substrings of the specified string value that match regexp with rep, starting at
+   * the specified position `pos`.
+   *
+   * @group string_funcs
+   * @since 4.3.0
+   */
+  def regexp_replace(e: Column, pattern: Column, replacement: Column, pos: Column): Column =
+    Column.fn("regexp_replace", e, pattern, replacement, pos)
+
   /**
    * Returns the substring that matches the regular expression `regexp` within the string `str`.
    * If the regular expression is not found, the result is null.
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
index 635894728546..b507cb62f9d1 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/StringFunctionsSuite.scala
@@ -188,9 +188,12 @@ class StringFunctionsSuite extends SharedSparkSession {
       df.select(
         regexp_replace($"a", "(\\d+)", "num"),
         regexp_replace($"a", $"b", $"c"),
+        regexp_replace($"a", "(\\d+)", "num", 5),
+        regexp_replace($"a", $"b", $"c", lit(5)),
         regexp_extract($"a", "(\\d+)-(\\d+)", 1)),
-      Row("num-num", "300", "100") :: Row("num-num", "400", "100") ::
-        Row("num-num", "400-400", "100") :: Nil)
+      Row("num-num", "300", "100-num", "100-200", "100") ::
+        Row("num-num", "400", "100-num", "100-200", "100") ::
+        Row("num-num", "400-400", "100-num", "100-400", "100") :: Nil)
 
     // for testing the mutable state of the expression in code gen.
     // This is a hack way to enable the codegen, thus the codegen is enable by default,


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
