Re: [PR] [SPARK-47845][SQL][PYTHON][CONNECT] Support Column type in split function for scala and python [spark]

via GitHub Mon, 15 Apr 2024 19:30:25 -0700


zhengruifeng commented on code in PR #46045:
URL: https://github.com/apache/spark/pull/46045#discussion_r1566629222



##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -10972,6 +10976,11 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
         .. versionchanged:: 3.0
            `split` now takes an optional `limit` field. If not provided, default limit value is -1.
 
+        .. versionchanged:: 4.0.0
+             `pattern` now accepts column. Does not accept column name since string type remain
+             accepted as a regular expression representation, for backwards compatibility.
+             In addition to int, `limit` now accepts column and column name.
+

Review Comment:
   Please add more doctests to the `Examples` section to cover the newly
   supported types.

   Those doctests will automatically be reused by the Spark Connect Python client.



##########
python/pyspark/sql/functions/builtin.py:
##########
@@ -10985,7 +10994,9 @@ def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
     >>> df.select(split(df.s, '[ABC]', -1).alias('s')).collect()
     [Row(s=['one', 'two', 'three', ''])]
     """
-    return _invoke_function("split", _to_java_column(str), pattern, limit)
+    pattern = pattern if isinstance(pattern, Column) else lit(pattern)
+    limit = lit(limit) if isinstance(limit, int) else limit

Review Comment:
   I think the `lit` function accepts both `Column` and `int`.
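
The point above can be illustrated with a minimal sketch that runs without Spark. It assumes that `lit` returns a `Column` argument unchanged and wraps anything else as a literal. The `Column`, `lit`, and `col` definitions here are small stand-ins, not the real PySpark objects, and `normalize_split_args` is a hypothetical helper name that does not appear in the PR.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Column:
    """Stand-in for pyspark.sql.Column (assumption, NOT the real class)."""
    expr: str


def lit(value: Any) -> Column:
    """Stand-in for lit: a Column passes through, anything else becomes a literal."""
    if isinstance(value, Column):
        return value
    return Column(f"lit({value!r})")


def col(name: str) -> Column:
    """Stand-in for col: resolves a column name to a Column."""
    return Column(f"col({name})")


def normalize_split_args(
    pattern: Union[Column, str],
    limit: Union[Column, str, int] = -1,
) -> Tuple[Column, Column]:
    """Hypothetical helper: turn split's pattern/limit arguments into Columns."""
    # lit() handles both accepted pattern types, so no isinstance branch is needed.
    pattern_col = lit(pattern)
    # limit still needs one branch: a str here is a column name, not a literal.
    if isinstance(limit, str):
        limit_col = col(limit)
    else:
        limit_col = lit(limit)  # int or Column
    return pattern_col, limit_col
```

Under these assumptions, the `isinstance` check on `pattern` becomes redundant, while `limit` keeps a branch only if column names are to be accepted, because calling `lit` on a name would produce a string literal rather than a column reference.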



##########
python/pyspark/sql/connect/functions/builtin.py:
##########
@@ -2476,8 +2476,26 @@ def repeat(col: "ColumnOrName", n: Union["ColumnOrName", int]) -> Column:
 repeat.__doc__ = pysparkfuncs.repeat.__doc__
 
 
-def split(str: "ColumnOrName", pattern: str, limit: int = -1) -> Column:
-    return _invoke_function("split", _to_col(str), lit(pattern), lit(limit))
+def split(
+    str: "ColumnOrName",
+    pattern: Union[Column, str],
+    limit: Union["ColumnOrName", int] = -1,
+) -> Column:
+    # work around shadowing of str in the input variable name
+    from builtins import str as py_str
+
+    if isinstance(pattern, py_str):
+        _pattern = lit(pattern)
+    elif isinstance(pattern, Column):
+        _pattern = pattern
+    else:
+        raise PySparkTypeError(
+            error_class="NOT_COLUMN_OR_STR",
+            message_parameters={"arg_name": "pattern", "arg_type": type(pattern).__name__},
+        )
+
+    limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
+    return _invoke_function("split", _to_col(str), _pattern, limit)

Review Comment:
   ```suggestion
       limit = lit(limit) if isinstance(limit, int) else _to_col(limit)
       return _invoke_function("split", _to_col(str), lit(pattern), limit)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

