Daniel Himmelstein created SPARK-33310:
------------------------------------------
Summary: Relax pyspark typing for sql str functions
Key: SPARK-33310
URL: https://issues.apache.org/jira/browse/SPARK-33310
Project: Spark
Issue Type: Wish
Components: PySpark
Affects Versions: 3.1.0
Reporter: Daniel Himmelstein
Fix For: 3.1.0
Several pyspark.sql.functions have overly strict typing, in that the declared type is
more restrictive than the functionality. Specifically, these functions allow
specifying the column to operate on as either a pyspark.sql.Column or a str. This is
handled internally by
[_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
which accepts a Column or a string.
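The dispatch inside that helper looks roughly like the following (a simplified paraphrase for illustration, not the verbatim source):
{code:python}
def _to_java_column(col):
    # A Column already wraps a Java column; a str names one.
    if isinstance(col, Column):
        jcol = col._jc
    elif isinstance(col, str):
        jcol = _create_column_from_name(col)
    else:
        raise TypeError("Invalid argument, not a string or column: %r" % (col,))
    return jcol
{code}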
There is a pre-existing type for this:
[ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
ColumnOrName is used for many of the type definitions of pyspark.sql.functions
arguments, but [not
for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
locate, lpad, rpad, repeat, and split.
{code:python}
def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
def lpad(col: Column, len: int, pad: str) -> Column: ...
def rpad(col: Column, len: int, pad: str) -> Column: ...
def repeat(col: Column, n: int) -> Column: ...
def split(str: Column, pattern: str, limit: int = ...) -> Column: ...
{code}
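For comparison, applying ColumnOrName would give something like the following (a sketch of the intended change; see the PRs linked below for the actual patches):
{code:python}
def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def repeat(col: ColumnOrName, n: int) -> Column: ...
def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...
{code}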
ColumnOrName was not used in these stubs by [~zero323] since Maciej "was concerned that this
might be confusing or ambiguous", because these functions take a column to
operate on as well as strings that are used in the operation.
But I think ColumnOrName makes clear that this variable refers to the column
and not a string parameter. There are also other ways to address any confusion,
such as the docstring or renaming the column argument from str to col.
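For instance, the split stub could signal the intent through both the type and the parameter name (the rename here is illustrative only; renaming a public argument would affect keyword callers):
{code:python}
def split(col: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...
{code}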
Finally, it is considerably more convenient for users not to have to wrap column
names in pyspark.sql.functions.col. Elsewhere the API is fairly consistent
in its willingness to accept a column by name rather than as a Column object (at least
when a string value has no alternative meaning; .when/.otherwise are
exceptions).
For example, we were calling pyspark.sql.functions.split with a string value for the
str argument (specifying which column to split), and I noticed this when we
enforced typing with pyspark-stubs in preparation for pyspark 3.1.
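Concretely (df and its text column are hypothetical here):
{code:python}
from pyspark.sql import functions as F

# Works at runtime (the name is resolved by _to_java_column), but the
# current stub rejects it because the str parameter is typed as Column:
with_parts = df.withColumn("parts", F.split("text", ","))

# The form the current stubs require:
with_parts = df.withColumn("parts", F.split(F.col("text"), ","))
{code}
With ColumnOrName, both calls would type-check.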
Pre-existing PRs to address this:
* https://github.com/apache/spark/pull/30209
* https://github.com/zero323/pyspark-stubs/pull/420