[ https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-33310:
------------------------------------
Assignee: Apache Spark
> Relax pyspark typing for sql str functions
> ------------------------------------------
>
> Key: SPARK-33310
> URL: https://issues.apache.org/jira/browse/SPARK-33310
> Project: Spark
> Issue Type: Wish
> Components: PySpark
> Affects Versions: 3.1.0
> Reporter: Daniel Himmelstein
> Assignee: Apache Spark
> Priority: Minor
> Labels: pyspark.sql.functions, type
> Fix For: 3.1.0
>
>
> Several pyspark.sql.functions have overly strict typing, in that the
> declared type is more restrictive than the functionality. Specifically,
> these functions allow specifying the column to operate on either as a
> pyspark.sql.Column or as a str. This is handled internally by
> [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
> which accepts a Column or a string.
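> A minimal sketch of the runtime behavior (the DataFrame and column name
> are made up for illustration):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([("a,b,c",)], ["csv"])
>
> # Both calls succeed at runtime, because split() passes its first
> # argument through _to_java_column, which accepts a Column or a str:
> df.select(F.split(F.col("csv"), ",")).show()
> df.select(F.split("csv", ",")).show()  # rejected by the current stubs
> {code}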
> There is a pre-existing type alias for this:
> [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
> ColumnOrName is used in many of the argument annotations of
> pyspark.sql.functions, but [not
> for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
> locate, lpad, rpad, repeat, and split:
> {code:python}
> def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
> def lpad(col: Column, len: int, pad: str) -> Column: ...
> def rpad(col: Column, len: int, pad: str) -> Column: ...
> def repeat(col: Column, n: int) -> Column: ...
> def split(str: Column, pattern: str, limit: int = ...) -> Column: ...
> {code}
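> A sketch of what the relaxed stubs could look like, reusing the existing
> alias (this illustrates the idea, not the merged code):
> {code:python}
> # ColumnOrName is Union[Column, str], defined in pyspark.sql._typing
> def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
> def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
> def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
> def repeat(col: ColumnOrName, n: int) -> Column: ...
> def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...
> {code}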
> ColumnOrName was not used here by [~zero323] since Maciej "was concerned
> that this might be confusing or ambiguous": these functions take a column
> to operate on as well as strings that are used in the operation.
> But I think ColumnOrName makes it clear that the argument refers to the
> column and not to a string parameter. There are also other ways to address
> potential confusion, such as clarifying the docstring or renaming the
> column argument from str to col, as sketched below.
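> For instance, a hypothetical rename for split (shown only to illustrate
> the idea; it would change keyword-argument calls like split(str=...)):
> {code:python}
> # Renaming the first parameter from str to col makes it unambiguous
> # which argument names the column to operate on:
> def split(col: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...
> {code}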
> Finally, it is considerably more convenient for users not to have to wrap
> column names in pyspark.sql.functions.col. Elsewhere the API seems pretty
> consistent in its willingness to accept a column by name as well as by
> Column object (at least when there is no alternative meaning for a string
> value; .when/.otherwise are the exception, as shown below).
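> A minimal sketch of that exception (the DataFrame and its columns are
> made up):
> {code:python}
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([(1, "x"), (2, "y")], ["id", "tag"])
>
> # Most functions treat a bare string as a column name:
> df.select(F.upper("tag"))  # equivalent to F.upper(F.col("tag"))
>
> # In .when/.otherwise a bare string is a literal value instead,
> # so referring to a column requires an explicit Column:
> df.select(F.when(df.id > 1, "high").otherwise("low"))      # string literals
> df.select(F.when(df.id > 1, F.col("tag")).otherwise("?"))  # column reference
> {code}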
> For example, we were calling pyspark.sql.functions.split with a string
> value for the str argument (specifying which column to split), and I
> noticed this issue when we enforced typing with pyspark-stubs in
> preparation for pyspark 3.1.
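> Under the strict stubs such a call fails the type check; roughly what mypy
> reports (the column name is made up):
> {code:python}
> from pyspark.sql import functions as F
>
> # Works at runtime, but mypy rejects it under the current stubs with
> # something like: Argument 1 to "split" has incompatible type "str";
> # expected "Column"
> parts = F.split("full_name", " ")
> {code}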
> Pre-existing PRs to address this:
> * https://github.com/apache/spark/pull/30209
> * https://github.com/zero323/pyspark-stubs/pull/420