GitHub user jsnowacki opened a pull request:
https://github.com/apache/spark/pull/19443
[SPARK-22212][SQL][PySpark] Some SQL functions in Python fail with string
column name
## What changes were proposed in this pull request?
The issue in JIRA:
[SPARK-22212](https://issues.apache.org/jira/browse/SPARK-22212)
Most functions in `pyspark.sql.functions` accept either a column-name
string or a `Column` object. Some functions, such as `trim`, accept only a
`Column`. The code below illustrates the problem.
```
>>> import pyspark.sql.functions as func
>>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
>>> df.select(func.trim(df["text"])).show()
+----------+
|trim(text)|
+----------+
|         a|
|         b|
|         c|
|         d|
|         e|
+----------+
>>> df.select(func.trim("text")).show()
[...]
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
py4j.Py4JException: Method trim([class java.lang.String]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
```
This happens because most Python functions convert a column name to a
`Column` in the Python wrapper, but functions created via
`_create_function` pass any argument that is not a `Column` through as is.
The few functions that require a column name have been moved to
`functions_by_column_name` and are created by
`_create_function_by_column_name`.
Note that this is only a Python-side fix. Some Scala functions still have
no overload that accepts a string column name.
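The difference between passing a name through as is and converting it first
can be sketched in plain Python. This is a simplified illustration, not the
actual PySpark source: `Column`, `FakeJvmFunctions`, `_to_column`,
`create_function_pass_through` and `create_function_with_conversion` are
hypothetical stand-ins, and the fake JVM side only models a `trim` that
accepts a `Column`.

```python
class Column:
    """Hypothetical stand-in for pyspark.sql.Column; it only records an expression."""

    def __init__(self, expr):
        self.expr = expr


class FakeJvmFunctions:
    """Hypothetical stand-in for the JVM side, where trim only accepts a Column."""

    @staticmethod
    def trim(arg):
        if not isinstance(arg, Column):
            raise TypeError("trim expects a Column, got %s" % type(arg).__name__)
        return Column("trim(%s)" % arg.expr)


def _to_column(col):
    """Wrap a column-name string in a Column; pass Column objects through."""
    return col if isinstance(col, Column) else Column(col)


def create_function_pass_through(name):
    """Factory that forwards the argument unchanged (the failing behavior)."""

    def _(col):
        return getattr(FakeJvmFunctions, name)(col)

    return _


def create_function_with_conversion(name):
    """Factory that converts a name to a Column before the call."""

    def _(col):
        return getattr(FakeJvmFunctions, name)(_to_column(col))

    return _


trim_pass_through = create_function_pass_through("trim")
trim_converting = create_function_with_conversion("trim")

# Both argument styles work once the name is converted up front.
by_name = trim_converting("text")
by_column = trim_converting(Column("text"))

# The pass-through variant fails for a plain string.
try:
    trim_pass_through("text")
    pass_through_failed = False
except TypeError:
    pass_through_failed = True
```

In this sketch the conversion happens once inside the factory, so every
function generated by it accepts both a name and a `Column` without
per-function changes.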
## How was this patch tested?
Additional Python tests were written to cover this. The change was tested
via `UnitTest` in an IDE and with the overall `python/run-tests` script.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jsnowacki/spark-1 fix_func_str_to_col
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19443.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19443
----
commit c5dbd50361a37e9833708dc8985345fbf537e8d9
Author: Jakub Nowacki <[email protected]>
Date: 2017-10-03T07:50:50Z
[SPARK-22212] Fixing string to column mapping in Python functions
commit 9e52c6380ae8787d20e3442cfaf42cfb70caf4dc
Author: Jakub Nowacki <[email protected]>
Date: 2017-10-06T09:07:26Z
[SPARK-22212] Calling functions by string column name fixed and tested
----