GitHub user HyukjinKwon opened a pull request:
https://github.com/apache/spark/pull/19027
[SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should validate input types for column
## What changes were proposed in this pull request?
While preparing to take over https://github.com/apache/spark/pull/16537, I
realised what I think is a better approach: handle the exception in a single
place.
This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which most
of the functions in `functions.py` and some other APIs use. `_to_java_column`
does not work with types other than `pyspark.sql.column.Column` or strings
(`str` and `unicode`).
If the input is not a `Column`, it calls `_create_column_from_name`, which
calls `functions.col` in the JVM:
https://github.com/apache/spark/blob/42b9eda80e975d970c3e8da4047b318b83dd269f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L76
and `col` only has a `String` overload.
So, these should work:
```python
>>> from pyspark.sql.column import _to_java_column, Column
>>> _to_java_column("a")
JavaObject id=o28
>>> _to_java_column(u"a")
JavaObject id=o29
>>> _to_java_column(spark.range(1).id)
JavaObject id=o33
```
whereas these do not:
```python
>>> _to_java_column(1)
```
```
...
py4j.protocol.Py4JError: An error occurred while calling
z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
...
```
```python
>>> _to_java_column([])
```
```
...
py4j.protocol.Py4JError: An error occurred while calling
z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
...
```
```python
>>> class A(): pass
>>> _to_java_column(A())
```
```
...
AttributeError: 'A' object has no attribute '_get_object_id'
```
This means most functions that use `_to_java_column`, such as `udf` and
`to_json`, and some other APIs throw exceptions like the following:
```python
>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
```
```
...
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
...
```
```python
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
```
```
...
py4j.protocol.Py4JJavaError: An error occurred while calling
z:org.apache.spark.sql.functions.col.
: java.lang.NullPointerException
...
```
**After this PR**:
```python
>>> from pyspark.sql.functions import udf
>>> udf(lambda x: x)(None)
```
```
...
TypeError: Invalid argument, not a string or column: None of type <type
'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map'
functions.
```
```python
>>> from pyspark.sql.functions import to_json
>>> to_json(None)
```
```
...
TypeError: Invalid argument, not a string or column: None of type <type
'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map'
functions.
```
## How was this patch tested?
Unit tests added in `python/pyspark/sql/tests.py` and manual tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HyukjinKwon/spark SPARK-19165
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19027.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19027
----
commit b83063852ded0614a74cc2853af12f61ea00d28c
Author: zero323 <[email protected]>
Date: 2017-06-20T20:42:57Z
Validate types in UserDefinedFunction.__call__
commit d14c2cc9aabfbfa2294f7e4937704fc63717e321
Author: hyukjinkwon <[email protected]>
Date: 2017-08-23T08:38:03Z
Validate column types
----