GitHub user HyukjinKwon opened a pull request:

    https://github.com/apache/spark/pull/19027

    [SPARK-19165][PYTHON][SQL] PySpark APIs using columns as arguments should 
validate input types for column

    ## What changes were proposed in this pull request?
    
    While preparing to take over https://github.com/apache/spark/pull/16537, I 
realised there is (I think) a better approach: handle the exception in one 
place.
    
    This PR proposes to fix `_to_java_column` in `pyspark.sql.column`, which 
most of the functions in `functions.py` and some other APIs use. 
`_to_java_column` does not appear to work with types other than 
`pyspark.sql.column.Column` or string (`str` and `unicode`). 
    
    If the argument is not a `Column`, it calls `_create_column_from_name`, 
which calls `functions.col` within the JVM:
    
    
https://github.com/apache/spark/blob/42b9eda80e975d970c3e8da4047b318b83dd269f/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L76
    
    It looks like `col` is only defined for `String`.
    
    So, these should work:
    
    ```python
    >>> from pyspark.sql.column import _to_java_column, Column
    >>> _to_java_column("a")
    JavaObject id=o28
    >>> _to_java_column(u"a")
    JavaObject id=o29
    >>> _to_java_column(spark.range(1).id)
    JavaObject id=o33
    ```
    
    whereas these do not:
    
    ```python
    >>> _to_java_column(1)
    ```
    ```
    ...
    py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.sql.functions.col. Trace:
    py4j.Py4JException: Method col([class java.lang.Integer]) does not exist
        ...
    ```
    
    ```python
    >>> _to_java_column([])
    ```
    ```
    ...
    py4j.protocol.Py4JError: An error occurred while calling 
z:org.apache.spark.sql.functions.col. Trace:
    py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
        ...
    ```
    
    ```python
    >>> class A(): pass
    >>> _to_java_column(A())
    ```
    ```
    ...
    AttributeError: 'A' object has no attribute '_get_object_id'
    ```
    
    As a result, most functions that use `_to_java_column`, such as `udf` or 
`to_json`, and some other APIs throw exceptions like the ones below:
    
    ```python
    >>> from pyspark.sql.functions import udf
    >>> udf(lambda x: x)(None)
    ```
    
    ```
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.col.
    : java.lang.NullPointerException
        ...
    ```
    
    ```python
    >>> from pyspark.sql.functions import to_json
    >>> to_json(None)
    ```
    
    ```
    ...
    py4j.protocol.Py4JJavaError: An error occurred while calling 
z:org.apache.spark.sql.functions.col.
    : java.lang.NullPointerException
        ...
    ```
    
    **After this PR**:
    
    ```python
    >>> from pyspark.sql.functions import udf
    >>> udf(lambda x: x)(None)
    ```
    
    ```
    ...
    TypeError: Invalid argument, not a string or column: None of type <type 
'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' 
functions.
    ```
    
    ```python
    >>> from pyspark.sql.functions import to_json
    >>> to_json(None)
    ```
    
    ```
    ...
    TypeError: Invalid argument, not a string or column: None of type <type 
'NoneType'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' 
functions.
    ```
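
The single-point check described above can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the actual patch: the `Column` class and `_create_column_from_name` here are stand-ins that fake the JVM call, and `to_java_column_checked` is a hypothetical name. Only the error message wording is taken from the output shown above.

```python
class Column(object):
    """Stand-in for pyspark.sql.column.Column (assumption: wraps a JVM handle)."""

    def __init__(self, jc):
        self._jc = jc


def _create_column_from_name(name):
    """Stand-in for the JVM call to functions.col(name); returns a fake handle."""
    return "JavaObject(col=%s)" % name


def to_java_column_checked(col):
    """Hypothetical validated version: reject unsupported types before the JVM."""
    if isinstance(col, Column):
        return col._jc
    if isinstance(col, str):  # on Python 2 the check would also cover `unicode`
        return _create_column_from_name(col)
    raise TypeError(
        "Invalid argument, not a string or column: "
        "{0} of type {1}. "
        "For column literals, use 'lit', 'array', 'struct' or 'create_map' "
        "functions.".format(col, type(col)))


# The inputs that previously failed inside Py4J now fail with a TypeError.
for bad in (1, [], None):
    try:
        to_java_column_checked(bad)
    except TypeError as e:
        print(e)
```

Because the check sits in one helper, every caller that routes its arguments through it gets the same error message without per-function validation.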
    
    ## How was this patch tested?
    
    Unit tests added in `python/pyspark/sql/tests.py` and manual tests.
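
The added tests are not reproduced in this message. A test in that spirit might look like the sketch below, which is an assumption about their shape rather than the real test code: `checked` is a hypothetical stand-in that accepts only strings, so the example runs without Spark.

```python
import unittest


def checked(col):
    """Hypothetical stand-in for the validated `_to_java_column` (strings only)."""
    if not isinstance(col, str):
        raise TypeError(
            "Invalid argument, not a string or column: "
            "{0} of type {1}.".format(col, type(col)))
    return col


class ColumnArgumentValidationTest(unittest.TestCase):
    def test_rejects_non_column_arguments(self):
        # Each unsupported input should raise TypeError with the new message.
        for bad in (None, 1, []):
            with self.assertRaisesRegex(TypeError, "not a string or column"):
                checked(bad)

    def test_accepts_column_names(self):
        self.assertEqual(checked("a"), "a")
```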

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HyukjinKwon/spark SPARK-19165

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19027.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19027
    
----
commit b83063852ded0614a74cc2853af12f61ea00d28c
Author: zero323 <[email protected]>
Date:   2017-06-20T20:42:57Z

    Validate types in UserDefinedFunction.__call__

commit d14c2cc9aabfbfa2294f7e4937704fc63717e321
Author: hyukjinkwon <[email protected]>
Date:   2017-08-23T08:38:03Z

    Validate column types

----

