GitHub user jsnowacki opened a pull request:

    https://github.com/apache/spark/pull/19443

    [SPARK-22212][SQL][PySpark] Some SQL functions in Python fail with string column name

    ## What changes were proposed in this pull request?
    
    The issue in JIRA: [SPARK-22212](https://issues.apache.org/jira/browse/SPARK-22212)
    
    Most functions in `pyspark.sql.functions` accept either a column name
    string or a `Column` object. Some functions, such as `trim`, accept only a
    `Column`. The code below shows the problem.
    
    ```
    >>> import pyspark.sql.functions as func
    >>> df = spark.createDataFrame([tuple(l) for l in "abcde"], ["text"])
    >>> df.select(func.trim(df["text"])).show()
    +----------+
    |trim(text)|
    +----------+
    |         a|
    |         b|
    |         c|
    |         d|
    |         e|
    +----------+
    >>> df.select(func.trim("text")).show()
    [...]
    Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.trim. Trace:
    py4j.Py4JException: Method trim([class java.lang.String]) does not exist
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
            at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
            at py4j.Gateway.invoke(Gateway.java:274)
            at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
            at py4j.commands.CallCommand.execute(CallCommand.java:79)
            at py4j.GatewayConnection.run(GatewayConnection.java:214)
            at java.lang.Thread.run(Thread.java:748)
    ```
    
    This is because most Python functions convert a column name to a `Column`
    on the Python side, but functions created via `_create_function` pass
    their argument as is if it is not a `Column`. On the other hand, the few
    functions that require the column name have been moved to
    `functions_by_column_name` and are created by
    `_create_function_by_column_name`.
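
The difference between passing the argument as is and converting it first can be sketched in plain Python. This is only an illustration under stated assumptions, not the code from the patch: `Column`, `jvm_trim`, `create_function_passthrough` and `create_function_with_conversion` are hypothetical stand-ins written for this example, and `jvm_trim` merely imitates a JVM method that has a `Column` overload but no `String` overload.

```python
# Illustrative stand-ins only: none of these are the real pyspark or Py4J objects.


class Column:
    """Hypothetical stand-in for pyspark.sql.Column."""

    def __init__(self, name):
        self.name = name


def jvm_trim(arg):
    """Hypothetical stand-in for a JVM function that only has a Column overload."""
    if not isinstance(arg, Column):
        raise TypeError("Method trim([%s]) does not exist" % type(arg).__name__)
    return Column("trim(%s)" % arg.name)


def create_function_passthrough(jvm_func):
    """Imitates the behavior described above: the argument is passed as is."""

    def wrapper(col):
        return jvm_func(col)

    return wrapper


def create_function_with_conversion(jvm_func):
    """Converts a column-name string to a Column before calling the function."""

    def wrapper(col):
        if isinstance(col, str):
            col = Column(col)
        return jvm_func(col)

    return wrapper


trim_old = create_function_passthrough(jvm_trim)
trim_new = create_function_with_conversion(jvm_trim)

# With conversion, both call styles reach the Column overload.
print(trim_new("text").name)
print(trim_new(Column("text")).name)

# Without conversion, the string reaches the stand-in JVM function and is rejected.
try:
    trim_old("text")
except TypeError as exc:
    print("pass-through failed:", exc)
```

Doing the conversion in the Python wrapper keeps the change on the Python side, which matches the scope of this pull request; the stand-in JVM function is left untouched.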
    
    Note that this is only a Python-side fix. Some Scala functions still do
    not have a method that accepts a string column name.
    
    ## How was this patch tested?
    
    Additional Python tests were written to cover this. They were run via
    `UnitTest` in the IDE and with the overall `python/run-tests` script.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jsnowacki/spark-1 fix_func_str_to_col

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19443.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #19443
    
----
commit c5dbd50361a37e9833708dc8985345fbf537e8d9
Author: Jakub Nowacki <j.s.nowa...@gmail.com>
Date:   2017-10-03T07:50:50Z

    [SPARK-22212] Fixing string to column mapping in Python functions

commit 9e52c6380ae8787d20e3442cfaf42cfb70caf4dc
Author: Jakub Nowacki <j.s.nowa...@gmail.com>
Date:   2017-10-06T09:07:26Z

    [SPARK-22212] Calling functions by string column name fixed and tested

----

