[GitHub] spark pull request #16115: [SPARK-18667][PySpark][SQL] Change the way to gro...

viirya Thu, 01 Dec 2016 22:08:28 -0800

GitHub user viirya opened a pull request:

    https://github.com/apache/spark/pull/16115


    [SPARK-18667][PySpark][SQL] Change the way to group row in 
BatchEvalPythonExec so input_file_name function can work with UDF in pyspark

    ## What changes were proposed in this pull request?
    
    `input_file_name` returns an empty string instead of the file name when
    it is used together with a UDF in PySpark. The following example shows
    the problem:
    
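        from pyspark.sql.functions import input_file_name, udf
        from pyspark.sql.types import StringType
    
        # Identity UDF: returns the path it is given.
        def filename(path):
            return path
    
        # Assumes an active SparkSession named `spark`, as in the pyspark shell.
        sourceFile = udf(filename, StringType())
        spark.read.json("tmp.json").select(sourceFile(input_file_name())).show()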
    
        +---------------------------+
        |filename(input_file_name())|
        +---------------------------+
        |                           |
        +---------------------------+    
    
    The cause is the way `BatchEvalPythonExec` groups rows for batched
    processing of Python UDFs. Currently it groups the rows first and then
    evaluates the expressions on them. If the input has fewer rows than a
    full group requires, the iterator is consumed to the end before any
    expression is evaluated. Once the iterator reaches the end, the input
    file name is unset, so the `input_file_name` expression can no longer
    return the correct file name.
    
    This patch changes how the batches are formed: the expressions are
    evaluated first, and the evaluated results are then grouped into batches.
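
    The difference between the two orderings can be illustrated with a small
    plain-Python simulation. This is only a sketch and not Spark code: the
    `FileRows` class, the helper functions, and the batch size of 100 are
    hypothetical names and values chosen for illustration. `FileRows` stands
    in for a file-backed row iterator that unsets its file name once it is
    exhausted.

```python
from itertools import islice


class FileRows:
    """Hypothetical stand-in for a file-backed row iterator.

    It exposes the current input file while rows remain and unsets it
    (empty string) as soon as the underlying rows are exhausted.
    """

    def __init__(self, path, rows):
        self.current_file = path
        self._rows = iter(rows)

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self._rows)
        except StopIteration:
            self.current_file = ""  # file name is unset at end of input
            raise


def batches(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def group_then_evaluate(source, size):
    # Old order: collect a batch of rows first, then evaluate the
    # "input file name" expression for each row in the batch.
    return [[source.current_file for _ in batch]
            for batch in batches(source, size)]


def evaluate_then_group(source, size):
    # New order: evaluate the expression as each row is pulled,
    # then group the evaluated results into batches.
    evaluated = (source.current_file for _ in source)
    return list(batches(evaluated, size))


old = group_then_evaluate(FileRows("tmp.json", [1, 2]), 100)
new = evaluate_then_group(FileRows("tmp.json", [1, 2]), 100)
print(old)
print(new)
```

    With two rows and a batch size of 100, the first ordering exhausts the
    iterator while filling the batch and so evaluates to empty strings, while
    the second ordering sees the file name for both rows.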
    
    ## How was this patch tested?
    
    Added a unit test to PySpark.
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/viirya/spark-1 fix-py-udf-input-filename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16115
    
----
commit 7cd606b6605ac75f311dca2cff988f20ba0ad7a0
Author: Liang-Chi Hsieh <[email protected]>
Date:   2016-12-02T05:50:47Z

    Change the way to group row in BatchEvalPythonExec so udf works with 
input_file_name in pyspark.

----

