[GitHub] spark pull request: [SPARK-12981][SQL] Fix Python UDF extraction f...

xguo27 Mon, 29 Feb 2016 21:45:07 -0800

Github user xguo27 commented on the pull request:

    https://github.com/apache/spark/pull/10935#issuecomment-190554648
  
    Using these two functionally equivalent code snippets:
    
    Scala
    ```
    val data = Seq((1, "1"), (2, "2"), (3, "2"), (1, "3")).toDF("a", "b")
    val my_filter = sqlContext.udf.register("my_filter", (a: Int) => a == 1)
    data.select(col("a")).distinct().filter(my_filter(col("a")))
    ```
    
    Python
    ```
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import BooleanType
    data = sqlContext.createDataFrame([(1, "1"), (2, "2"), (3, "2"), (1, "3")], ["a", "b"])
    my_filter = udf(lambda a: a == 1, BooleanType())
    data.select(col("a")).distinct().filter(my_filter(col("a")))
    ```
    
    The logical plan that comes out of `execute(aggregateCondition)` here is as follows:
    
    
https://github.com/apache/spark/blob/916fc34f98dd731f607d9b3ed657bad6cc30df2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L801
    
    Scala
    ```
    Aggregate [a#8], [UDF(a#8) AS havingCondition#11]
    +- Project [a#8]
       +- Project [_1#6 AS a#8,_2#7 AS b#9]
          +- LocalRelation [_1#6,_2#7], [[1,1],[2,2],[3,2],[1,3]]
    ```
    
    Python
    ```
    Project [havingCondition#2]
    +- Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
       +- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
          +- Project [a#0L]
             +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
    ```
    We can see that in the Python case, an extra Project is injected when
`execute(aggregateCondition)` goes through ExtractPythonUDFs, but
ResolveAggregateFunctions expects an Aggregate here:
    
    
https://github.com/apache/spark/blob/916fc34f98dd731f607d9b3ed657bad6cc30df2c/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L801-L805
    
    
    With this fix, ExtractPythonUDFs no longer wraps the plan in an extra
Project when the plan is an Aggregate. This makes the Python plan consistent
with its Scala counterpart and gives ResolveAggregateFunctions the Aggregate
it expects:
    
    After fix, Python:
    ```
    Aggregate [a#0L], [pythonUDF#3 AS havingCondition#2]
    +- EvaluatePython PythonUDF#<lambda>(a#0L), pythonUDF#3: boolean
       +- Project [a#0L]
          +- LogicalRDD [a#0L,b#1], MapPartitionsRDD[4] at applySchemaToPythonRDD at NativeMethodAccessorImpl.java:-2
    ```
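
To make the mismatch concrete, here is a minimal toy model in plain Python. It is only a sketch: the classes and the functions `extract_udf` and `resolve_having` are hypothetical stand-ins for the plan nodes and rules discussed above, not Spark's Catalyst API, and returning `None` is a simplification of however the real rule fails when the root is not an Aggregate.

```python
# Toy model only: hypothetical stand-ins, not Spark's Catalyst classes.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Plan:
    """Base class for the toy logical plan nodes."""


@dataclass
class Relation(Plan):
    output: List[str]


@dataclass
class Project(Plan):
    project_list: List[str]
    child: Plan


@dataclass
class EvaluatePython(Plan):
    udf: str
    result_attr: str
    child: Plan


@dataclass
class Aggregate(Plan):
    grouping: List[str]
    aggregates: List[str]
    child: Plan


def extract_udf(agg: Aggregate, keep_aggregate_on_top: bool) -> Plan:
    """Move the Python UDF out of the aggregate list into an EvaluatePython child."""
    evaluated = EvaluatePython("PythonUDF#<lambda>(a)", "pythonUDF", agg.child)
    rewritten = Aggregate(agg.grouping, ["pythonUDF AS havingCondition"], evaluated)
    if keep_aggregate_on_top:
        # Shape described for the fix: no extra Project above an Aggregate.
        return rewritten
    # Shape described before the fix: an extra Project sits above the Aggregate.
    return Project(["havingCondition"], rewritten)


def resolve_having(plan: Plan) -> Optional[List[str]]:
    """Mimic a rule that only knows how to handle a top-level Aggregate."""
    if isinstance(plan, Aggregate):
        return plan.aggregates
    return None  # simplification: any other root node is simply not handled


base = Aggregate(
    ["a"],
    ["PythonUDF#<lambda>(a) AS havingCondition"],
    Project(["a"], Relation(["a", "b"])),
)

before_fix = extract_udf(base, keep_aggregate_on_top=False)
after_fix = extract_udf(base, keep_aggregate_on_top=True)

print(type(before_fix).__name__, resolve_having(before_fix))
print(type(after_fix).__name__, resolve_having(after_fix))
```

In this model the consumer only succeeds when the Aggregate stays at the root, which is the shape the "After fix" plan above restores.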


