GitHub user cloud-fan opened a pull request:

    [SPARK-26293][SQL] Cast exception when having python udf in subquery

    ## What changes were proposed in this pull request?
    This is a regression introduced by at Spark 2.4.0.
    When we have Python UDF in subquery, we will hit an exception
    Caused by: java.lang.ClassCastException: 
org.apache.spark.sql.catalyst.expressions.AttributeReference cannot be cast to 
        at org.apache.spark.rdd.RDD.$anonfun$mapPartitions$2(RDD.scala:815)
    ``` turned `ExtractPythonUDFs` from 
a physical rule to optimizer rule. However, there is a difference between a 
physical rule and optimizer rule. A physical rule always runs once, an 
optimizer rule may be applied twice on a query tree even the rule is located in 
a batch that only runs once.
    For a subquery, the `OptimizeSubqueries` rule will execute the entire 
optimizer on the query plan inside subquery. Later on subquery will be turned 
to joins, and the optimizer rules will be applied to it again.
    Unfortunately, the `ExtractPythonUDFs` rule is not idempotent. When it's 
applied twice on a query plan inside subquery, it will produce a malformed 
plan. It extracts Python UDF from Python exec plans.
    This PR proposes 2 changes to be double safe:
    1. `ExtractPythonUDFs` should skip python exec plans, to make the rule 
    2. `ExtractPythonUDFs` should skip subquery
    ## How was this patch tested?
    a new test.

You can merge this pull request into a Git repository by running:

    $ git pull python

Alternatively you can review and apply these changes as the patch at:

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #23248
commit 9477fb09b850b981862cb72b0ebdebc5b404a082
Author: Wenchen Fan <wenchen@...>
Date:   2018-12-06T11:16:04Z

    python udf in subquery



To unsubscribe, e-mail:
For additional commands, e-mail:

Reply via email to