GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/22244
[WIP][SPARK-24721][SPARK-25213][SQL] extract python UDF at the end of
optimizer
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/12127, we moved the
`ExtractPythonUDFs` rule to the physical phase, although there was another
option: running `ExtractPythonUDFs` at the end of the optimizer.
Currently we hit 2 issues when extracting Python UDFs in the physical phase:
1. It happens after the data source v2 strategy, so the data source v2
strategy has to handle Python UDFs carefully and add a project to produce
unsafe rows for the Python UDF. See https://github.com/apache/spark/pull/22206
2. It happens after the file source strategy, so we may keep a Python UDF as a
data filter in `FileSourceScanExec` and fail the planner when trying to extract
it later. See https://github.com/apache/spark/pull/22104
This PR proposes to move `ExtractPythonUDFs` to the end of the optimizer.
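
To illustrate why the ordering matters for issue 2, here is a toy sketch in plain Python. It is not Spark code: `Predicate`, `Scan`, `Filter`, `PythonEval`, `extract_python_udfs`, `plan_file_scan` and `check_no_udf_in_scan` are all hypothetical stand-ins invented for this example, and the real rule and strategies are considerably more involved. The sketch only assumes that a file scan strategy folds the filters directly above a scan into that scan.

```python
from dataclasses import dataclass
from typing import Tuple


# Toy model only: every class and function here is a hypothetical stand-in,
# not one of Spark's real plan nodes, rules, or strategies.
@dataclass(frozen=True)
class Predicate:
    name: str
    is_python_udf: bool = False


@dataclass
class Scan:
    source: str
    pushed_filters: Tuple[Predicate, ...] = ()


@dataclass
class Filter:
    predicates: Tuple[Predicate, ...]
    child: object


@dataclass
class PythonEval:
    udfs: Tuple[Predicate, ...]
    child: object


def extract_python_udfs(plan):
    """Logical rule: move Python UDFs from a top-level Filter into a PythonEval node."""
    if isinstance(plan, Filter):
        udfs = tuple(p for p in plan.predicates if p.is_python_udf)
        rest = tuple(p for p in plan.predicates if not p.is_python_udf)
        if udfs:
            child = Filter(rest, plan.child) if rest else plan.child
            # Once evaluated, a UDF result is just an ordinary column reference.
            refs = tuple(Predicate(p.name + "_result") for p in udfs)
            return Filter(refs, PythonEval(udfs, child))
    return plan


def plan_file_scan(plan):
    """Planner strategy: fold a Filter sitting directly on a Scan into the scan."""
    if isinstance(plan, Filter):
        if isinstance(plan.child, Scan):
            scan = plan.child
            return Scan(scan.source, scan.pushed_filters + plan.predicates)
        return Filter(plan.predicates, plan_file_scan(plan.child))
    if isinstance(plan, PythonEval):
        return PythonEval(plan.udfs, plan_file_scan(plan.child))
    return plan


def check_no_udf_in_scan(plan):
    """Stand-in for late extraction: a UDF buried in scan filters is an error."""
    node = plan
    while not isinstance(node, Scan):
        node = node.child
    if any(p.is_python_udf for p in node.pushed_filters):
        raise RuntimeError("Python UDF left inside scan filters")


query = Filter(
    (Predicate("a > 1"), Predicate("my_udf(b)", is_python_udf=True)),
    Scan("parquet"),
)

# Old order: plan first, so the UDF ends up among the scan's filters.
old = plan_file_scan(query)

# Proposed order: extract as the last optimizer step, then plan.
new = plan_file_scan(extract_python_udfs(query))
```

In this toy model, `check_no_udf_in_scan(old)` raises because the UDF was folded into the scan, while `check_no_udf_in_scan(new)` passes because only `a > 1` reaches the scan and the UDF is evaluated in a separate node above it.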
## How was this patch tested?
TODO
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/cloud-fan/spark python
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/22244.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #22244
----
commit f0e547c971f854b8a238baaebff8103036567223
Author: Wenchen Fan <wenchen@...>
Date: 2018-08-27T15:40:18Z
extract python UDF at the end of optimizer
----