[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

Davies Liu (JIRA) Tue, 08 Sep 2015 14:27:14 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-8632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14735644#comment-14735644
 ]


Davies Liu commented on SPARK-8632:
-----------------------------------

The upstream means child of current SparkPlan, could have other Python UDFs. 

We remove the RDD cache in 1.4, then the upstream will be evaluated twice. If 
you have multiple Python UDFs, for example three, it will end up evaluate the 
child 8 times (2 x 2 x 2), which will be really slow or cause OOM.

In synchronous batch mode, what's the batch size? if it's small, the overhead 
of each batch will be high, if it's too large, it's easy to OOM if you have 
many columns. Also we need to copy the rows (serialization is not need if it's 
UnsafeRow).



> Poor Python UDF performance because of RDD caching
> --------------------------------------------------
>
>                 Key: SPARK-8632
>                 URL: https://issues.apache.org/jira/browse/SPARK-8632
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark, SQL
>    Affects Versions: 1.4.0
>            Reporter: Justin Uang
>            Assignee: Davies Liu
>
> {quote}
> We have been running into performance problems using Python UDFs with 
> DataFrames at large scale.
> From the implementation of BatchPythonEvaluation, it looks like the goal was 
> to reuse the PythonRDD code. It caches the entire child RDD so that it can do 
> two passes over the data. One to give to the PythonRDD, then one to join the 
> python lambda results with the original row (which may have java objects that 
> should be passed through).
> In addition, it caches all the columns, even the ones that don't need to be 
> processed by the Python UDF. In the cases I was working with, I had a 500 
> column table, and i wanted to use a python UDF for one column, and it ended 
> up caching all 500 columns. 
> {quote}
> http://apache-spark-developers-list.1001551.n3.nabble.com/Python-UDF-performance-at-large-scale-td12843.html



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Commented] (SPARK-8632) Poor Python UDF performance because of RDD caching

Reply via email to