[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097723#comment-15097723
 ] 

Sun Rui commented on SPARK-6817:
--------------------------------

OK. I will follow the original design doc, although I think the term UDF is a 
little bit confusing here.

For dapply(), basically it sounds like passing an R function into 
DataFrame.mapPartitions(). The R function takes a local data.frame as its input 
parameter; that data.frame is converted from one partition of the DataFrame.
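
To make the dapply() idea concrete, here is a rough sketch in plain Python. It 
is not SparkR or Spark code: a list of dicts stands in for the local data.frame 
built from one partition, and the names dapply_sketch and add_doubled are 
hypothetical, made up for illustration. It only shows the shape of the call: 
the user function runs once per partition and sees that partition as a whole.

```python
from typing import Callable, Dict, List

# Stand-in for the local data.frame converted from one partition (assumption).
Row = Dict[str, int]
Partition = List[Row]


def dapply_sketch(partitions: List[Partition],
                  func: Callable[[Partition], Partition]) -> List[Partition]:
    """Call func once per partition, the way a mapPartitions-style API would."""
    return [func(part) for part in partitions]


def add_doubled(part: Partition) -> Partition:
    # Hypothetical user function: receives the whole partition at once
    # and returns a new partition with an extra column.
    return [{**row, "doubled": row["x"] * 2} for row in part]


# Two partitions of a toy data set.
partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
result = dapply_sketch(partitions, add_doubled)
```

The output keeps the partition boundaries of the input, which is the main 
difference from a row-at-a-time UDF.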

For gapply(), it makes less sense to depend on UDAF, because 1. a UDAF returns 
a single value, and 2. a UDAF is invoked once per row, which is not efficient. 
Basically, gapply() would convert the DataFrame to an RDD, call 
RDD.groupBy(), and then feed the grouped values into the R worker.
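
The group-then-apply flow for gapply() can be sketched the same way in plain 
Python. Again, this is not the Spark API: gapply_sketch, summarize and the toy 
rows are hypothetical names, and an in-memory dict stands in for what 
RDD.groupBy() would do across the cluster. The point is that the user function 
receives all rows of a group at once and may return any number of rows, which 
a single-value UDAF cannot do.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List

Row = Dict[str, object]


def gapply_sketch(rows: List[Row],
                  key: Callable[[Row], Hashable],
                  func: Callable[[Hashable, List[Row]], List[Row]]) -> List[Row]:
    """Group rows by key, then call func once per group with all its rows."""
    groups: Dict[Hashable, List[Row]] = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)

    out: List[Row] = []
    for group_key, grouped_rows in groups.items():
        # func sees the whole group and may return zero or more rows.
        out.extend(func(group_key, grouped_rows))
    return out


def summarize(group_key: Hashable, grouped_rows: List[Row]) -> List[Row]:
    # Hypothetical user function: returns one row with several columns.
    values = [r["v"] for r in grouped_rows]
    return [{"k": group_key, "n": len(values), "total": sum(values)}]


rows = [{"k": "a", "v": 1}, {"k": "b", "v": 10}, {"k": "a", "v": 2}]
summary = gapply_sketch(rows, lambda r: r["k"], summarize)
```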

cc [~shivaram], any comments?

> DataFrame UDFs in R
> -------------------
>
>                 Key: SPARK-6817
>                 URL: https://issues.apache.org/jira/browse/SPARK-6817
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR, SQL
>            Reporter: Shivaram Venkataraman
>         Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



