[
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097723#comment-15097723
]
Sun Rui commented on SPARK-6817:
--------------------------------
OK. I will follow the original design doc, although I think the term "UDF" is
a little confusing here.
For dapply(), it basically amounts to passing an R function into
DataFrame.mapPartitions(). The R function takes a local data.frame as its
input parameter, which is converted from a partition of the DataFrame.
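A minimal sketch of the dapply() idea, written in plain Python rather than
against the Spark or SparkR API. Partitions are modeled as lists of dict rows,
and the names dapply_sketch and add_double are hypothetical, chosen only for
illustration. The point it shows is that the user function runs once per
partition and receives that whole partition as local data.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Partition = List[Row]


def dapply_sketch(partitions: List[Partition],
                  func: Callable[[Partition], Partition]) -> List[Partition]:
    """Call func once per partition, passing the partition as local data.

    This only models the mapPartitions-style behavior described above;
    it is not how Spark itself is implemented.
    """
    return [func(list(part)) for part in partitions]


def add_double(local_df: Partition) -> Partition:
    # Hypothetical user function: sees every row of one partition at once.
    return [{**row, "doubled": row["x"] * 2} for row in local_df]


# Two partitions of a toy "DataFrame" with a single column x.
partitions = [[{"x": 1}, {"x": 2}], [{"x": 3}]]
result = dapply_sketch(partitions, add_double)
```

Each partition is transformed independently, so the output keeps the same
partitioning as the input.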
For gapply(), it makes less sense to depend on UDAF, because (1) a UDAF returns
a single value, and (2) a UDAF processes one row at a time, which is not
efficient. Instead, gapply() can basically convert the DataFrame to an RDD,
call RDD.groupBy(), and then feed the grouped values into the R worker.
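A minimal sketch of the gapply() idea, again in plain Python and not against
the Spark or SparkR API. The names gapply_sketch and top_two are hypothetical.
It groups rows by a key and hands each complete group to the user function in
one call, and the function may return any number of rows, which is the
contrast with a UDAF that returns a single value.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]


def gapply_sketch(rows: List[Row], key: str,
                  func: Callable[[object, List[Row]], List[Row]]) -> List[Row]:
    """Group rows by key, then call func once per group with all its rows."""
    groups: Dict[object, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)  # models the groupBy step

    out: List[Row] = []
    for group_key, group_rows in groups.items():
        # The whole group is passed in one call, not row by row.
        out.extend(func(group_key, group_rows))
    return out


def top_two(group_key: object, group_rows: List[Row]) -> List[Row]:
    # Hypothetical user function: returns up to two rows per group.
    ordered = sorted(group_rows, key=lambda r: r["score"], reverse=True)
    return ordered[:2]


rows = [
    {"team": "a", "score": 3},
    {"team": "b", "score": 5},
    {"team": "a", "score": 9},
    {"team": "a", "score": 1},
]
grouped_result = gapply_sketch(rows, "team", top_two)
```

Here group "a" yields two rows and group "b" yields one, which a
single-value aggregate could not express.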
cc [~shivaram], any comments?
> DataFrame UDFs in R
> -------------------
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
> Issue Type: New Feature
> Components: SparkR, SQL
> Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, so it should be done
> after merging into Spark.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)