[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15097458#comment-15097458
 ] 

Sun Rui commented on SPARK-6817:
--------------------------------

Projecting and batching rows for UDFs are implementation optimizations that 
save communication cost between the JVM and a non-JVM interpreter such as 
Python or R, so there is no documentation for them. Conceptually, a UDF is 
passed one row at a time. For R, however, which typically operates on vectors, 
it is feasible to transform a batch of rows into columns and pass the column 
vectors to the R UDF as arguments. This would need a clear statement that an 
R UDF is expected to handle vector arguments built from a batch of rows. The 
output of the R UDF is then also a vector, which can be passed back to the JVM 
as the result. In this way, the UDF is actually called once per batch of rows.
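
The difference between the two calling conventions can be sketched in plain 
Python. This is only an illustration of the idea, not Spark or SparkR code: 
the helper names (rows_to_columns, apply_row_udf, apply_batched_udf) and the 
dict-based row representation are assumptions made up for this example.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical row representation: one dict per row, keyed by column name.
Row = Dict[str, float]


def rows_to_columns(rows: Sequence[Row], names: Sequence[str]) -> Dict[str, List[float]]:
    """Transpose a batch of rows into one list (column vector) per column."""
    return {name: [row[name] for row in rows] for name in names}


def apply_row_udf(udf: Callable[..., float], rows: Sequence[Row],
                  names: Sequence[str]) -> List[float]:
    """Conceptual model: the UDF is called once per row with scalar arguments."""
    return [udf(*(row[name] for name in names)) for row in rows]


def apply_batched_udf(udf: Callable[..., Sequence[float]], rows: Sequence[Row],
                      names: Sequence[str]) -> List[float]:
    """Batched model: the UDF is called once with one column vector per argument."""
    columns = rows_to_columns(rows, names)
    result = udf(*(columns[name] for name in names))
    # The vectorized UDF must return exactly one output value per input row.
    if len(result) != len(rows):
        raise ValueError("vectorized UDF must return one value per input row")
    return list(result)


def add_scalar(x: float, y: float) -> float:
    """Row-at-a-time UDF: receives scalars."""
    return x + y


def add_vector(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Vectorized UDF: receives whole columns and returns a column."""
    return [x + y for x, y in zip(xs, ys)]


rows = [{"x": 1.0, "y": 10.0}, {"x": 2.0, "y": 20.0}, {"x": 3.0, "y": 30.0}]
row_result = apply_row_udf(add_scalar, rows, ["x", "y"])        # three UDF calls
batch_result = apply_batched_udf(add_vector, rows, ["x", "y"])  # one UDF call
```

Both paths produce the same values. The batched path crosses the interpreter 
boundary once per batch instead of once per row, but it only works if the UDF 
is written to accept and return vectors.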

For a UDF, you don't need to care about the last row; each row is processed 
independently.

> DataFrame UDFs in R
> -------------------
>
>                 Key: SPARK-6817
>                 URL: https://issues.apache.org/jira/browse/SPARK-6817
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR, SQL
>            Reporter: Shivaram Venkataraman
>         Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interfaces of Spark SQL, so it should be done 
> after merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
