[ 
https://issues.apache.org/jira/browse/SPARK-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249673#comment-15249673
 ] 

Sun Rui commented on SPARK-14746:
---------------------------------

[~rxin] 
some links for discussions on calling R code from Scala:
Run External R script from Spark: 
https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/%3ccag06he009b4tonqd-rtkhlspiojchgpe2ularb09jhe55xn...@mail.gmail.com%3E
Running synchronized JRI code: 
https://www.mail-archive.com/[email protected]/msg45753.html

We also have a customer with a similar requirement: their applications are 
written in Scala/Java, but they sometimes need to call R statistical functions 
in transformations.

There is a similar requirement for calling Python code from Scala; one example 
is: https://www.mail-archive.com/[email protected]/msg49653.html

The limitations of pipe():
1. Only RDD has pipe(); DataFrame does not.
2. pipe() uses text-based communication between the JVM and external 
processes. Users have to manually serialize the data into text on the JVM side 
(via the printRDDElement function passed to pipe()) and deserialize it in the 
external process. This is difficult to use and raises performance concerns 
compared to binary communication.
3. Users have to write separate code in the target language for the external 
processes. If we support this proposal, users can instead embed the code (for 
example, R or Python code) for the external processes in the Scala program, 
which is easier to maintain and improves readability.
4. It is hard to debug when an external process launched by pipe() fails, 
since no detailed error message is available. With this proposal, error 
messages can be caught, which eases debugging.
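To make limitations 2 and 3 concrete, here is a minimal sketch of what pipe()-based R integration looks like today. It uses the existing RDD.pipe() API; the "score.R" script name and the ad-hoc CSV encoding are hypothetical, and a matching R script parsing that text would have to be written and shipped separately:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("pipe-sketch").setMaster("local[2]"))
    val rdd = sc.parallelize(Seq((1, 2.0), (2, 3.5)))

    // Limitation 2: the caller must hand-roll a text encoding per record.
    // printRDDElement receives each element plus a callback that writes one
    // line to the external process's stdin.
    val piped = rdd.pipe(
      command = Seq("Rscript", "score.R"),   // score.R: hypothetical R script
      env = Map.empty[String, String],
      printPipeContext = null,
      printRDDElement = (elem: (Int, Double), printLine: String => Unit) =>
        printLine(s"${elem._1},${elem._2}"), // manual serialization to CSV text
      separateWorkingDir = false)

    // Limitation 3 (in reverse): stdout of the external process comes back as
    // raw strings that the caller must deserialize again by hand.
    piped.collect().foreach(println)
    sc.stop()
  }
}
```

If score.R exits non-zero, the job fails with little more than the exit code (limitation 4); nothing in this path works for a DataFrame (limitation 1).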


> Support transformations in R source code for Dataset/DataFrame
> --------------------------------------------------------------
>
>                 Key: SPARK-14746
>                 URL: https://issues.apache.org/jira/browse/SPARK-14746
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR, SQL
>            Reporter: Sun Rui
>
> There is a desired scenario, mentioned several times on the Spark mailing 
> list, where users writing Scala/Java Spark applications (not SparkR) want 
> to use R functions in some transformations. Typically this can be achieved 
> by calling pipe() on an RDD. However, pipe() has limitations, so we could 
> support applying an R function, in source code form, to a 
> Dataset/DataFrame. (Thus SparkR would not be needed for serializing an R 
> function.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
