[
https://issues.apache.org/jira/browse/SPARK-14746?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15249673#comment-15249673
]
Sun Rui commented on SPARK-14746:
---------------------------------
[~rxin]
some links for discussions on calling R code from Scala:
Run External R script from Spark:
https://mail-archives.apache.org/mod_mbox/spark-user/201603.mbox/%3ccag06he009b4tonqd-rtkhlspiojchgpe2ularb09jhe55xn...@mail.gmail.com%3E
Running synchronized JRI code:
https://www.mail-archive.com/[email protected]/msg45753.html
We also have a customer with a similar requirement: their applications are
written in Scala/Java, but sometimes need to call R statistical functions in
transformations.
There are similar requirements for calling Python code from Scala; one example
is: https://www.mail-archive.com/[email protected]/msg49653.html
The limitations of pipe():
1. Only RDD has pipe(). DataFrame does not.
2. pipe() uses text-based communication between the JVM and external processes.
Users have to manually serialize the data into text on the JVM side (via the
printRDDElement function passed as a parameter to pipe()) and deserialize the
data in the external process. This is difficult to use and raises performance
concerns compared to binary communication.
3. Users have to write separate code in the target language for the external
processes. If this proposal is supported, users can instead embed the code (for
example, R or Python code) for the external processes in the Scala program,
which is easier to maintain and improves readability.
4. It is hard to debug when external processes launched by pipe() fail, because
there is no detailed error message. With this proposal, error messages can be
caught, which eases debugging.
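To illustrate limitation 2, here is a minimal sketch of the text-based round
trip that pipe() imposes, written in plain Java (no Spark). PipeSketch and
pipeThrough are hypothetical names, not a Spark API, and the external command
"sort -n" stands in for an R script reading from stdin:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Hypothetical helper (not a Spark API): mimics what RDD.pipe() forces the
// user to do by hand -- serialize each element to a line of text, feed the
// lines to an external process, and parse the text the process prints back.
public class PipeSketch {
    static List<Double> pipeThrough(List<String> cmd, List<Double> elements)
            throws Exception {
        Process p = new ProcessBuilder(cmd).start();
        // Manual text serialization on the JVM side (one element per line).
        try (Writer w = new OutputStreamWriter(p.getOutputStream(),
                                               StandardCharsets.UTF_8)) {
            for (double d : elements) w.write(d + "\n");
        }
        // Manual deserialization of the external process's text output.
        List<Double> result = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                p.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null)
                result.add(Double.parseDouble(line));
        }
        p.waitFor();
        return result;
    }

    public static void main(String[] args) throws Exception {
        // "sort -n" stands in for an external R script reading from stdin.
        System.out.println(pipeThrough(Arrays.asList("sort", "-n"),
                                       Arrays.asList(3.5, 1.2, 2.8)));
    }
}
```

Note that both sides must agree on the line-oriented text format, and every
element is converted to a string and parsed back; a binary protocol between the
JVM and the external process would avoid both costs.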
> Support transformations in R source code for Dataset/DataFrame
> --------------------------------------------------------------
>
> Key: SPARK-14746
> URL: https://issues.apache.org/jira/browse/SPARK-14746
> Project: Spark
> Issue Type: New Feature
> Components: SparkR, SQL
> Reporter: Sun Rui
>
> There is a scenario, mentioned several times on the Spark mailing list, where
> users writing Scala/Java Spark applications (not SparkR) want to use R
> functions in some transformations. Typically this can be achieved by calling
> pipe() on an RDD. However, pipe() has limitations. So we can support applying
> an R function in source code format to a Dataset/DataFrame (thus SparkR is
> not needed for serializing an R function).