[ https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107860#comment-15107860 ]
Shivaram Venkataraman commented on SPARK-12635:
-----------------------------------------------
Just to clarify a couple of things - we should probably move this out to a new
JIRA issue.
- The main purpose of creating the SerDe library in SparkR was to enable
inter-process communication (IPC) between R and the JVM that is flexible, works
on multiple platforms, and does not require many dependencies. By IPC, I mean
the ability to call methods on the JVM from R. The reason for implementing this
in Spark was that we needed the flexibility for either R or the JVM to come up
first (as opposed to an embedded JVM), and also to make installing / deploying
Spark easier.
- Using the same SerDe mechanism for collect is just a natural extension, and
since Spark is primarily tuned for distributed operations we haven't profiled /
benchmarked the collect performance so far. So your benchmarks are very useful
and provide a baseline that we can improve on.
- In terms of future improvements I see two things: (a) better benchmarks and
profiling of the serialization costs -- we will also need this for the UDF
work, since it similarly transfers data between the JVM and R; (b) designing or
adopting a faster serialization format for batch transfers such as collect and
UDFs.
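To make the IPC style discussed above concrete, here is a minimal sketch in
Python of a type-tagged, length-prefixed stream that either side of a
connection can parse independently. The tags and function names here are
hypothetical illustrations, not SparkR's actual wire protocol:

```python
import struct

# Hypothetical sketch (NOT SparkR's real format): each value is written
# as a one-byte type tag followed by a fixed-size or length-prefixed
# payload, so R and the JVM could each decode the stream on their own.

def write_int(buf, value):
    buf.extend(b"i")                      # type tag for a 32-bit int
    buf.extend(struct.pack(">i", value))  # big-endian payload

def write_string(buf, value):
    data = value.encode("utf-8")
    buf.extend(b"c")                      # type tag for a string
    buf.extend(struct.pack(">i", len(data)))  # length prefix
    buf.extend(data)

def read_value(buf, pos):
    """Decode one tagged value starting at pos; return (value, new_pos)."""
    tag = buf[pos:pos + 1]
    pos += 1
    if tag == b"i":
        (value,) = struct.unpack(">i", buf[pos:pos + 4])
        return value, pos + 4
    if tag == b"c":
        (n,) = struct.unpack(">i", buf[pos:pos + 4])
        pos += 4
        return buf[pos:pos + n].decode("utf-8"), pos + n
    raise ValueError("unknown type tag %r" % tag)

buf = bytearray()
write_int(buf, 42)
write_string(buf, "collect")
v1, pos = read_value(buf, 0)
v2, pos = read_value(buf, pos)
```

The per-value tagging is what makes this flexible across platforms, but it is
also why per-element transfer is slow for large results, which motivates the
batch formats discussed in (b).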
> More efficient (column batch) serialization for Python/R
> --------------------------------------------------------
>
> Key: SPARK-12635
> URL: https://issues.apache.org/jira/browse/SPARK-12635
> Project: Spark
> Issue Type: New Feature
> Components: PySpark, SparkR, SQL
> Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both
> work pretty well with a column batch interface (e.g. numpy arrays). Technically
> we should be able to just pass column batches around with minimal
> serialization (maybe even zero-copy memory).
> Note that this depends on some internal refactoring to use a column batch
> interface in Spark SQL.
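The column-batch idea in the quoted issue can be sketched in Python with
numpy: rather than tagging and copying each element, the whole column is
shipped as one raw buffer and reconstructed as a view on the receiving side.
This is only an illustration of the principle, not Spark's implementation:

```python
import numpy as np

# A column of one million doubles.
col = np.arange(1_000_000, dtype=np.float64)

# "Serialize": grab the column's raw bytes in one memcpy, instead of
# encoding a million values one at a time.
payload = col.tobytes()

# "Deserialize": rebuild the column as a zero-copy view over the payload
# buffer; no per-element work and no second copy of the data.
restored = np.frombuffer(payload, dtype=np.float64)
```

The receiving side pays essentially nothing per element; `np.frombuffer` just
wraps the existing buffer, which is why a columnar wire format can approach
zero-copy transfer.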
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)