[ 
https://issues.apache.org/jira/browse/SPARK-12635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15107860#comment-15107860
 ] 

Shivaram Venkataraman commented on SPARK-12635:
-----------------------------------------------

Just to clarify a couple of things - we should probably move this out to a new 
JIRA issue.
- The main purpose of creating the SerDe library in SparkR was to enable 
inter-process communication (IPC) between R and the JVM that is flexible, works 
on multiple platforms, and works without needing too many dependencies. By IPC, 
I mean having the ability to call methods on the JVM from R. The reason for 
implementing this in Spark was that we need the flexibility for either R or the 
JVM to come up first (as opposed to an embedded JVM), and also to make 
installing / deploying Spark easier.
- Using the same SerDe mechanism for collect was just a natural extension, and 
since Spark is primarily tuned for distributed operations we haven't profiled / 
benchmarked collect performance so far. So your benchmarks are very useful and 
provide a baseline that we can improve on.
- In terms of future improvements I see two things: (a) better benchmarks and 
profiling of the serialization costs -- we will also need this for the UDF work, 
since it similarly transfers data between the JVM and R; and (b) designing or 
adopting a faster serialization format for batch transfers like collect and 
UDFs.
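The batch-transfer idea in (b) can be illustrated with a minimal sketch. This is not Spark's SerDe code; it just uses pickle and numpy as stand-ins to show why shipping one contiguous buffer per column carries much less framing overhead than serializing row by row (all names below are illustrative):

```python
# Hedged sketch: per-row serialization vs. a single column-batch transfer.
# pickle stands in for a row-oriented SerDe; numpy buffers stand in for a
# column batch (e.g. what a zero-copy transfer could hand to R or Python).
import pickle
import numpy as np

rows = [(i, float(i) * 0.5) for i in range(10_000)]

# Row-at-a-time: one serialized payload per row, each with its own framing.
row_payloads = [pickle.dumps(r) for r in rows]
row_bytes = sum(len(p) for p in row_payloads)

# Column batch: transpose once, then ship each column as one contiguous buffer.
ids = np.array([r[0] for r in rows], dtype=np.int64)
vals = np.array([r[1] for r in rows], dtype=np.float64)
batch_bytes = ids.nbytes + vals.nbytes

print(row_bytes, batch_bytes)  # the batch form is substantially smaller
```

The same shape of argument applies to CPU cost: the batch form deserializes with a handful of buffer reads instead of one object decode per row.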

> More efficient (column batch) serialization for Python/R
> --------------------------------------------------------
>
>                 Key: SPARK-12635
>                 URL: https://issues.apache.org/jira/browse/SPARK-12635
>             Project: Spark
>          Issue Type: New Feature
>          Components: PySpark, SparkR, SQL
>            Reporter: Reynold Xin
>
> Serialization between Scala / Python / R is pretty slow. Python and R both 
> work pretty well with a column batch interface (e.g. numpy arrays). Technically 
> we should be able to just pass column batches around with minimal 
> serialization (maybe even zero-copy memory).
> Note that this depends on some internal refactoring to use a column batch 
> interface in Spark SQL.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
