[jira] [Commented] (SPARK-19489) Stable serialization format for external & native code integration

Wes McKinney (JIRA) Wed, 08 Feb 2017 13:45:19 -0800

    [ https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858577#comment-15858577 ]


Wes McKinney commented on SPARK-19489:
--------------------------------------

I'm really glad to see this is becoming a priority in 2017. 

In the same way that Google internally standardized on the record-oriented 
"RecordIO" and column-oriented "ColumnIO" serialization formats (and protocol 
buffers for anything not fitting those models), it may make sense for a binary 
protocol to support both orientations so that it can serve different kinds of 
native code plugins. Spark SQL is internally record-oriented (but maybe 
in-memory columnar someday?), but native code plugins may be column-oriented. 
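
To make the two orientations concrete, here is a minimal sketch in Python that 
encodes the same rows both ways. The field layout (a 64-bit integer id and a 
64-bit float score), the sample data, and all function names are assumptions 
made up for illustration; this is not the RecordIO, ColumnIO, Arrow, or Spark 
wire format.

```python
import struct

# Hypothetical sample data: (id, score) rows. Not taken from any real system.
rows = [(1, 0.5), (2, 1.5), (3, 2.5)]


def encode_records(rows):
    """Record-oriented: the fields of each row are stored next to each other."""
    return b"".join(struct.pack("<qd", i, s) for i, s in rows)


def decode_records(buf):
    """Walk the buffer one fixed-size (id, score) record at a time."""
    return list(struct.iter_unpack("<qd", buf))


def encode_columns(rows):
    """Column-oriented: a row count, then all ids, then all scores."""
    n = len(rows)
    ids = struct.pack(f"<{n}q", *(r[0] for r in rows))
    scores = struct.pack(f"<{n}d", *(r[1] for r in rows))
    return struct.pack("<q", n) + ids + scores


def decode_columns(buf):
    """Read each column as one contiguous run, then zip back into rows."""
    (n,) = struct.unpack_from("<q", buf, 0)
    ids = struct.unpack_from(f"<{n}q", buf, 8)
    scores = struct.unpack_from(f"<{n}d", buf, 8 + 8 * n)
    return list(zip(ids, scores))


def read_score_column(buf):
    """Read only the score column, skipping over the id column entirely."""
    (n,) = struct.unpack_from("<q", buf, 0)
    return list(struct.unpack_from(f"<{n}d", buf, 8 + 8 * n))


record_bytes = encode_records(rows)
column_bytes = encode_columns(rows)
assert decode_records(record_bytes) == rows
assert decode_columns(column_bytes) == rows
```

In the column-oriented buffer a consumer that needs only one field can jump 
straight to that column, as read_score_column does, while the record-oriented 
buffer has to be walked row by row. That difference is one reason a 
column-oriented plugin may prefer to receive column batches directly.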

I've been helping lead efforts in Apache Arrow to define a stable, 
lightweight, zero-copy, column-oriented binary format for Python, R, and Java 
applications -- some of the results on integration with pandas and Parquet are 
encouraging:

* http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
* http://wesmckinney.com/blog/arrow-streaming-columnar/
* http://wesmckinney.com/blog/python-parquet-update/

The initial Spark-Arrow work in SPARK-13534 is also promising, but having these 
kinds of fast IPC tools more deeply integrated into the Spark SQL execution 
engine (especially being able to collect results from task executors as 
serialized column batches) would unlock significantly higher performance. 

I'll be interested to learn more about the broader requirements of external 
serialization formats and the different types of use cases. 

> Stable serialization format for external & native code integration
> ------------------------------------------------------------------
>
>                 Key: SPARK-19489
>                 URL: https://issues.apache.org/jira/browse/SPARK-19489
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.0
>            Reporter: Reynold Xin
>
> As a Spark user, I want access to a (semi) stable serialization format that 
> is high performance so I can integrate Spark with my application written in 
> native code (C, C++, Rust, etc).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

