[
https://issues.apache.org/jira/browse/SPARK-19489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858577#comment-15858577
]
Wes McKinney commented on SPARK-19489:
--------------------------------------
I'm really glad to see this is becoming a priority in 2017.
In the same way that Google internally standardized on "RecordIO" and
"ColumnIO" record-oriented and column-oriented serialization formats (and
protocol buffers for anything not fitting those models), it may make sense to
support both orientation-styles in a binary protocol to support different kinds
of native code plugins. Spark SQL is internally record-oriented (but maybe
in-memory columnar someday?), but native code plugins may be column-oriented.
I've been helping lead efforts in Apache Arrow to have a stable
lightweight/zero-copy column-oriented binary format for Python, R, and Java
applications -- some of the results on integration with pandas and Parquet are
encouraging:
* http://wesmckinney.com/blog/high-perf-arrow-to-pandas/
* http://wesmckinney.com/blog/arrow-streaming-columnar/
* http://wesmckinney.com/blog/python-parquet-update/
The initial Spark-Arrow work in SPARK-13534 is also promising, but having these
kinds of fast IPC tools more deeply integrated into the Spark SQL execution
engine (especially being able to collect results from task executors as
serialized column batches) would unlock significantly higher performance.
I'll be interested to learn more about the broader requirements of external
serialization formats and the different types of use cases.
> Stable serialization format for external & native code integration
> ------------------------------------------------------------------
>
> Key: SPARK-19489
> URL: https://issues.apache.org/jira/browse/SPARK-19489
> Project: Spark
> Issue Type: New Feature
> Components: Spark Core, SQL
> Affects Versions: 2.2.0
> Reporter: Reynold Xin
>
> As a Spark user, I want access to a (semi) stable serialization format that
> is high performance so I can integrate Spark with my application written in
> native code (C, C++, Rust, etc).
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]