Following up on the discussion in
https://issues.apache.org/jira/browse/SPARK-18924: we have internal use
cases that would benefit significantly from improved collect performance
and would like to kick off an effort for SparkR similar to
https://issues.apache.org/jira/browse/SPARK-13534 (the Arrow-based
toPandas() work for PySpark).

Complex datatypes added a fair amount of complexity to SPARK-13534, and they
are not a requirement for us, so the thinking is that the initial proposal
would cover simple types only and fall back to the current implementation
for complex types.
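To make that scope concrete, here is a rough sketch of what the eligibility
check could look like on the Scala side. The names (ArrowCollectSupport,
supportsArrowCollect) and the exact type whitelist are illustrative only and
not part of any existing patch:

import org.apache.spark.sql.types._

object ArrowCollectSupport {

  // Illustrative whitelist of "simple" types; anything outside it
  // (arrays, maps, structs, decimals, ...) would fall back to the
  // existing collect/dfToCols path.
  private val simpleTypes: Set[DataType] =
    Set(BooleanType, IntegerType, LongType, FloatType, DoubleType, StringType)

  def supportsArrowCollect(schema: StructType): Boolean =
    schema.fields.forall(field => simpleTypes.contains(field.dataType))
}

collect would call something like this on the DataFrame's schema and only take
the Arrow path when it returns true.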

Integration would involve introducing a flag that enables the Arrow
serialization path in *collect* (
https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129),
which would call an Arrow implementation of *dfToCols* (
https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211)
that returns Arrow byte arrays.
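For discussion, below is a very rough sketch of what an Arrow-returning
counterpart to dfToCols might look like. dfToColsArrow is a hypothetical name,
the sketch only handles IntegerType columns, and the Arrow Java classes used
(RootAllocator, VectorSchemaRoot, IntVector, ArrowStreamWriter) have been
renamed and moved between Arrow releases, so treat this as a sketch against a
recent Arrow Java API rather than anything tied to 0.3:

import java.io.ByteArrayOutputStream

import scala.collection.JavaConverters._

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowStreamWriter
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, FieldType, Schema}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.IntegerType

object ArrowCollectSketch {

  // Hypothetical Arrow-backed counterpart to SQLUtils.dfToCols: serialize
  // each (integer) column of a collected DataFrame into an Arrow IPC stream
  // and return one byte array per column for the R side to deserialize.
  def dfToColsArrow(df: DataFrame): Array[Array[Byte]] = {
    val rows = df.collect()
    df.schema.fields.zipWithIndex.map { case (field, colIdx) =>
      require(field.dataType == IntegerType,
        s"sketch only handles IntegerType, got ${field.dataType}")

      val allocator = new RootAllocator(Long.MaxValue)
      val arrowField = new Field(field.name,
        FieldType.nullable(new ArrowType.Int(32, true)), Seq.empty[Field].asJava)
      val root = VectorSchemaRoot.create(
        new Schema(Seq(arrowField).asJava), allocator)
      root.allocateNew()
      val vector = root.getVector(field.name).asInstanceOf[IntVector]

      rows.zipWithIndex.foreach { case (row, rowIdx) =>
        if (row.isNullAt(colIdx)) vector.setNull(rowIdx)
        else vector.setSafe(rowIdx, row.getInt(colIdx))
      }
      root.setRowCount(rows.length)

      // Write a single record batch as an Arrow IPC stream into memory.
      val out = new ByteArrayOutputStream()
      val writer = new ArrowStreamWriter(root, null, out)
      writer.start()
      writer.writeBatch()
      writer.end()
      root.close()
      allocator.close()
      out.toByteArray
    }
  }
}

Returning one stream per column just mirrors the shape of the current
dfToCols; a single record batch (or batches per partition) for the whole
result would probably be the better design in practice.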

It looks like https://github.com/wesm/feather hasn't been updated since the
Arrow 0.3 release, so the assumption is that it would have to be updated
before it can convert the byte arrays from dfToCols into R data.frames? Wes
also brought up the idea of a unified serialization implementation for
Spark/Scala, R, and Python to make it easy to share IO optimizations.

Please let us know your thoughts/opinions on the above and the preferred
way of collaborating with the Arrow community on this.
-- 
VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
<http://www.forbes.com/fintech/2016/#310668d56680>
915 Broadway | Suite 502 | New York, NY 10010
(646)-838-2310
d...@dv01.co | www.dv01.co
