Hi Dean,

In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather into
the Arrow repo. The Feather format is a simplified version of the Arrow IPC
format (which has file and stream flavors), so the ideal approach would be
to move the Feather R/Rcpp wrapper code into the Arrow codebase and
generalize it to support the Arrow streams coming from Spark (as in
SPARK-13534).
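
For illustration, here is a minimal sketch of the stream flavor using the
pyarrow API (the generalized R binding would expose an equivalent reader
for these streams):

    import pyarrow as pa

    # Write one record batch as an Arrow IPC stream, then read it back
    batch = pa.record_batch(
        [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
        names=["x", "y"],
    )

    sink = pa.BufferOutputStream()
    with pa.ipc.new_stream(sink, batch.schema) as writer:
        writer.write_batch(batch)  # stream = schema message + record batches

    table = pa.ipc.open_stream(sink.getvalue()).read_all()
    print(table.to_pydict())  # {'x': [1, 2, 3], 'y': ['a', 'b', 'c']}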

Adding support for nested types should also be possible -- we have
implemented more of the converters for them on the Python side. The Feather
format doesn't support nested types, so we would want to deprecate that
format as soon as practical (Feather has plenty of users, but we can keep
maintaining the library(feather) import and the associated R API even after
the file format is deprecated).
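
As a quick illustration of what the Python converters already handle
(Feather cannot represent this, but the Arrow format can):

    import pyarrow as pa

    # A list<int64> column with a null entry -- a nested type Feather
    # cannot store, but which converts cleanly through Arrow
    arr = pa.array([[1, 2], [3], None])
    table = pa.table({"nested": arr})
    print(table.schema)  # nested: list<item: int64>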

In any case, this seems like an ideal collaboration for the Spark and Arrow
communities; what's missing is an experienced developer from the R community
who can manage the R/Rcpp binding issues (I can help some with maintaining
the C++ side of the bindings) and handle packaging, builds, and continuous
integration.
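
To make the proposed collect / dfToCols flow below concrete, the receiving
end amounts to roughly the following (a sketch; the helper name is
hypothetical, and the R binding would do the analogous Table-to-data.frame
step):

    import pyarrow as pa

    # Hypothetical helper: deserialize the raw Arrow stream bytes that a
    # dfToCols-style method would return
    def arrow_stream_bytes_to_df(raw_bytes):
        reader = pa.ipc.open_stream(raw_bytes)  # accepts bytes or a Buffer
        return reader.read_all().to_pandas()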

- Wes

On Sun, May 14, 2017 at 1:26 PM, Dean Chen <d...@dv01.co> wrote:

> Following up on the discussion from
> https://issues.apache.org/jira/browse/SPARK-18924. We have internal use
> cases that would benefit significantly from improved collect performance,
> and we would like to kick off a proposal/effort for SparkR similar to
> https://issues.apache.org/jira/browse/SPARK-13534.
>
> Complex data types introduced additional complexity to SPARK-13534, and
> they are not a requirement for us, so we are thinking the initial proposal
> would cover simple types only, falling back on the current implementation
> for complex types.
>
> Integration would involve introducing a flag in *collect* (
> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129)
> to enable the Arrow serialization logic, which would call an Arrow
> implementation of *dfToCols* (
> https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211)
> that returns Arrow byte arrays.
>
> It looks like https://github.com/wesm/feather hasn't been updated since
> the Arrow 0.3 release, so we assume it would have to be updated to enable
> converting the byte arrays from dfToCols into R data frames? Wes also
> brought up a unified serialization implementation for Spark/Scala, R, and
> Python that would enable easy sharing of IO optimizations.
>
> Please let us know your thoughts/opinions on the above and the preferred
> way of collaborating with the Arrow community on this.
> --
> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> <http://www.forbes.com/fintech/2016/#310668d56680>
> 915 Broadway | Suite 502 | New York, NY 10010
> (646)-838-2310
> d...@dv01.co | www.dv01.co
>
