Note: I just opened https://github.com/wesm/feather/pull/297, which deletes all of the Feather Python code (making the feather Python package a thin wrapper using pyarrow as a dependency).
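With that change, reading and writing Feather files from Python goes through pyarrow. A minimal sketch of what usage looks like (assuming a pyarrow build that ships the feather module; the file name is made up for illustration):

    import pandas as pd
    import pyarrow.feather as feather

    # Write a pandas DataFrame to a Feather file via pyarrow
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    feather.write_feather(df, "example.feather")  # "example.feather" is a made-up path

    # Read it back as a pandas DataFrame and check the round trip
    assert feather.read_feather("example.feather").equals(df)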
On Sun, May 14, 2017 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Dean,
>
> In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather
> into the Arrow repo. The Feather format is a simplified version of the
> Arrow IPC format (which has file/batch and stream flavors), so the ideal
> approach would be to move the Feather R/Rcpp wrapper code into the Arrow
> codebase and generalize it to support the Arrow streams coming from
> Spark (as in SPARK-13534).
>
> Adding support for nested types should also be possible -- we have
> implemented more of the converters for them on the Python side. The
> Feather format doesn't support nested types, so we would want to
> deprecate that format as soon as practical (Feather has plenty of users,
> and we can always maintain the library(feather) import and the
> associated R API).
>
> In any case, this seems like an ideal collaboration for the Spark and
> Arrow communities. What is missing is an experienced developer from the
> R community who can manage the R/Rcpp binding issues (I can help some
> with maintaining the C++ side of the bindings) and address packaging,
> builds, and continuous integration.
>
> - Wes
>
> On Sun, May 14, 2017 at 1:26 PM, Dean Chen <d...@dv01.co> wrote:
>
>> Following up on the discussion from
>> https://issues.apache.org/jira/browse/SPARK-18924. We have internal use
>> cases that would benefit significantly from improved collect
>> performance, and we would like to kick off a proposal/effort for SparkR
>> similar to https://issues.apache.org/jira/browse/SPARK-13534.
>>
>> Complex data types introduced additional complexity to SPARK-13534, and
>> they are not a requirement for us, so we are thinking the initial
>> proposal would cover simple types only, falling back to the current
>> implementation for complex types.
>>
>> Integration would involve introducing a flag in *collect* (
>> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129
>> ) that enables the Arrow serialization logic by calling an Arrow
>> implementation of *dfToCols* (
>> https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211
>> ) that returns Arrow byte arrays.
>>
>> It looks like https://github.com/wesm/feather hasn't been updated since
>> the Arrow 0.3 release, so I assume it would have to be updated to
>> enable converting the byte arrays from dfToCols into R data frames? Wes
>> also brought up that a unified serialization implementation for
>> Spark/Scala, R, and Python would enable easy sharing of IO
>> optimizations.
>>
>> Please let us know your thoughts/opinions on the above and the
>> preferred way of collaborating with the Arrow community on this.
>>
>> --
>> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
>> <http://www.forbes.com/fintech/2016/#310668d56680>
>> 915 Broadway | Suite 502 | New York, NY 10010
>> (646)-838-2310
>> d...@dv01.co | www.dv01.co
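For anyone following the dfToCols discussion above: the proposal is to serialize record batches into the Arrow streaming format on the JVM side and deserialize them in the client. The SparkR side would do this in R/Rcpp against the Arrow C++ stream readers, but here is a minimal pyarrow sketch of the same round trip to illustrate the byte-level contract (Python is purely illustrative here and uses a pyarrow IPC API newer than what existed at the time of this thread; it is not the proposed R implementation):

    import pyarrow as pa

    # Build a record batch with two simple-typed columns (the "simple
    # types only" case from the proposal above)
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(["x", "y", "z"])],
        names=["a", "b"],
    )

    # Producer side: serialize to the Arrow streaming format -- the kind
    # of byte array an Arrow-based dfToCols would hand back to the client
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()

    # Consumer side: read the stream back into a table / data frame;
    # SparkR's collect() would perform the equivalent step in R/Rcpp
    reader = pa.ipc.open_stream(buf)
    df = reader.read_all().to_pandas()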