I can try to help.
_____________________________
From: Wes McKinney <wesmck...@gmail.com>
Sent: Monday, May 15, 2017 12:49 PM
Subject: Re: Improve SparkR collect performance with Arrow
To: Dirk Eddelbuettel <d...@eddelbuettel.com>, <dev@arrow.apache.org>, Jim Hester <james.hes...@rstudio.com>, Hadley Wickham <had...@rstudio.com>, Kevin Ushey <ke...@rstudio.com>

Adding Hadley and others to the conversation to advise on the best path
forward.

I am happy to help with maintenance of the C++ code. For example, if there
are API changes that affect the Rcpp bindings, I would help fix them. We
have GLib-based C and Cython bindings (Cython being roughly the Rcpp of
Python), so adding another binding layer to the mix is no problem.

I am eager to do work that benefits the R community, so hopefully among
all of us we can find a division of labor that will advance this effort.

Thanks
Wes

On Mon, May 15, 2017 at 11:01 AM, Dean Chen <d...@dv01.co> wrote:
> Hi Wes,
>
> We can work with the Spark community on the Spark/SparkR integration.
>
> Also happy to help with migrating the R package from Feather into Arrow.
>
> Do you have anyone in mind to manage the R/Rcpp binding issues? I
> reviewed the R and cpp files in
> https://github.com/wesm/feather/tree/master/R, and we may be able to
> take a first pass at it to get things off the ground. We will still want
> an Rcpp expert to review and own it, since we're not Rcpp experts and
> I'm sure it's riddled with caveats like any other fdw.
>
> We maintain lots of R packages internally and can help with, or take the
> lead on, R packaging/builds/testing in Travis in the Arrow project.
>
> On Sun, May 14, 2017 at 2:46 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Note I just opened https://github.com/wesm/feather/pull/297, which
>> deletes all of the Feather Python code (using pyarrow as a dependency).
>>
>> On Sun, May 14, 2017 at 2:44 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>> > hi Dean,
>> >
>> > In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather
>> > into the Arrow repo. The Feather format is a simplified version of the
>> > Arrow IPC format (which has file/batch and stream flavors), so the
>> > ideal approach would be to move the Feather R/Rcpp wrapper code into
>> > the Arrow codebase and generalize it to support the Arrow streams that
>> > are coming from Spark (as in SPARK-13534).
>> >
>> > Adding support for nested types should also be possible -- we have
>> > implemented more of the converters for them on the Python side. The
>> > Feather format doesn't support nested types, so we would want to
>> > deprecate that format as soon as practical (Feather has plenty of
>> > users, and we can always maintain the library(feather) import and
>> > associated R API).
>> >
>> > In any case, this seems like an ideal collaboration for the Spark and
>> > Arrow communities; what is missing is an experienced developer from
>> > the R community who can manage the R/Rcpp binding issues (I can help
>> > some with maintaining the C++ side of the bindings) and address
>> > packaging/builds/continuous integration.
>> >
>> > - Wes
>> >
>> > On Sun, May 14, 2017 at 1:26 PM, Dean Chen <d...@dv01.co> wrote:
>> >
>> >> Following up on the discussion from
>> >> https://issues.apache.org/jira/browse/SPARK-18924. We have internal
>> >> use cases that would benefit significantly from improved collect
>> >> performance and would like to kick off a proposal/effort for SparkR
>> >> similar to https://issues.apache.org/jira/browse/SPARK-13534.
>> >>
>> >> Complex datatypes introduced additional complexity to SPARK-13534,
>> >> and they are not a requirement for us, so we are thinking the initial
>> >> proposal would cover simple types, with a fallback to the current
>> >> implementation for complex types.
>> >>
>> >> Integration would involve introducing a flag that enables the Arrow
>> >> serialization logic in *collect* (
>> >> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129
>> >> ), which would call an Arrow implementation of *dfToCols* (
>> >> https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211
>> >> ) that returns Arrow byte arrays.
>> >>
>> >> It looks like https://github.com/wesm/feather hasn't been updated
>> >> since the Arrow 0.3 release, so we assume it would have to be updated
>> >> to convert the byte arrays from dfToCols into R data frames. Wes also
>> >> brought up that a unified serialization implementation for
>> >> Spark/Scala, R and Python would enable easy sharing of IO
>> >> optimizations.
>> >>
>> >> Please let us know your thoughts/opinions on the above and the
>> >> preferred way of collaborating with the Arrow community on this.
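_____________________________

To make the integration Dean proposes more concrete, here is a minimal
Scala sketch of the JVM half: a dfToCols-style method that returns a
column encoded as Arrow stream bytes instead of R-serialized rows. This
is an illustration under assumptions, not Spark code: it targets a recent
Arrow Java API (the 0.3-era class names differed), handles a single
hard-coded integer column, and the names ArrowCollectSketch and
dfToArrowBytes are hypothetical.

    import java.io.ByteArrayOutputStream

    import scala.collection.JavaConverters._

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.{FieldVector, IntVector, VectorSchemaRoot}
    import org.apache.arrow.vector.ipc.ArrowStreamWriter

    object ArrowCollectSketch {
      // Hypothetical stand-in for an Arrow-producing dfToCols: encode one
      // integer column as an Arrow stream (schema + record batch) in memory.
      def dfToArrowBytes(column: Array[Int]): Array[Byte] = {
        val allocator = new RootAllocator(Long.MaxValue)
        val vector = new IntVector("value", allocator)
        vector.allocateNew(column.length)
        column.zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
        vector.setValueCount(column.length)

        val root = new VectorSchemaRoot(
          Seq(vector.getField).asJava,
          Seq[FieldVector](vector).asJava,
          column.length)
        val out = new ByteArrayOutputStream()
        // Null dictionary provider: no dictionary-encoded columns here.
        val writer = new ArrowStreamWriter(root, null, out)
        try {
          writer.start()
          writer.writeBatch() // a real implementation would batch per partition
          writer.end()
        } finally {
          writer.close()
          root.close()
          allocator.close()
        }
        out.toByteArray // bytes the R side would decode into a data.frame
      }
    }

On the R side, collect() would check the new flag and hand these bytes to
the Arrow/Feather R bindings to materialize the data.frame, replacing the
current row-by-row SerDe path for the supported simple types.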