I can try to help.
_____________________________
From: Wes McKinney <wesmck...@gmail.com>
Sent: Monday, May 15, 2017 12:49 PM
Subject: Re: Improve SparkR collect performance with Arrow
To: Dirk Eddelbuettel <d...@eddelbuettel.com>, <dev@arrow.apache.org>, Jim Hester <james.hes...@rstudio.com>, Hadley Wickham <had...@rstudio.com>, Kevin Ushey <ke...@rstudio.com>

Adding Hadley and others to the conversation to advise on the best path
forward.

I am happy to help with maintenance of the C++ code. For example, if there
are API changes that affect the Rcpp bindings, I would help fix them. We
have GLib-based C and Cython bindings (Cython being roughly the Rcpp of
Python), so adding another binding layer to the mix is no problem.

I am eager to do work that benefits the R community, so hopefully among
all of us we can find a division of labor that will advance this effort.

Thanks
Wes

On Mon, May 15, 2017 at 11:01 AM, Dean Chen <d...@dv01.co> wrote:
> Hi Wes,
>
> We can work with the Spark community on the Spark/SparkR integration.
>
> Also happy to help with migrating the R package from Feather into Arrow.
>
> Do you have anyone in mind to manage the R/Rcpp binding issues? I
> reviewed the R and cpp files in
> https://github.com/wesm/feather/tree/master/R, and we may be able to
> take a first pass at it to get things off the ground. We will still want
> an Rcpp expert to review and own it, since we're not Rcpp experts and
> I'm sure it's riddled with caveats like any other fdw.
>
> We maintain lots of R packages internally and can help with, or take the
> lead on, R packaging/builds/testing in Travis in the Arrow project.
>
> On Sun, May 14, 2017 at 2:46 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> Note I just opened https://github.com/wesm/feather/pull/297, which
>> deletes all of the Feather Python code (using pyarrow as a dependency).
>>
>> On Sun, May 14, 2017 at 2:44 PM, Wes McKinney <wesmck...@gmail.com>
>> wrote:
>>
>> > hi Dean,
>> >
>> > In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather
>> > into the Arrow repo. The Feather format is a simplified version of the
>> > Arrow IPC format (which has file/batch and stream flavors), so the
>> > ideal approach would be to move the Feather R/Rcpp wrapper code into
>> > the Arrow codebase and generalize it to support the Arrow streams that
>> > are coming from Spark (as in SPARK-13534).
>> >
>> > Adding support for nested types should also be possible -- we have
>> > implemented more of the converters for them on the Python side. The
>> > Feather format doesn't support nested types, so we would want to
>> > deprecate that format as soon as practical (Feather has plenty of
>> > users, and we can always maintain the library(feather) import and
>> > associated R API).
>> >
>> > In any case, this seems like an ideal collaboration for the Spark and
>> > Arrow communities; what is missing is an experienced developer from
>> > the R community who can manage the R/Rcpp binding issues (I can help
>> > some with maintaining the C++ side of the bindings) and address
>> > packaging/builds/continuous integration.
>> >
>> > - Wes
>> >
>> > On Sun, May 14, 2017 at 1:26 PM, Dean Chen <d...@dv01.co> wrote:
>> >
>> >> Following up on the discussion from
>> >> https://issues.apache.org/jira/browse/SPARK-18924. We have internal
>> >> use cases that would benefit significantly from improved collect
>> >> performance and would like to kick off a proposal/effort for SparkR
>> >> similar to https://issues.apache.org/jira/browse/SPARK-13534.
>> >>
>> >> Complex datatypes introduced additional complexity to SPARK-13534,
>> >> and they are not a requirement for us, so we are thinking the initial
>> >> proposal would cover simple types, with a fallback to the current
>> >> implementation for complex types.
>> >>
>> >> Integration would involve introducing a flag that enables the Arrow
>> >> serialization logic in *collect* (
>> >> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129
>> >> ), which would call an Arrow implementation of *dfToCols* (
>> >> https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211
>> >> ) that returns Arrow byte arrays.
>> >>
>> >> It looks like https://github.com/wesm/feather hasn't been updated
>> >> since the Arrow 0.3 release, so we assume it would have to be updated
>> >> to convert the byte arrays from dfToCols into R data frames. Wes also
>> >> brought up that a unified serialization implementation for
>> >> Spark/Scala, R and Python would enable easy sharing of IO
>> >> optimizations.
>> >>
>> >> Please let us know your thoughts/opinions on the above and the
>> >> preferred way of collaborating with the Arrow community on this.
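_____________________________

To make the integration Dean proposes more concrete, here is a minimal
Scala sketch of the JVM half: a dfToCols-style method that returns a
column encoded as Arrow stream bytes instead of R-serialized rows. This
is an illustration under assumptions, not Spark code: it targets a recent
Arrow Java API (the 0.3-era class names differed), handles a single
hard-coded integer column, and the names ArrowCollectSketch and
dfToArrowBytes are hypothetical.

    import java.io.ByteArrayOutputStream

    import scala.collection.JavaConverters._

    import org.apache.arrow.memory.RootAllocator
    import org.apache.arrow.vector.{FieldVector, IntVector, VectorSchemaRoot}
    import org.apache.arrow.vector.ipc.ArrowStreamWriter

    object ArrowCollectSketch {
      // Hypothetical stand-in for an Arrow-producing dfToCols: encode one
      // integer column as an Arrow stream (schema + record batch) in memory.
      def dfToArrowBytes(column: Array[Int]): Array[Byte] = {
        val allocator = new RootAllocator(Long.MaxValue)
        val vector = new IntVector("value", allocator)
        vector.allocateNew(column.length)
        column.zipWithIndex.foreach { case (v, i) => vector.setSafe(i, v) }
        vector.setValueCount(column.length)

        val root = new VectorSchemaRoot(
          Seq(vector.getField).asJava,
          Seq[FieldVector](vector).asJava,
          column.length)
        val out = new ByteArrayOutputStream()
        // Null dictionary provider: no dictionary-encoded columns here.
        val writer = new ArrowStreamWriter(root, null, out)
        try {
          writer.start()
          writer.writeBatch() // a real implementation would batch per partition
          writer.end()
        } finally {
          writer.close()
          root.close()
          allocator.close()
        }
        out.toByteArray // bytes the R side would decode into a data.frame
      }
    }

On the R side, collect() would check the new flag and hand these bytes to
the Arrow/Feather R bindings to materialize the data.frame, replacing the
current row-by-row SerDe path for the supported simple types.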