Note: I just opened https://github.com/wesm/feather/pull/297, which deletes all of the Feather Python code (making the feather Python package a thin wrapper using pyarrow as a dependency).
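With that change, reading and writing Feather files from Python goes through pyarrow. A minimal sketch of what usage looks like (assuming a pyarrow build that ships the feather module; the file name is made up for illustration):

    import pandas as pd
    import pyarrow.feather as feather

    # Write a pandas DataFrame to a Feather file via pyarrow
    df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
    feather.write_feather(df, "example.feather")  # "example.feather" is a made-up path

    # Read it back as a pandas DataFrame and check the round trip
    assert feather.read_feather("example.feather").equals(df)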
On Sun, May 14, 2017 at 2:44 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> hi Dean,
>
> In Arrow 0.3 we incorporated the C++ and Python code from wesm/feather
> into the Arrow repo. The Feather format is a simplified version of the
> Arrow IPC format (which has file/batch and stream flavors), so the ideal
> approach would be to move the Feather R/Rcpp wrapper code into the Arrow
> codebase and generalize it to support the Arrow streams coming from
> Spark (as in SPARK-13534).
>
> Adding support for nested types should also be possible -- we have
> implemented more of the converters for them on the Python side. The
> Feather format doesn't support nested types, so we would want to
> deprecate that format as soon as practical (Feather has plenty of users,
> and we can always maintain the library(feather) import and the
> associated R API).
>
> In any case, this seems like an ideal collaboration for the Spark and
> Arrow communities. What is missing is an experienced developer from the
> R community who can manage the R/Rcpp binding issues (I can help some
> with maintaining the C++ side of the bindings) and address packaging,
> builds, and continuous integration.
>
> - Wes
>
> On Sun, May 14, 2017 at 1:26 PM, Dean Chen <d...@dv01.co> wrote:
>
>> Following up on the discussion from
>> https://issues.apache.org/jira/browse/SPARK-18924. We have internal use
>> cases that would benefit significantly from improved collect
>> performance, and we would like to kick off a proposal/effort for SparkR
>> similar to https://issues.apache.org/jira/browse/SPARK-13534.
>>
>> Complex data types introduced additional complexity to SPARK-13534, and
>> they are not a requirement for us, so we are thinking the initial
>> proposal would cover simple types only, falling back to the current
>> implementation for complex types.
>>
>> Integration would involve introducing a flag in *collect* (
>> https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129
>> ) that enables the Arrow serialization logic by calling an Arrow
>> implementation of *dfToCols* (
>> https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211
>> ) that returns Arrow byte arrays.
>>
>> It looks like https://github.com/wesm/feather hasn't been updated since
>> the Arrow 0.3 release, so I assume it would have to be updated to
>> enable converting the byte arrays from dfToCols into R data frames? Wes
>> also brought up that a unified serialization implementation for
>> Spark/Scala, R, and Python would enable easy sharing of IO
>> optimizations.
>>
>> Please let us know your thoughts/opinions on the above and the
>> preferred way of collaborating with the Arrow community on this.
>>
>> --
>> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
>> <http://www.forbes.com/fintech/2016/#310668d56680>
>> 915 Broadway | Suite 502 | New York, NY 10010
>> (646)-838-2310
>> d...@dv01.co | www.dv01.co
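For anyone following the dfToCols discussion above: the proposal is to serialize record batches into the Arrow streaming format on the JVM side and deserialize them in the client. The SparkR side would do this in R/Rcpp against the Arrow C++ stream readers, but here is a minimal pyarrow sketch of the same round trip to illustrate the byte-level contract (Python is purely illustrative here and uses a pyarrow IPC API newer than what existed at the time of this thread; it is not the proposed R implementation):

    import pyarrow as pa

    # Build a record batch with two simple-typed columns (the "simple
    # types only" case from the proposal above)
    batch = pa.RecordBatch.from_arrays(
        [pa.array([1, 2, 3]), pa.array(["x", "y", "z"])],
        names=["a", "b"],
    )

    # Producer side: serialize to the Arrow streaming format -- the kind
    # of byte array an Arrow-based dfToCols would hand back to the client
    sink = pa.BufferOutputStream()
    writer = pa.ipc.new_stream(sink, batch.schema)
    writer.write_batch(batch)
    writer.close()
    buf = sink.getvalue()

    # Consumer side: read the stream back into a table / data frame;
    # SparkR's collect() would perform the equivalent step in R/Rcpp
    reader = pa.ipc.open_stream(buf)
    df = reader.read_all().to_pandas()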