Following up on the discussion in https://issues.apache.org/jira/browse/SPARK-18924: we have internal use cases that would benefit significantly from improved collect performance, and we'd like to kick off a proposal/effort for SparkR similar to https://issues.apache.org/jira/browse/SPARK-13534.
Complex data types introduced additional complexity to SPARK-13534, and they aren't a requirement for us, so we're thinking the initial proposal would cover simple types only and fall back to the current implementation for complex types. Integration would involve introducing a flag in *collect* (https://github.com/apache/spark/blob/branch-2.2/R/pkg/R/DataFrame.R#L1129) to enable the Arrow serialization logic, which would call an Arrow implementation of *dfToCols* (https://github.com/apache/spark/blob/branch-2.2/sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala#L211) that returns Arrow byte arrays; a rough Scala sketch of that piece is at the end of this mail.

It looks like https://github.com/wesm/feather hasn't been updated since the Arrow 0.3 release, so we're assuming it would have to be updated to convert the byte arrays coming out of dfToCols into R data frames? Wes also brought up a unified serialization implementation for Spark/Scala, R, and Python that would enable easy sharing of IO optimizations.

Please let us know your thoughts/opinions on the above and the preferred way of collaborating with the Arrow community on this.
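For concreteness, here is a rough sketch of what the Scala side could look like: a hypothetical dfToArrowBytes sitting next to the existing dfToCols in SQLUtils. The method name, the single-IntegerType-column restriction, and the fallback behaviour are all placeholders, and the exact Arrow Java class/package names differ between Arrow releases (this follows the 0.8+ layout), so please read it as the shape of the change rather than a working patch.

// Hypothetical Arrow-backed alternative to SQLUtils.dfToCols (sketch only).
// Handles a single IntegerType column; anything else would fall back to dfToCols.
import java.io.ByteArrayOutputStream

import scala.collection.JavaConverters._

import org.apache.arrow.memory.RootAllocator
import org.apache.arrow.vector.{IntVector, VectorSchemaRoot}
import org.apache.arrow.vector.ipc.ArrowStreamWriter
import org.apache.arrow.vector.types.pojo.{ArrowType, Field, Schema}

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.IntegerType

object ArrowCollectSketch {

  /** Collect a single-integer-column DataFrame as one Arrow stream (bytes). */
  def dfToArrowBytes(df: DataFrame): Array[Byte] = {
    val sparkField = df.schema.fields.head
    require(sparkField.dataType == IntegerType,
      "sketch only handles one IntegerType column; fall back to dfToCols otherwise")

    val rows = df.collect()  // driver-side collect, as dfToCols does today
    val allocator = new RootAllocator(Long.MaxValue)

    // Build a one-column Arrow schema mirroring the Spark schema.
    val arrowSchema = new Schema(
      List(Field.nullable(sparkField.name, new ArrowType.Int(32, true))).asJava)
    val root = VectorSchemaRoot.create(arrowSchema, allocator)
    val vector = root.getVector(sparkField.name).asInstanceOf[IntVector]

    // Copy the collected rows into the Arrow vector, preserving nulls.
    vector.allocateNew(rows.length)
    rows.zipWithIndex.foreach { case (row, i) =>
      if (row.isNullAt(0)) vector.setNull(i) else vector.setSafe(i, row.getInt(0))
    }
    vector.setValueCount(rows.length)
    root.setRowCount(rows.length)

    // Serialize the batch in the Arrow streaming format so the R side can
    // read the bytes back (e.g. via feather/Arrow R bindings once updated).
    val out = new ByteArrayOutputStream()
    val writer = new ArrowStreamWriter(root, null, out)
    writer.start()
    writer.writeBatch()
    writer.end()
    writer.close()

    root.close()
    allocator.close()
    out.toByteArray
  }
}

The flag-gated R side would then pass these bytes to whatever reader the R bindings expose instead of the current dfToCols column lists, but that half is exactly the part we'd want input on from the Arrow community.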