Especially with Arrow support landing in Spark (SPARK-13534), it would be helpful to combine efforts between Python and R on this front. I also have a long list of improvements to the Feather format that will be substantially simpler once library(feather) depends on the main Arrow libraries.
I suggest you reach out to members of the R community directly on public forums about development help / advice and soliciting collaboration. There are other R venues where you can describe your use cases, like the R Consortium and its subcommittees: https://www.r-consortium.org/. I would go directly to the mailing lists and see if there is anyone who would like to get involved. It's more likely that you'll get attention on this problem in the R mailing lists than on the Arrow mailing list due to the chicken-and-egg aspect.

As a side note, my opinion is that shared storage, memory formats, and computing libraries (e.g. native C++ libraries targeting Arrow memory) are going to be more and more important to the R / Python / Julia communities (and beyond -- Kou has been developing Arrow interfaces for Ruby, which has not traditionally had a large data science community) as time passes.

I would like to personally do more on the R side but I simply don't have the bandwidth to take responsibility for another major component, especially not in an unfamiliar software development stack.

Let me know how I can help, and if there are R mailing list discussions where we (the Arrow developers) can chime in please alert us to them here.

- Wes

On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <d...@dv01.co> wrote:
> I also sent a note about it to the dev list a month ago. Still have a huge
> internal need and interested in helping push this along where we can.
> Unfortunately, our team is more focused around Spark and doesn't have much
> experience working with the R community.
>
> On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <clarkfi...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I saw the notes come through from today's call:
>>
>> > * R Arrow Bindings?
>> > - Find use cases within the R community, contributors needed
>> > - R Feather bindings a useful starting point
>>
>> This year I've been working on parallel R on datasets in the 100+ GB range,
>> and have found that loading and saving data from text files is a real
>> bottleneck. Another consideration is breaking the data up into chunks for
>> parallel processing while maintaining metadata and overall structure. So
>> I've been watching Parquet and Arrow.
>>
>> Specifically here are two use cases in R where Arrow / Parquet could be
>> helpful:
>>
>> - Splitting up a large data set into pieces which fit comfortably in memory
>> then applying normal R functions to each piece. Basically GROUP BY.
>> - Matloff's Software Alchemy, statistical averaging based on independent
>> chunks of data. This requires rows to be randomly assigned to chunks.
>>
>> Another option besides starting from the R Feather bindings is to start
>> with an automatically generated set of bindings:
>> https://github.com/duncantl/RCodeGen
>>
>> Best,
>> Clark Fitzgerald
>>
> --
> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> <http://www.forbes.com/fintech/2016/#310668d56680>
> 915 Broadway | Suite 502 | New York, NY 10010
> (646)-838-2310
> d...@dv01.co | www.dv01.co
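
As a rough illustration of the second use case quoted above (random assignment of rows to chunks, followed by averaging of per-chunk results), here is a minimal Python sketch. It works on a plain in-memory list rather than on Arrow or Parquet data, and the names random_chunks and chunk_averaged_estimate are hypothetical, chosen only for this example. It is not taken from Matloff's implementation or from any R package; it only shows the shape of the computation.

```python
import random
from statistics import fmean


def random_chunks(rows, n_chunks, seed=0):
    """Randomly assign rows to n_chunks chunks of near-equal size."""
    shuffled = list(rows)
    # A seeded generator keeps the assignment reproducible.
    random.Random(seed).shuffle(shuffled)
    # Striding over the shuffled rows gives chunk sizes that differ by at most one.
    return [shuffled[i::n_chunks] for i in range(n_chunks)]


def chunk_averaged_estimate(rows, n_chunks, estimator, seed=0):
    """Apply estimator to each chunk independently, then average the results."""
    chunks = random_chunks(rows, n_chunks, seed)
    # Each estimator(chunk) call is independent, so in practice these calls
    # could run in separate worker processes; here they run sequentially.
    return fmean(estimator(chunk) for chunk in chunks)


if __name__ == "__main__":
    data = [float(x) for x in range(1, 101)]
    # With the sample mean as the estimator and equal-sized chunks, the
    # averaged result matches the full-data mean up to rounding.
    print(chunk_averaged_estimate(data, n_chunks=4, estimator=fmean))
```

In a real pipeline each chunk would presumably be a separate file or record batch read by its own worker, which is where a shared columnar format would replace the text-file loading step described in the quoted message.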