Especially with Arrow support landing in Spark (SPARK-13534), it would
be helpful to combine efforts between Python and R on this front. I
also have a long list of improvements to the Feather format that will
be substantially simpler once library(feather) depends on the main
Arrow libraries.

I suggest you reach out to members of the R community directly on
public forums to ask for development help and advice and to solicit
collaboration. There are other R venues where you can describe your
use cases, like the R Consortium and its subcommittees:
https://www.r-consortium.org/. I would go directly to the mailing
lists and see if there is anyone who would like to get involved. It's
more likely that you'll get attention on this problem in the R mailing
lists than on the Arrow mailing list due to the chicken-and-egg
aspect.

As a side note, my opinion is that shared storage, memory formats, and
computing libraries (e.g. native C++ libraries targeting Arrow memory)
are going to be more and more important to the R / Python / Julia
communities (and beyond -- Kou has been developing Arrow interfaces
for Ruby, which has not traditionally had a large data science
community) as time passes. I would like to personally do more on the R
side but I simply don't have the bandwidth to take responsibility for
another major component, especially not in an unfamiliar software
development stack.

Let me know how I can help, and if there are R mailing list
discussions where we (the Arrow developers) can chime in, please alert
us to them here.

- Wes

On Wed, Jul 19, 2017 at 5:29 PM, Dean Chen <d...@dv01.co> wrote:
> I also sent a note about it to the dev list a month ago. We still have a huge
> internal need and are interested in helping push this along where we can.
> Unfortunately, our team is more focused around Spark and doesn't have much
> experience working with the R community.
>
> On Wed, Jul 19, 2017 at 1:44 PM Clark Fitzgerald <clarkfi...@gmail.com>
> wrote:
>
>> Hello all,
>>
>> I saw the notes come through from today's call:
>>
>> > * R Arrow Bindings?
>> >  - Find use cases within the R community, contributors needed
>> >  - R Feather bindings a useful starting point
>>
>> This year I've been working on parallel R on datasets in the 100+ GB range,
>> and have found that loading and saving data from text files is a real
>> bottleneck. Another consideration is breaking the data up into chunks for
>> parallel processing while maintaining metadata and overall structure. So
>> I've been watching Parquet and Arrow.
>>
>> Specifically here are two use cases in R where Arrow / Parquet could be
>> helpful:
>>
>> - Splitting up a large data set into pieces which fit comfortably in memory
>> then applying normal R functions to each piece. Basically GROUP BY.
>> - Matloff's Software Alchemy, statistical averaging based on independent
>> chunks of data. This requires rows to be randomly assigned to chunks.
>>
>> Another option besides starting from the R Feather bindings is to start
>> with an automatically generated set of bindings:
>> https://github.com/duncantl/RCodeGen
>>
>> Best,
>> Clark Fitzgerald
>>
> --
> VP of Engineering - dv01, Featured in Forbes Fintech 50 For 2016
> <http://www.forbes.com/fintech/2016/#310668d56680>
> 915 Broadway | Suite 502 | New York, NY 10010
> (646)-838-2310
> d...@dv01.co | www.dv01.co
