Re: Arrow and R benchmark

Jonathan Chiang Mon, 04 Feb 2019 14:40:45 -0800

Hi Wes,

I am currently working a lot with Google BigQuery in R and Python. Hadley
Wickham lists JSON parsing as the main bottleneck in his bigrquery library:


*The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
format, which is difficult to optimise further because I’m already using
the fastest C++ JSON parser, RapidJson <http://rapidjson.org/>. If this is
still too slow (because you download a lot of data),
see ?bq_table_download for an alternative approach.*

Is there any momentum for Arrow to partner with Google here?
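
A minimal sketch of why a binary columnar layout avoids the parsing cost described in the quote above. It uses only the Python standard library and calls neither the Arrow nor the BigQuery API; the sample values and variable names are made up for illustration.

```python
import json
from array import array

# Hypothetical column of integers, used only for illustration.
values = list(range(5))

# Text route: each value is rendered as characters and must be parsed back.
json_payload = json.dumps(values).encode("utf-8")
parsed = json.loads(json_payload)

# Binary route: fixed-width 64-bit integers laid out back to back.
binary_payload = array("q", values).tobytes()

# A memoryview reinterprets the existing bytes in place,
# so no parsing step and no copy are needed to read a value.
column = memoryview(binary_payload).cast("q")

print(parsed[3], column[3])
```

Both routes yield the same values, but the binary route reads them straight from the buffer, which is the property a columnar interchange format relies on.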

Thanks,

Jonathan



On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Jonathan,
> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com>
> wrote:
> >
> > Hi Wes and Romain,
> >
> > I wrote a preliminary benchmark for reading and writing different file
> > types from R into Arrow, borrowing some code from Hadley. I would like some
> > feedback to improve it and then possibly push an R/benchmarks folder. I am
> > willing to dedicate most of next week to this project, as I am taking a
> > vacation from work, but would like to contribute to Arrow and R.
> >
> > To Romain: What is the difference in R between using a tibble and reading
> > from Arrow?
> > Is the general advantage that you can serialize the data to Arrow when
> > saving it, and then load it in Python with Arrow and then pandas?
>
> Arrow has a language-independent binary protocol for data interchange
> that does not require deserialization of data on read. It can be read
> or written in many different ways: files, sockets, shared memory, etc.
> How it gets used depends on the application.
>
> >
> > General roadmap question for Wes and Romain:
> > My vision for the future of data science is the ability to serialize
> > data securely and pass data and models between IDEs over secure ports,
> > with some form of authentication. This idea could be developed with
> > something similar to gRPC, with more security designed for sharing data. I
> > noticed the Flight gRPC work.
> >
>
> Correct, our plan for RPC is to use gRPC for secure transport of
> components of the Arrow columnar protocol. We'd love to have more
> developers involved with this effort.
>
> > Also, I am interested in whether there is any momentum in the R community
> > to serialize models, similar to the work of ONNX, into a unified model
> > storage system. The idea is to have a secure, reproducible environment for
> > R and Python developer groups to readily share models and data, with the
> > caveat that data sent also has added security and possibly a history
> > associated with it for security. This piece of work is something I am
> > passionate about seeing come to fruition, and I would like to explore
> > options for making it happen.
> >
>
> Here we are focused on efficient handling and processing of datasets.
> These tools could be used to build a model storage system if so
> desired.
>
> > My motivation is to enable healthcare teams to share medical data
> > securely among different analytics teams. The security provisions
> > would enable more robust cloud-based storage and computation.
> >
>
> I would like to see deeper integration with cloud storage services in
> 2019 in the core C++ libraries, which would be made available in R,
> Python, Ruby, etc.
>
> - Wes
>
> > Thanks,
> > Jonathan
> >
> >
> >
> > Side Note:
> > Building Arrow for R on Linux was a big hassle relative to macOS. I was
> > unable to build on Linux.
> >
> >
> >
> >
> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com>
> > wrote:
> >>
> >> I'll go through that python repo and see what I can do.
> >>
> >> Thanks,
> >> Jonathan
> >>
> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com>
> >> wrote:
> >>>
> >>> I would suggest starting an r/benchmarks directory like we have in
> >>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
> >>> and documenting the process for running all the benchmarks.
> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat>
> >>> wrote:
> >>> >
> >>> > Right now, most of the code examples are in the unit tests, but these
> >>> > do not measure performance or stress the code. Perhaps you can start
> >>> > from there?
> >>> >
> >>> > Romain
> >>> >
> >>> > > On Nov 15, 2018, at 22:16, Wes McKinney <wesmck...@gmail.com>
> >>> > > wrote:
> >>> > >
> >>> > > Adding dev@arrow.apache.org
> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang
> >>> > >> <chiang...@gmail.com> wrote:
> >>> > >>
> >>> > >> Hi,
> >>> > >>
> >>> > >> I would like to contribute to developing benchmark suites for R
> >>> > >> and Arrow. What would be the best way to start?
> >>> > >>
> >>> > >> Thanks,
> >>> > >> Jonathan
> >>> >
>
