Hi Wes,

I am currently working a lot with Google BigQuery in R and Python. Hadley Wickham lists parsing BigQuery's JSON output as a big bottleneck for his library bigrquery:
*The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
format, which is difficult to optimise further because I’m already using
the fastest C++ JSON parser, RapidJson <http://rapidjson.org/>. If this is
still too slow (because you download a lot of data), see
?bq_table_download for an alternative approach.*

Is there any momentum for Arrow to partner with Google here?

Thanks,
Jonathan

On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:

> hi Jonathan,
>
> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >
> > Hi Wes and Romain,
> >
> > I wrote a preliminary benchmark for reading and writing different
> > file types from R into Arrow, borrowing some code from Hadley. I
> > would like some feedback to improve it and then possibly push an
> > R/benchmarks folder. I am willing to dedicate most of next week to
> > this project, as I am taking a vacation from work, but would like
> > to contribute to Arrow and R.
> >
> > To Romain: What is the difference in R when using tibble versus
> > reading from Arrow?
> > Is the general advantage that you can serialize the data to Arrow
> > when saving it? Then be able to call it in Python with Arrow, then
> > pandas?
>
> Arrow has a language-independent binary protocol for data interchange
> that does not require deserialization of data on read. It can be read
> or written in many different ways: files, sockets, shared memory, etc.
> How it gets used depends on the application.
>
> >
> > General Roadmap Question to Wes and Romain:
> > My vision for the future of data science is the ability to
> > serialize data securely and pass data and models securely with some
> > form of authentication between IDEs with secure ports. This idea
> > would develop with something similar to gRPC, with more security
> > designed around sharing data. I noticed Flight gRPC.
> >
>
> Correct, our plan for RPC is to use gRPC for secure transport of
> components of the Arrow columnar protocol.
> We'd love to have more developers involved with this effort.
>
> > Also, I was interested if there was any momentum in the R community
> > to serialize models, similar to the work of ONNX, into a unified
> > model storage system. The idea is to have a secure, reproducible
> > environment for R and Python developer groups to readily share
> > models and data, with the caveat that data sent also has added
> > security and possibly a history associated with it for security.
> > This piece of work is something I am passionate about seeing come
> > to fruition, and I would like to explore options for realizing it.
> >
>
> Here we are focused on efficient handling and processing of datasets.
> These tools could be used to build a model storage system if so
> desired.
>
> > The background for me is to enable healthcare teams to share
> > medical data securely among different analytics teams. The security
> > provisions would enable more robust cloud-based storage and
> > computation in a secure fashion.
> >
>
> I would like to see deeper integration with cloud storage services in
> 2019 in the core C++ libraries, which would be made available in R,
> Python, Ruby, etc.
>
> - Wes
>
> > Thanks,
> > Jonathan
> >
> > Side Note:
> > Building Arrow for R on Linux was a big hassle relative to Mac. I
> > was unable to build on Linux.
> >
> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >>
> >> I'll go through that Python repo and see what I can do.
> >>
> >> Thanks,
> >> Jonathan
> >>
> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >>>
> >>> I would suggest starting an r/benchmarks directory like we have in
> >>> Python (https://github.com/apache/arrow/tree/master/python/benchmarks)
> >>> and documenting the process for running all the benchmarks.
> >>>
> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat> wrote:
> >>> >
> >>> > Right now, most of the code examples are in the unit tests, but
> >>> > these are not measuring performance or stress-testing it.
> >>> > Perhaps you can start from there?
> >>> >
> >>> > Romain
> >>> >
> >>> > > On 15 Nov 2018 at 22:16, Wes McKinney <wesmck...@gmail.com> wrote:
> >>> > >
> >>> > > Adding dev@arrow.apache.org
> >>> > >
> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <chiang...@gmail.com> wrote:
> >>> > >>
> >>> > >> Hi,
> >>> > >>
> >>> > >> I would like to contribute to developing benchmark suites for
> >>> > >> R and Arrow. What would be the best way to start?
> >>> > >>
> >>> > >> Thanks,
> >>> > >> Jonathan
> >>> > >