Re: Arrow and R benchmark

Wes McKinney Wed, 13 Feb 2019 13:26:50 -0800

Would someone like to make some feature requests to Google or engage
with them in another way? I have interacted with GCP in the past; I
think it would be helpful for them to hear from other Arrow users or
community members since I have been quite public as a carrier of the
Arrow banner.


On Tue, Feb 5, 2019 at 12:11 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Disclaimer: I work for Google (not on BQ).  Everything I'm going to write
> reflects my own opinions, not those of my company.
>
> Jonathan and Wes,
>
> One way of trying to get support for this is filing a feature request at
> [1] and getting broader customer support for it.  Another possible way of
> gaining broader exposure within Google is collaborating with other open
> source projects that it contributes to.  For instance there was a
> conversation recently about the potential use of Arrow on the Apache Beam
> mailing list [2].  I will try to post a link to this thread internally, but
> I can't make any promises and likely won't be able to give any updates on progress.
>
> This is also very much my own opinion, but I think in order to expose Arrow
> in a public API it would be nice to reach a stable major release (i.e.
> 1.0.0) and ensure that Arrow properly supports BigQuery data types [3]
> (I think it mostly does, but date/time might be an issue).
>
> [1]
> https://cloud.google.com/support/docs/issue-trackers#search_for_or_create_bugs_and_feature_requests_by_product
> [2]
> https://lists.apache.org/thread.html/32cbbe587016cd0ac9e1f7b1de457b0bd69936c88dfdc734ffa366db@%3Cdev.beam.apache.org%3E
> [3] https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types
>
>
> On Monday, February 4, 2019, Wes McKinney <wesmck...@gmail.com> wrote:
>
> > Arrow support would be an obvious win for BigQuery. I've spoken with
> > people at Google Cloud about this on several occasions.
> >
> > With the gRPC / Flight work coming along it might be a good
> > opportunity to rekindle the discussion. If anyone from GCP is reading
> > or if you know anyone at GCP who might be able to work with us I would
> > be very interested.
> >
> > One hurdle for BigQuery, as I understand it, is that Google has
> > policies in place that make it more difficult to take on external
> > library dependencies in a sensitive system like Dremel / BigQuery. So
> > someone from Google might have to develop an in-house Arrow
> > implementation sufficient to send Arrow datasets from BigQuery to
> > clients. The scope of that project is small enough (requiring only
> > Flatbuffers as a dependency) that a motivated C or C++ developer at
> > Google ought to be able to get it done in a month or two of focused
> > work.
> >
> > - Wes
> >
> > On Mon, Feb 4, 2019 at 4:40 PM Jonathan Chiang <chiang...@gmail.com>
> > wrote:
> > >
> > > Hi Wes,
> > >
> > > I am currently working a lot with Google BigQuery in R and Python.
> > Hadley Wickham describes a big bottleneck in his library bigrquery:
> > >
> > > The bottleneck for loading BigQuery data is now parsing BigQuery’s JSON
> > format, which is difficult to optimise further because I’m already using
> > the fastest C++ JSON parser, RapidJson. If this is still too slow (because
> > you download a lot of data), see ?bq_table_download for an alternative
> > approach.
> > >
> > > Is there any momentum for Arrow to partner with Google here?
> > >
> > > Thanks,
> > >
> > > Jonathan
> > >
> > >
> > >
> > > On Mon, Dec 3, 2018 at 7:03 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >>
> > >> hi Jonathan,
> > >> On Sat, Nov 24, 2018 at 6:19 PM Jonathan Chiang <chiang...@gmail.com>
> > wrote:
> > >> >
> > >> > Hi Wes and Romain,
> > >> >
> > >> > I wrote a preliminary benchmark for reading and writing different
> > file types from R into Arrow, borrowing some code from Hadley. I would like
> > some feedback to improve it and then possibly push an R/benchmarks folder. I
> > am willing to dedicate most of next week to this project, as I am taking a
> > vacation from work, but would like to contribute to Arrow and R.
> > >> >
> > >> > To Romain: What is the difference in R when using tibble versus
> > reading from arrow?
> > >> > Is the general advantage that you can serialize the data to Arrow
> > when saving it, and then read it in Python with Arrow and then pandas?
> > >>
> > >> Arrow has a language-independent binary protocol for data interchange
> > >> that does not require deserialization of data on read. It can be read
> > >> or written in many different ways: files, sockets, shared memory, etc.
> > >> How it gets used depends on the application.
> > >>
> > >> >
> > >> > General Roadmap Question to Wes and Romain :
> > >> > My vision for the future of data science is the ability to serialize
> > data securely and pass data and models securely with some form of
> > authentication between IDEs with secure ports. This idea would develop with
> > something similar to gRPC, with more security designed for sharing data. I
> > noticed the Flight gRPC work.
> > >> >
> > >>
> > >> Correct, our plan for RPC is to use gRPC for secure transport of
> > >> components of the Arrow columnar protocol. We'd love to have more
> > >> developers involved with this effort.
> > >>
> > >> > Also, I was interested in whether there was any momentum in the R community
> > to serialize models, similar to the work of ONNX, into a unified model
> > storage system. The idea is to have a secure reproducible environment for R
> > and Python developer groups to readily share models and data, with the
> > caveat that data sent also has added security and possibly a history
> > associated with it for security. This piece of work is something I am
> > passionate about seeing come to fruition, and I would like to explore options
> > for making it happen.
> > >> >
> > >>
> > >> Here we are focused on efficient handling and processing of datasets.
> > >> These tools could be used to build a model storage system if so
> > >> desired.
> > >>
> > >> > The background for me is to enable healthcare teams to share medical
> > data securely among different analytics teams. The security provisions
> > would enable more robust cloud-based storage and computation in a secure
> > fashion.
> > >> >
> > >>
> > >> I would like to see deeper integration with cloud storage services in
> > >> 2019 in the core C++ libraries, which would be made available in R,
> > >> Python, Ruby, etc.
> > >>
> > >> - Wes
> > >>
> > >> > Thanks,
> > >> > Jonathan
> > >> >
> > >> >
> > >> >
> > >> > Side Note:
> > >> > Building Arrow for R on Linux was a big hassle relative to macOS; I was
> > unable to build on Linux.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On Thu, Nov 15, 2018 at 7:50 PM Jonathan Chiang <chiang...@gmail.com>
> > wrote:
> > >> >>
> > >> >> I'll go through that python repo and see what I can do.
> > >> >>
> > >> >> Thanks,
> > >> >> Jonathan
> > >> >>
> > >> >> On Thu, Nov 15, 2018 at 1:55 PM Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >> >>>
> > >> >>> I would suggest starting an r/benchmarks directory like we have in
> > >> >>> Python (
> > https://github.com/apache/arrow/tree/master/python/benchmarks)
> > >> >>> and documenting the process for running all the benchmarks.
> > >> >>> On Thu, Nov 15, 2018 at 4:52 PM Romain François <rom...@purrple.cat>
> > wrote:
> > >> >>> >
> > >> >>> > Right now, most of the code examples are in the unit tests, but
> > those do not measure or stress performance. Perhaps you can start
> > from there?
> > >> >>> >
> > >> >>> > Romain
> > >> >>> >
> > >> >>> > > On 15 Nov 2018, at 22:16, Wes McKinney <wesmck...@gmail.com>
> > wrote:
> > >> >>> > >
> > >> >>> > > Adding dev@arrow.apache.org
> > >> >>> > >> On Thu, Nov 15, 2018 at 4:13 PM Jonathan Chiang <
> > chiang...@gmail.com> wrote:
> > >> >>> > >>
> > >> >>> > >> Hi,
> > >> >>> > >>
> > >> >>> > >> I would like to contribute to developing benchmark suites for
> > R and Arrow. What would be the best way to start?
> > >> >>> > >>
> > >> >>> > >> Thanks,
> > >> >>> > >> Jonathan
> > >> >>> >
> >
