Hi Tim,

I should think that the reader API should support deserializing a blob of schemaless Avro records as an Arrow record batch, or even feeding the reader one serialized record at a time to build a record batch incrementally.
- Wes

On Wed, Jun 12, 2019 at 1:25 PM Tim Swast <sw...@google.com.invalid> wrote:
>
> > Let me know if you want to collaborate on it.
>
> Thanks Micah.
>
> What are your thoughts on reading schemaless Avro bytes? One of the reasons
> I have started experimenting with the fork is that fastavro had trouble
> reading more than one row at a time from a schemaless reader.
>
> Tim Swast
> Software Friendliness Engineer
> Google Cloud Developer Relations
> Seattle, WA, USA
>
>
> On Tue, Jun 11, 2019 at 10:29 PM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
> > Hi Tim,
> > The Avro support in C++ has been on my backlog for a while. I'm going to
> > try to take the first few steps towards this over the next couple of days.
> > Let me know if you want to collaborate on it. C++ is a lot nicer now than
> > it was 8 years ago :)
> >
> > Cheers,
> > Micah
> >
> > On Tue, Jun 11, 2019 at 6:40 PM Tim Swast <sw...@google.com.invalid>
> > wrote:
> >
> > > Thanks for the advice, Wes.
> > >
> > > Unfortunately, I am about 8 years out of practice for writing any C++
> > > (which was part of the appeal of numba to me). Sounds like I should
> > > refresh my skills. I like the idea of write once, have good performance
> > > everywhere.
> > >
> > > On Tue, Jun 11, 2019 at 3:40 PM Wes McKinney <wesmck...@gmail.com>
> > > wrote:
> > >
> > > > Hi Tim,
> > > >
> > > > I'd ideally like to see the work done in the Arrow C++ library so that
> > > > it can be utilized by all the C++ "binders" (Python, R, C, Ruby, MATLAB).
> > > > This also means a larger labor pool of individuals to help improve and
> > > > maintain the software. There was a stalled PR around this a while back
> > > > (check out the Arrow Closed PR queue) that got stuck on some limitations
> > > > in avro-c. It might be more expedient to fork parts of Apache Avro and
> > > > do all the development inside a single codebase.
> > > >
> > > > There's a lot of folks that can provide feedback should you choose to
> > > > go down this route.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Tue, Jun 11, 2019, 4:53 PM Tim Swast <sw...@google.com.invalid>
> > > > wrote:
> > > >
> > > > > Hi Arrow and Avro devs,
> > > > >
> > > > > I've been investigating some performance issues with the BigQuery
> > > > > Storage API (https://github.com/googleapis/google-cloud-python/issues/7805),
> > > > > and have identified that the vast majority of time is spent decoding
> > > > > Avro into pandas dataframes.
> > > > >
> > > > > I've done some initial experiments with hand-written parsers (inspired
> > > > > by https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a
> > > > > dramatic improvement in time spent parsing.
> > > > >
> > > > > I'm considering releasing this as a separate package for the
> > > > > following reasons:
> > > > >
> > > > > - Code generation + Numba is a bit of an unproven technique for
> > > > >   parsers, so I'd like to treat this as an experiment rather than
> > > > >   "the" package to use to parse Avro from Python.
> > > > > - I don't need to handle the full Avro spec for this experiment.
> > > > >   Importantly, BQ Storage API only uses a schemaless reader (since
> > > > >   the schema is output only once, and omitted for subsequent protobuf
> > > > >   messages) and doesn't use any compression.
> > > > >
> > > > > That said, I'm open to contributing this to either pyarrow or avro if
> > > > > there's interest.
> > > > >
> > > > > If the answer is "no" (as I suspect it is) and I don't contribute it
> > > > > now, the package will be clearly identified as a fork of the Apache
> > > > > Avro project and licensed Apache 2.0, so it should be easy to pull in
> > > > > once the techniques are proven.
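
For readers following the thread, here is a rough sketch of what "schemaless" decoding into columns can look like. It is not the Arrow or fastavro API: the names `ColumnarBuilder`, `append_record` and `append_blob` are hypothetical, plain Python lists stand in for Arrow array builders, and only three Avro primitive types (long, string, double) are handled. The byte layout follows the Avro binary encoding: longs are zigzag varints, strings are a long length followed by UTF-8 bytes, doubles are 8 little-endian bytes, and a record is just its fields concatenated in schema order.

```python
import struct


def read_long(buf, pos):
    """Decode one Avro long (zigzag varint) at pos; return (value, new_pos)."""
    shift = 0
    raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1), pos  # undo zigzag


def read_string(buf, pos):
    """Decode a length-prefixed UTF-8 string; return (value, new_pos)."""
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


def read_double(buf, pos):
    """Decode an 8-byte little-endian IEEE 754 double; return (value, new_pos)."""
    return struct.unpack_from("<d", buf, pos)[0], pos + 8


DECODERS = {"long": read_long, "string": read_string, "double": read_double}


class ColumnarBuilder:
    """Hypothetical builder: one list per field, standing in for Arrow builders."""

    def __init__(self, schema):
        # schema is a list of (field_name, avro_type) pairs, supplied out of band
        # because schemaless records carry no schema of their own.
        self.schema = schema
        self.columns = {name: [] for name, _ in schema}

    def append_record(self, buf, pos=0):
        """Decode one record starting at pos; return the position after it."""
        for name, avro_type in self.schema:
            value, pos = DECODERS[avro_type](buf, pos)
            self.columns[name].append(value)
        return pos

    def append_blob(self, buf):
        """Decode back-to-back records until the buffer is exhausted."""
        pos = 0
        while pos < len(buf):
            pos = self.append_record(buf, pos)


# Usage: two hand-encoded records for the schema (id: long, name: string, score: double).
schema = [("id", "long"), ("name", "string"), ("score", "double")]
rec1 = b"\x02" + b"\x02a" + struct.pack("<d", 0.5)   # id=1, name="a"
rec2 = b"\x03" + b"\x04bc" + struct.pack("<d", 1.5)  # id=-2, name="bc"

builder = ColumnarBuilder(schema)
builder.append_record(rec1)       # one serialized record at a time
builder.append_blob(rec1 + rec2)  # or a blob of concatenated records
print(builder.columns)
```

Both entry points share the same per-field loop, so supporting a whole blob and supporting incremental appends cost about the same to implement. In a real implementation the lists would presumably be replaced by typed builders that are finalized into a record batch.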