Hi Tim,

I think the reader API should support deserializing a blob of
schemaless Avro records into an Arrow record batch, or even feeding
the reader one serialized record at a time to build a record batch
incrementally.
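
To illustrate the incremental case, here is a minimal stdlib-only Python
sketch. It is not the Arrow or Avro reader API: ColumnBatchBuilder and its
methods are hypothetical names, only the long, double, and string Avro
types are handled, and the schema is assumed to be supplied out of band as
a list of (name, type) pairs. Records are decoded one at a time, or from a
blob of back-to-back records, and appended to per-column lists.

```python
import io
import struct


def read_long(buf):
    """Decode one Avro long: a zigzag-encoded, base-128 varint."""
    accum = 0
    shift = 0
    while True:
        raw = buf.read(1)
        if not raw:
            raise EOFError("truncated Avro data")
        accum |= (raw[0] & 0x7F) << shift
        if not raw[0] & 0x80:  # high bit clear marks the last byte
            break
        shift += 7
    return (accum >> 1) ^ -(accum & 1)  # undo zigzag encoding


def read_string(buf):
    """Decode an Avro string: a long byte length, then UTF-8 bytes."""
    length = read_long(buf)
    return buf.read(length).decode("utf-8")


def read_double(buf):
    """Decode an Avro double: 8 bytes, little-endian IEEE 754."""
    return struct.unpack("<d", buf.read(8))[0]


DECODERS = {"long": read_long, "string": read_string, "double": read_double}


class ColumnBatchBuilder:
    """Hypothetical builder that accumulates decoded fields by column."""

    def __init__(self, fields):
        self.fields = fields  # list of (name, avro_type), in schema order
        self.columns = {name: [] for name, _ in fields}
        self.num_rows = 0

    def _append_from(self, buf):
        # A schemaless record is just its fields, concatenated in order.
        for name, avro_type in self.fields:
            self.columns[name].append(DECODERS[avro_type](buf))
        self.num_rows += 1

    def append_record(self, data):
        """Append a single serialized record."""
        self._append_from(io.BytesIO(data))

    def append_blob(self, data):
        """Append back-to-back records until the blob is exhausted."""
        buf = io.BytesIO(data)
        while buf.tell() < len(data):
            self._append_from(buf)


builder = ColumnBatchBuilder([("id", "long"), ("name", "string")])
builder.append_record(b"\x02\x02a")             # id=1, name="a"
builder.append_blob(b"\x03\x04bc\x80\x01\x00")  # id=-2, name="bc"; id=64, name=""
print(builder.num_rows, builder.columns)
```

The per-column lists could then be handed to an Arrow constructor such as
pyarrow.RecordBatch.from_arrays, or a C++ implementation could append
directly into Arrow array builders instead of Python lists.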

- Wes

On Wed, Jun 12, 2019 at 1:25 PM Tim Swast <sw...@google.com.invalid> wrote:
>
> > Let me know if you want to collaborate on it.
>
> Thanks Micah.
>
> What are your thoughts on reading schemaless Avro bytes? One of the reasons
> I have started experimenting with the fork is that fastavro had trouble
> reading more than one row at a time from a schemaless reader.
>
> Tim Swast
> Software Friendliness Engineer
> Google Cloud Developer Relations
> Seattle, WA, USA
>
>
> On Tue, Jun 11, 2019 at 10:29 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> > Hi Tim,
> > Avro support in C++ has been on my backlog for a while.  I'm going to
> > try to take the first few steps towards this over the next couple of days.
> > Let me know if you want to collaborate on it.  C++ is a lot nicer now than
> > it was 8 years ago :)
> >
> > Cheers,
> > Micah
> >
> > On Tue, Jun 11, 2019 at 6:40 PM Tim Swast <sw...@google.com.invalid> wrote:
> >
> > > Thanks for the advice, Wes.
> > >
> > > Unfortunately, I am about 8 years out of practice at writing any C++
> > > (which was part of the appeal of numba to me). Sounds like I should
> > > refresh my skills. I like the idea of write once, have good performance
> > > everywhere.
> > >
> > > On Tue, Jun 11, 2019 at 3:40 PM Wes McKinney <wesmck...@gmail.com> wrote:
> > >
> > > > Hi Tim,
> > > >
> > > > I'd ideally like to see the work done in the Arrow C++ library so that
> > > > it can be utilized by all the C++ "binders" (Python, R, C, Ruby,
> > > > MATLAB). This also means a larger labor pool of individuals to help
> > > > improve and maintain the software. There was a stalled PR around this
> > > > a while back (check out the Arrow closed PR queue) that got stuck on
> > > > some limitations in avro-c. It might be more expedient to fork parts
> > > > of Apache Avro and do all the development inside a single codebase.
> > > >
> > > > There are a lot of folks who can provide feedback should you choose to
> > > > go down this route.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Tue, Jun 11, 2019, 4:53 PM Tim Swast <sw...@google.com.invalid> wrote:
> > > >
> > > > > Hi Arrow and Avro devs,
> > > > >
> > > > > I've been investigating some performance issues with the BigQuery
> > > > > Storage API
> > > > > (https://github.com/googleapis/google-cloud-python/issues/7805), and
> > > > > have identified that the vast majority of time is spent decoding Avro
> > > > > into pandas dataframes. I've done some initial experiments with
> > > > > hand-written parsers (inspired by
> > > > > https://techblog.rtbhouse.com/2017/04/18/fast-avro/) and have seen a
> > > > > dramatic improvement in time spent parsing.
> > > > >
> > > > > I'm considering releasing this as a separate package for the
> > > > > following reasons:
> > > > >
> > > > >    - Code generation + Numba is a bit of an unproven technique for
> > > > >    parsers, so I'd like to treat this as an experiment rather than
> > > > >    "the" package to use to parse Avro from Python.
> > > > >    - I don't need to handle the full Avro spec for this experiment.
> > > > >    Importantly, BQ Storage API only uses a schemaless reader (since
> > > > >    the schema is output only once, and omitted for subsequent
> > > > >    protobuf messages) and doesn't use any compression.
> > > > >
> > > > > That said, I'm open to contributing this to either pyarrow or avro if
> > > > > there's interest.
> > > > >
> > > > > If the answer is "no" (as I suspect it is) and I don't contribute it
> > > > > now, the package will be clearly identified as a fork of the Apache
> > > > > Avro project and licensed Apache 2.0, so it should be easy to pull in
> > > > > once the techniques are proven.
> > > > >
> > > > > Tim Swast
> > > > > Software Friendliness Engineer
> > > > > Google Cloud Developer Relations
> > > > > Seattle, WA, USA
> > > > >
> > > >
> > > --
> > > Tim Swast
> > > Software Friendliness Engineer
> > > Google Cloud Developer Relations
> > > Seattle, WA, USA
> > >
> >
