hi all,

Is there any feedback on this discussion? We are interested in beginning to move forward on this project.
Thanks,
Wes

On Tue, Jun 18, 2019 at 11:12 AM Wes McKinney <[email protected]> wrote:
>
> hi folks,
>
> (cross-posting to dev@avro and dev@arrow -- please subscribe to both
> mailing lists to participate in the thread)
>
> In the Apache Arrow community, we are striving to develop optimized,
> batch-oriented C++ interfaces to read and write various open standard
> file formats, such as:
>
> - Parquet
> - ORC
> - CSV
> - Line-delimited JSON
> - Avro
>
> The Arrow C++ codebase has been co-developed with the Parquet C++
> codebase by many of the same individuals, so that is our most mature
> implementation, but we also have ORC, CSV, and JSON support in various
> states of maturity and performance.
>
> Since Arrow is a columnar format, the intention is to work with a
> batch of records at a time, such as 64K records or so -- efficient
> deserialization into a columnar batch requires a certain design
> approach that general-purpose libraries cannot always easily
> accommodate.
>
> There is interest in working on Avro support, so we (primarily
> Micah Kornfield, though I've been eyeing the project myself for some
> time) have been investigating approaches to the project that are
> pragmatic and likely to yield good results. Some options to consider:
>
> * A new designed-for-Arrow Avro implementation in C++
> * Using avro-c as a library and contributing patches upstream
> * Using avro-c++ as a library and contributing patches upstream
> * Forking avro-c or avro-c++ and modifying it at will for use in Apache Arrow
>
> The intended users for this software are not only C++ developers but
> also languages that bind the C++ libraries, including Python, R, Ruby,
> and MATLAB.
> So this software is of high importance to very large
> programmer communities -- currently the quality (in terms of
> performance or usability) of Avro software in these languages is
> relatively poor (consider, for instance, that there are no fewer than
> four Avro libraries for Python: avro, fastavro, uavro, and cyavro).
>
> Our current inclination is that forking avro-c++ into the Arrow
> codebase is the preferred approach, for a number of reasons:
>
> * We are already using C++11, so using C++ as a starting point is
>   preferable to C.
> * Decoupling from Apache Avro release cycles: Arrow is about to have
>   its 14th major release in a little over 3 years -- our release
>   cadence is approximately every 2 to 3 months. It also spares us
>   having to manage Avro as a third-party build dependency.
> * Freedom to refactor serialization and deserialization paths to
>   feature Arrow-specific optimizations and batch-centric APIs.
> * Desire to remove Avro-specific memory management and IO interfaces
>   and use the common Arrow ones (also used in Parquet C++ and the
>   Arrow-centric CSV and JSON libraries).
> * Interest in developing Arrow-centric LLVM code generation for
>   optimized decoding of records.
>
> We understand that forking a codebase is not a decision to be
> undertaken flippantly, so we would like to collect feedback from the
> Avro community, and its C++ developers in particular, about this
> project, which is currently at the "codebase import" stage [1].
>
> To head off one possible question: I do not think that developing
> Arrow specializations _inside_ apache/avro is a desirable option, as
> it would introduce a circular dependency between codebases, given that
> we wish to develop bindings for Avro+Arrow in Python, R, Ruby, etc.
> (these are found in apache/arrow). We did this for more than 2 years
> with Parquet in apache/parquet-cpp, and the development process (CI,
> testing, packaging) was deeply unpleasant for Arrow and Parquet alike.
>
> Thank you,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/4585
