hi all,

Is there any feedback on this discussion? We are interested in beginning to move forward on this project.
Thanks,
Wes

On Tue, Jun 18, 2019 at 11:12 AM Wes McKinney <[email protected]> wrote:
>
> hi folks,
>
> (cross-posting to dev@avro and dev@arrow -- please subscribe to both
> mailing lists to participate in the thread)
>
> In the Apache Arrow community, we are striving to develop optimized,
> batch-oriented C++ interfaces to read and write various open standard
> file formats, such as:
>
> - Parquet
> - ORC
> - CSV
> - Line-delimited JSON
> - Avro
>
> The Arrow C++ codebase has been co-developed with the Parquet C++
> codebase by many of the same individuals, so that is our most mature
> implementation, but we also have ORC, CSV, and JSON support in various
> states of maturity and performance.
>
> Since Arrow is a columnar format, the intention is to work with a
> batch of records at a time, such as 64K records or so -- efficient
> deserialization into a columnar batch requires a certain design
> approach that general-purpose libraries cannot always easily
> accommodate.
>
> There is interest in working on Avro support, so we (primarily
> Micah Kornfield, though I've been eyeing the project myself for some
> time) have been investigating approaches to the project that are
> pragmatic and likely to yield good results. Some options to consider:
>
> * A new designed-for-Arrow Avro implementation in C++
> * Using avro-c as a library and contributing patches upstream
> * Using avro-c++ as a library and contributing patches upstream
> * Forking avro-c or avro-c++ and modifying it at will for use in Apache Arrow
>
> The intended users for this software are not only C++ developers but
> also languages that bind the C++ libraries, including Python, R, Ruby,
> and MATLAB.
> So this software is of high importance to very large
> programmer communities -- currently the quality (in terms of
> performance or usability) of Avro software in these languages is
> relatively poor (consider, for instance, that there are no fewer than
> four Avro libraries for Python: avro, fastavro, uavro, and cyavro).
>
> Our current inclination is that forking avro-c++ into the Arrow
> codebase is the preferred approach, for a number of reasons:
>
> * We are already using C++11, so using C++ as a starting point is
>   preferable to C.
> * Decoupling from Apache Avro release cycles: Arrow is about to have
>   its 14th major release in a little over 3 years -- our release
>   cadence is approximately every 2 to 3 months. It also spares us
>   having to manage Avro as a third-party build dependency.
> * Freedom to refactor serialization and deserialization paths to
>   feature Arrow-specific optimizations and batch-centric APIs.
> * Desire to remove Avro-specific memory management and IO interfaces
>   and use the common Arrow ones (also used in Parquet C++ and the
>   Arrow-centric CSV and JSON libraries).
> * Interest in developing Arrow-centric LLVM code generation for
>   optimized decoding of records.
>
> We understand that forking a codebase is not a decision to be
> undertaken flippantly, so we would like to collect feedback from the
> Avro community, and its C++ developers in particular, about this
> project, which is currently at the "codebase import" stage [1].
>
> To head off one possible question: I do not think that developing
> Arrow specializations _inside_ apache/avro is a desirable option, as
> it would introduce a circular dependency between codebases, given that
> we wish to develop bindings for Avro+Arrow in Python, R, Ruby, etc.
> (these are found in apache/arrow). We did this for more than 2 years
> with Parquet in apache/parquet-cpp, and the development process (CI,
> testing, packaging) was deeply unpleasant for Arrow and Parquet alike.
>
> Thank you,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/4585
