Hi folks,

Spurred by the discussion and bugfix for PARQUET-799, I'd like to do
something about the IO interfaces that we currently have implemented
in parquet-cpp.

For C++ at least, the Parquet project is not an ideal place to be
maintaining cross-platform IO and memory management. There are
portability and concurrent access issues we will eventually need to
deal with to make parquet-cpp work well in diverse production
environments.

In parallel, we've been developing a general, low-overhead IO
subsystem inside Apache Arrow:

https://github.com/apache/arrow/tree/master/cpp/src/arrow/io

Since Arrow is about in-memory columnar data structures and efficient
IO / RPC / IPC, this is a much more appropriate place to maintain such
code (in the absence of a sort of "Apache C++ Commons" library).
There, we currently have more mature implementations of:

- Operating system files (which also work on Windows)
- Memory mapped files
- HDFS (using either libhdfs or libhdfs3, whichever you prefer)

Additionally, the "Buffer" abstraction (which handles memory lifetime
and provides a general-purpose way to pass around a block of memory
which may or may not be owned by the application) is implemented in
both Parquet [1] and Arrow [2].
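
To make the "may or may not be owned" point concrete, here is a minimal
sketch of such an abstraction. The class and function names (Buffer,
OwnedBuffer, SumBytes) are hypothetical and chosen for illustration;
this is not the actual code in [1] or [2], only the general shape of
the idea.

```cpp
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical sketch; not the real parquet-cpp or Arrow classes.
// Base class: a non-owning view over a block of memory. The caller is
// responsible for keeping the underlying bytes alive.
class Buffer {
 public:
  Buffer(const uint8_t* data, int64_t size) : data_(data), size_(size) {}
  virtual ~Buffer() = default;

  const uint8_t* data() const { return data_; }
  int64_t size() const { return size_; }

 protected:
  const uint8_t* data_;
  int64_t size_;
};

// Owning variant: the storage lives exactly as long as the buffer does.
class OwnedBuffer : public Buffer {
 public:
  explicit OwnedBuffer(std::vector<uint8_t> storage)
      : Buffer(nullptr, 0), storage_(std::move(storage)) {
    data_ = storage_.data();
    size_ = static_cast<int64_t>(storage_.size());
  }

 private:
  std::vector<uint8_t> storage_;
};

// A consumer takes a shared_ptr<Buffer> and does not need to know
// whether the bytes are owned by the buffer or by someone else.
int64_t SumBytes(const std::shared_ptr<Buffer>& buf) {
  int64_t total = 0;
  for (int64_t i = 0; i < buf->size(); ++i) {
    total += buf->data()[i];
  }
  return total;
}
```

Passing buffers around by shared_ptr is one way to handle lifetime: the
memory stays valid for as long as any consumer still holds a reference.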

Since, fundamentally, parquet-cpp is a library for encoding and
decoding the Parquet file format rather than for general-purpose IO /
file-like interfaces, I propose that we excise this code from the
library and make Arrow a hard dependency of libparquet. I believe our
respective developer communities would benefit from hardening the IO
and memory interfaces that are being developed in Arrow, and that
doing so will lead to higher-quality software and reduced
fragmentation.
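
As a rough illustration of the separation I have in mind, the sketch
below keeps format-level logic (checking the trailing "PAR1" magic
bytes of a Parquet file) dependent only on an abstract random-access
source. The names RandomAccessSource, InMemorySource, and
HasParquetFooterMagic are hypothetical; they are not the Arrow or
parquet-cpp API, just a sketch of the layering.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>

// Hypothetical IO interface; in the proposal, this layer would be
// provided by Arrow rather than maintained in parquet-cpp.
class RandomAccessSource {
 public:
  virtual ~RandomAccessSource() = default;
  virtual int64_t Size() const = 0;
  // Reads up to nbytes starting at offset into out; returns the number
  // of bytes actually read.
  virtual int64_t ReadAt(int64_t offset, int64_t nbytes,
                         uint8_t* out) const = 0;
};

// One possible implementation, backed by bytes held in memory.
class InMemorySource : public RandomAccessSource {
 public:
  explicit InMemorySource(std::string bytes) : bytes_(std::move(bytes)) {}

  int64_t Size() const override {
    return static_cast<int64_t>(bytes_.size());
  }

  int64_t ReadAt(int64_t offset, int64_t nbytes,
                 uint8_t* out) const override {
    if (offset < 0 || nbytes <= 0 || offset >= Size()) return 0;
    const int64_t n = std::min(nbytes, Size() - offset);
    std::memcpy(out, bytes_.data() + offset, static_cast<std::size_t>(n));
    return n;
  }

 private:
  std::string bytes_;
};

// Format-level logic: a Parquet file ends with the 4-byte magic "PAR1".
// This function knows nothing about where the bytes come from.
bool HasParquetFooterMagic(const RandomAccessSource& source) {
  constexpr int64_t kMagicSize = 4;
  if (source.Size() < kMagicSize) return false;
  uint8_t tail[kMagicSize];
  const int64_t n =
      source.ReadAt(source.Size() - kMagicSize, kMagicSize, tail);
  return n == kMagicSize && std::memcmp(tail, "PAR1", kMagicSize) == 0;
}
```

With this kind of layering, an OS file, a memory map, or an HDFS file
could be swapped in behind the same interface without touching the
format code.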

I wanted to bring this up as we are on the cusp of making the first
ASF release of parquet-cpp, and while this work might not make the cut
for 0.1, if we agree it's a good idea it would be good to do it sooner
rather than later.

Thanks and happy holidays / best wishes for 2017,
Wes

[1]: https://github.com/apache/parquet-cpp/blob/master/src/parquet/util/buffer.h
[2]: https://github.com/apache/arrow/blob/master/cpp/src/arrow/buffer.h
