Hi,

I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly
contribute to the core to support that use-case, but I'm having trouble
determining the status of the C++ project.

Most HEP data is stored in the ROOT file format (
https://root.cern.ch/root/InputOutput.html), which represents complex,
nested, cross-referenced C++ objects with a columnar layout so that a
subset of fields can be individually read, individually compressed, and
quickly scanned. I believe Parquet can provide the same benefits, with the
additional advantage that it's a standard with a specification, so files
can be read and written from multiple languages. (Parquet can't be used as
a random-writable object database, but this feature of ROOT isn't widely
used.)

To convert between ROOT and Parquet, I would need to map ROOT's
"StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html)
onto a Logical Type Definition, on par with AvroRecordReader, but also
supporting pointer references (as an Int64 -> object map).

Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO)
states that this record abstraction, as well as nested schemas and
file-writing, haven't been implemented. However, the TODO is also 2 years
old, whereas I see a burst of activity on GitHub this year. Is the TODO out
of date?

Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/)
or elsewhere in San Francisco on August 15 or 16? If so, could we meet in
person so that we can talk in detail about where the hooks I'm looking for
are and how I can contribute? (Or *when* I should contribute, if there's a
major refactoring in the works.)

Thanks!
-- Jim
