Hi, I'd like to use parquet-cpp for High Energy Physics (HEP) and possibly contribute to the core to support that use-case, but I'm having trouble determining the status of the C++ project.

Most HEP data is stored in the ROOT file format (https://root.cern.ch/root/InputOutput.html), which represents complex, nested, cross-referenced C++ objects in a columnar layout so that a subset of fields can be individually read, individually compressed, and quickly scanned. I believe Parquet can provide these same benefits, with the additional advantage that it's a standard with a specification, so files can be read or written in multiple languages. (Parquet can't be used as a random-writable object database, but this feature of ROOT isn't widely used.)

To convert between ROOT and Parquet, I would need to map ROOT's "StreamerInfo" object schema (https://root.cern.ch/root/SchemaEvolution.html) onto a Logical Type Definition, on par with AvroRecordReader, but also supporting pointer references (as an Int64 -> object map).

Parquet C++'s TODO (https://github.com/apache/parquet-cpp/blob/master/TODO) states that this record abstraction, nested schemas, and file-writing haven't been implemented. However, the TODO is also 2 years old, whereas I see a burst of activity this year on GitHub. Is the TODO out of date?

Will any of the core developers be at KDD16 (http://www.kdd.org/kdd2016/) or elsewhere in San Francisco on August 15 or 16? If so, could we meet in person to talk in detail about where the hooks I'm looking for are and how I can contribute? (Or *when* I should contribute, if there's a major refactoring in the works.)

Thanks!
-- Jim
