I also have a use case that requires lists of structs, and I encountered that limitation in pyarrow. Support for even one level of nesting would enable a lot of HEP data.
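For concreteness, the kind of thing I'm trying to write looks roughly like the snippet below (the column and field names are just my illustration, and the exact behavior may differ between versions, but on the versions I tried the Parquet write step is where I hit the same "Nested column branch had multiple children" error that Andrei reports below):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # One level of nesting: each "event" is a list of particle structs.
    particles_type = pa.list_(pa.struct([("pt", pa.float64()),
                                         ("eta", pa.float64())]))
    particles = pa.array(
        [[{"pt": 10.0, "eta": 0.5}, {"pt": 20.0, "eta": -1.2}],  # two particles
         [],                                                     # empty event
         [{"pt": 5.5, "eta": 2.1}]],
        type=particles_type)

    table = pa.Table.from_arrays([particles], names=["particles"])
    pq.write_table(table, "events.parquet")  # the step that fails for me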
I've worked out the logic of converting Parquet definition and repetition levels into Arrow-style arrays:

https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238

This is subtle because record nullability and list lengths are intertwined. (Repetition levels by themselves cannot encode empty lists, so empty lists are expressed through an interaction with definition levels; I've appended a small worked example below the quoted thread.) I also have a suite of artificial samples that test combinations of these features:

https://github.com/diana-hep/oamap/tree/master/tests/samples

It's hard for me to imagine diving into a new codebase (Parquet C++) and adding this feature on my own, but I'd be willing to work with someone who is familiar with it, knows which regions of the code need to be changed, and can work in parallel with me remotely. The translation from intertwined definition and repetition levels to Arrow's separate arrays for each level of structure was not easy, and I'd like to spread this knowledge now that my implementation seems to work.

Anyone interested in teaming up?

-- Jim

On Wed, Jan 10, 2018 at 7:36 PM, Wes McKinney <[email protected]> wrote:
> hi Andrei,
>
> We are in need of development assistance in the Parquet C++ project
> (https://github.com/apache/parquet-cpp) implementing complete support
> for reading and writing nested Arrow data. We only support simple
> structs (and structs of structs) and lists (and lists of lists) at the
> moment. It's something I'd like to get done in 2018 if no one else
> gets there first, but it isn't enough of a priority for me personally
> right now to guarantee any kind of timeline.
>
> Thanks
> Wes
>
> On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
> > We would like to use a combination of Arrow and Parquet to store JSON-like
> > hierarchical data. We have a problem understanding how to properly
> > serialize it.
> >
> > Our current workflow:
> > 1. We create a hierarchical arrow::Schema.
> > 2. Then we create a matching arrow::RecordBatchBuilder (with
> >    arrow::RecordBatchBuilder::Make()), which effectively is a hierarchy of
> >    ArrayBuilders of various types.
> > 3. Then we serialize all our documents one by one into the RecordBatchBuilder
> >    by walking simultaneously through the document and ArrayBuilder hierarchies.
> > 5. Then we convert the resulting RecordBatch to a Table and try to save it to a
> >    Parquet file with parquet::arrow::FileWriter::WriteTable().
> >
> > But at this point serialization fails with the error "Invalid: Nested column
> > branch had multiple children". We also tried to skip the conversion to a Table
> > and save the root column (a StructArray) directly with
> > parquet::arrow::FileWriter::WriteColumnChunk, with the same result.
> >
> > Looking at the writer.cc code, it seems to expect a flat list of columns.
> > So there should be a step #4 that converts a hierarchical RecordBatch to a flat
> > RecordBatch. For example, this hierarchical schema
> >
> > struct {
> >   struct {
> >     int64;
> >     list {
> >       string;
> >     }
> >   }
> >   float;
> > }
> >
> > should be flattened into this flat schema consisting of three top-level fields:
> >
> > struct {
> >   struct {
> >     int64;
> >   }
> > },
> > struct {
> >   struct {
> >     list {
> >       string;
> >     }
> >   }
> > },
> > struct {
> >   float;
> > }
> >
> > I am curious whether we are going in the right direction. If yes, do we need
> > to write the converter manually, or is there existing code that does that?
> >
> > We use the master::HEAD versions of Arrow and Parquet.
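P.S. To make the point about empty lists concrete, here is a toy sketch of how repetition/definition levels turn into Arrow offsets for the simplest case: a required list of required int64 values, so max repetition level 1 and max definition level 1. It's an illustration of the idea, not a copy of the oamap code linked above (which also has to handle nullable records, nullable elements, and deeper nesting).

    import pyarrow as pa

    def levels_to_list_array(rep_levels, def_levels, values, max_def=1):
        """Rebuild a ListArray from flat Parquet-style levels.

        rep == 0 starts a new list; rep == 1 continues the current one.
        def < max_def means the entry carries no value (here: an empty list),
        which is why repetition levels alone cannot represent empty lists.
        """
        offsets = [0]
        flat = []
        value_iter = iter(values)
        for rep, d in zip(rep_levels, def_levels):
            if rep == 0:          # a new top-level record begins
                offsets.append(offsets[-1])
            if d == max_def:      # a real value is present at this slot
                flat.append(next(value_iter))
                offsets[-1] += 1
            # d < max_def with rep == 0: the record exists but its list is
            # empty, so the offset we just appended stays where it is.
        return pa.ListArray.from_arrays(pa.array(offsets, pa.int32()),
                                        pa.array(flat, pa.int64()))

    # [[1, 2], [], [3]] encoded as levels: the empty list shows up only as a
    # definition level below the maximum, with no corresponding value.
    arr = levels_to_list_array(rep_levels=[0, 1, 0, 0],
                               def_levels=[1, 1, 0, 1],
                               values=[1, 2, 3])
    print(arr.to_pylist())   # [[1, 2], [], [3]]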
