I recently started working on a new set of value readers in Java that support hierarchical schemas. I ended up with some code that's a lot easier to read than the current Java version, and is slightly faster (at least for my Avro tests). It may be helpful for this work on the C++ side.
Here's the list reader implementation:
https://github.com/Netflix/iceberg/blob/parquet-value-readers/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetValueReaders.java#L172

rb

On Wed, Jan 17, 2018 at 9:41 AM, Wes McKinney <[email protected]> wrote:
> This work would only involve the Arrow interface in src/parquet/arrow
> (converting from Arrow representation to repetition/definition level
> encoding, and back), so you wouldn't need to master the whole Parquet
> codebase, at least. I'd like to help with this work, but realistically
> I won't have bandwidth for it until February or more likely March
> sometime.
>
> - Wes
>
> On Wed, Jan 17, 2018 at 10:11 AM, Jim Pivarski <[email protected]> wrote:
> > I also have a use-case that requires lists-of-structs and encountered
> > that limitation in pyarrow. Just one level deep would enable a lot of
> > HEP data.
> >
> > I've worked out the logic of converting Parquet definition and
> > repetition levels into Arrow-style arrays:
> >
> > https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
> > https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238
> >
> > which is subtle because record nullability and list lengths are
> > intertwined. (Repetition levels, by themselves, cannot encode empty
> > lists, so they do it through an interaction with definition levels.)
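> >
> > As a minimal illustration of that interaction (a simplified, hypothetical
> > sketch, not the oamap implementation), take a single optional list<int32>
> > column under the usual three-level list encoding, where definition level
> > 0 means the list itself is null, 1 means it is present but empty, 2 means
> > a list slot exists but holds a null element, and 3 means a non-null
> > element. An empty row and a null row each produce exactly one level entry
> > with repetition level 0; only the definition level tells them apart:
> >
> >     # Hypothetical helper, simplified to a single list column.
> >     def levels_to_arrow(rep_levels, def_levels, leaf_values):
> >         offsets, list_valid, values = [0], [], []
> >         leaves = iter(leaf_values)
> >         for rep, d in zip(rep_levels, def_levels):
> >             if rep == 0:                   # this entry starts a new top-level row
> >                 offsets.append(offsets[-1])
> >                 list_valid.append(d >= 1)  # definition level 0 = the list is null
> >             if d >= 2:                     # a list slot actually exists
> >                 offsets[-1] += 1
> >                 values.append(next(leaves) if d == 3 else None)
> >         return offsets, list_valid, values
> >
> >     # Rows [1, 2], [], None, [3]:
> >     print(levels_to_arrow([0, 1, 0, 0, 0], [3, 3, 1, 0, 3], [1, 2, 3]))
> >     # -> ([0, 2, 2, 2, 3], [True, True, False, True], [1, 2, 3])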
> >
> > I also have a suite of artificial samples that test combinations of
> > these features:
> >
> > https://github.com/diana-hep/oamap/tree/master/tests/samples
> >
> > It's hard for me to imagine diving into a new codebase (Parquet C++) and
> > adding this feature on my own, but I'd be willing to work with someone
> > who is familiar with it, knows which regions of the code need to be
> > changed, and can work in parallel with me remotely. The translation from
> > intertwined definition and repetition levels to Arrow's separate arrays
> > for each level of structure was not easy, and I'd like to spread this
> > knowledge now that my implementation seems to work.
> >
> > Anyone interested in teaming up?
> > -- Jim
> >
> > On Wed, Jan 10, 2018 at 7:36 PM, Wes McKinney <[email protected]> wrote:
> > > hi Andrei,
> > >
> > > We are in need of development assistance in the Parquet C++ project
> > > (https://github.com/apache/parquet-cpp) implementing complete support
> > > for reading and writing nested Arrow data. We only support simple
> > > structs (and structs of structs) and lists (and lists of lists) at the
> > > moment. It's something I'd like to get done in 2018 if no one else
> > > gets there first, but it isn't enough of a priority for me personally
> > > right now to guarantee any kind of timeline.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
> > > > We would like to use a combination of Arrow and Parquet to store
> > > > JSON-like hierarchical data. We are having trouble understanding how
> > > > to serialize it properly.
> > > >
> > > > Our current workflow:
> > > > 1. We create a hierarchical arrow::Schema.
> > > > 2. Then we create a matching arrow::RecordBatchBuilder (with
> > > >    arrow::RecordBatchBuilder::Make()), which is effectively a
> > > >    hierarchy of ArrayBuilders of various types.
> > > > 3. Then we serialize all our documents one by one into the
> > > >    RecordBatchBuilder by walking simultaneously through the document
> > > >    and the ArrayBuilder hierarchies.
> > > > 5. Then we convert the resulting RecordBatch to a Table and try to
> > > >    save it to a Parquet file with
> > > >    parquet::arrow::FileWriter::WriteTable().
> > > >
> > > > At this point serialization fails with the error "Invalid: Nested
> > > > column branch had multiple children". We also tried to skip the Table
> > > > conversion and save the root column (a StructArray) directly with
> > > > parquet::arrow::FileWriter::WriteColumnChunk, with the same result.
> > > >
> > > > Looking at the writer.cc code, it seems to expect a flat list of
> > > > columns. So there should be a step #4 that converts a hierarchical
> > > > RecordBatch into a flat RecordBatch. For example, a hierarchical
> > > > schema such as
> > > >
> > > >     struct {
> > > >       struct {
> > > >         int64;
> > > >         list {
> > > >           string;
> > > >         }
> > > >       }
> > > >       float;
> > > >     }
> > > >
> > > > should be flattened into a flat schema consisting of three top-level
> > > > fields:
> > > >
> > > >     struct {
> > > >       struct {
> > > >         int64;
> > > >       }
> > > >     },
> > > >     struct {
> > > >       struct {
> > > >         list {
> > > >           string;
> > > >         }
> > > >       }
> > > >     },
> > > >     struct {
> > > >       float;
> > > >     }
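> > > >
> > > > At the schema level, such a step #4 could look roughly like the
> > > > sketch below (illustrative only; the `flatten` helper is hypothetical
> > > > and written against a recent pyarrow rather than the C++ API, and the
> > > > hard part, rewriting the data itself with the right
> > > > definition/repetition levels, is not shown). It walks the nested type
> > > > and rebuilds one single-leaf nested field per leaf column:
> > > >
> > > >     import pyarrow as pa
> > > >
> > > >     # Hypothetical sketch: enumerate the leaf columns of a nested field
> > > >     # and rebuild one single-leaf nested field per leaf.
> > > >     def flatten(field):
> > > >         t = field.type
> > > >         if pa.types.is_struct(t):
> > > >             children = [t.field(i) for i in range(t.num_fields)]
> > > >             return [pa.field(field.name, pa.struct([leaf]))
> > > >                     for child in children
> > > >                     for leaf in flatten(child)]
> > > >         if pa.types.is_list(t):
> > > >             return [pa.field(field.name, pa.list_(leaf))
> > > >                     for leaf in flatten(t.value_field)]
> > > >         return [field]
> > > >
> > > >     nested = pa.field("root", pa.struct([
> > > >         pa.field("inner", pa.struct([
> > > >             pa.field("a", pa.int64()),
> > > >             pa.field("b", pa.list_(pa.string())),
> > > >         ])),
> > > >         pa.field("c", pa.float32()),
> > > >     ]))
> > > >
> > > >     for leaf_column in flatten(nested):
> > > >         print(leaf_column)  # three single-leaf struct columns, as above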
> > > >
> > > > I am curious whether we are going in the right direction. If yes, do
> > > > we need to write the converter manually, or is there any existing
> > > > code that does that?
> > > >
> > > > We use master::HEAD versions of Arrow and Parquet.

--
Ryan Blue
Software Engineer
Netflix