This work would only involve the Arrow interface in src/parquet/arrow (converting from Arrow representation to repetition/definition level encoding, and back), so you wouldn't need to master the whole Parquet codebase, at least. I'd like to help with this work, but realistically I won't have bandwidth for it until February or more likely March sometime.
- Wes

On Wed, Jan 17, 2018 at 10:11 AM, Jim Pivarski <[email protected]> wrote:
> I also have a use-case that requires lists-of-structs and encountered that
> limitation in pyarrow. Just one level deep would enable a lot of HEP data.
>
> I've worked out the logic of converting Parquet definition and repetition
> levels into Arrow-style arrays:
>
> https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
> https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238
>
> which is subtle because record nullability and list lengths are
> intertwined. (Repetition levels, by themselves, cannot encode empty lists,
> so they do it through an interaction with definition levels.) I also have a
> suite of artificial samples that test combinations of these features:
>
> https://github.com/diana-hep/oamap/tree/master/tests/samples
>
> It's hard for me to imagine diving into a new codebase (Parquet C++) and
> adding this feature on my own, but I'd be willing to work with someone who
> is familiar with it, knows which regions of the code need to be changed,
> and can work in parallel with me remotely. The translation from intertwined
> definition and repetition levels to Arrow's separate arrays for each level
> of structure was not easy, and I'd like to spread this knowledge now that
> my implementation seems to work.
>
> Anyone interested in teaming up?
> -- Jim
>
> On Wed, Jan 10, 2018 at 7:36 PM, Wes McKinney <[email protected]> wrote:
>> hi Andrei,
>>
>> We are in need of development assistance in the Parquet C++ project
>> (https://github.com/apache/parquet-cpp) implementing complete support
>> for reading and writing nested Arrow data. We only support simple
>> structs (and structs of structs) and lists (and lists of lists) at the
>> moment. It's something I'd like to get done in 2018 if no one else
>> gets there first, but it isn't enough of a priority for me personally
>> right now to guarantee any kind of timeline.
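[Editor's note: the following is not the oamap code linked above, just a minimal illustrative sketch of the interaction Jim describes, for a single nullable list<int64> column with required elements (max definition level 2, max repetition level 1). Note how the null list and the empty list produce identical repetition levels and no values; only the definition level (0 vs 1) tells them apart.]

```python
def levels_to_arrow(rep_levels, def_levels, values):
    """Decode Parquet rep/def levels for a nullable list<int64> with
    required elements into Arrow-style offsets, a validity list, and
    a flat value array. def=0: null list, def=1: empty list,
    def=2: element present; rep=0 starts a new list, rep=1 continues it."""
    offsets = [0]
    validity = []
    out_values = []
    vi = 0  # index into the flat, non-null value stream
    for r, d in zip(rep_levels, def_levels):
        if r == 0:  # a new top-level list begins here
            if d == 0:  # the list itself is null
                validity.append(False)
                offsets.append(offsets[-1])
                continue
            validity.append(True)
            if d == 1:  # present but empty: no element follows
                offsets.append(offsets[-1])
                continue
            offsets.append(offsets[-1] + 1)  # first element of this list
        else:  # r == 1: another element of the current list
            offsets[-1] += 1
        out_values.append(values[vi])
        vi += 1
    return offsets, validity, out_values


# Records: [1, 2], NULL, [], [3]
rep = [0, 1, 0, 0, 0]
defs = [2, 2, 0, 1, 2]
vals = [1, 2, 3]
print(levels_to_arrow(rep, defs, vals))
# offsets [0, 2, 2, 2, 3], validity [True, False, True, True], values [1, 2, 3]
```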
>>
>> Thanks
>> Wes
>>
>> On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
>> > We would like to use a combination of Arrow and Parquet to store
>> > JSON-like hierarchical data. We have a problem understanding how to
>> > properly serialize it.
>> >
>> > Our current workflow:
>> > 1. We create a hierarchical arrow::Schema.
>> > 2. Then we create a matching arrow::RecordBatchBuilder (with
>> > arrow::RecordBatchBuilder::Make()), which effectively is a hierarchy
>> > of ArrayBuilders of various types.
>> > 3. Then we serialize all our documents one by one into the
>> > RecordBatchBuilder by walking simultaneously through the document and
>> > ArrayBuilder hierarchies.
>> > 5. Then we convert the resulting RecordBatch to a Table and try to
>> > save it to a parquet file with parquet::arrow::FileWriter::WriteTable().
>> >
>> > But at this moment serialization fails with the error "Invalid: Nested
>> > column branch had multiple children". We also tried to avoid converting
>> > to a Table and to save the root column (StructArray) directly with
>> > parquet::arrow::FileWriter::WriteColumnChunk, with the same result.
>> >
>> > Looking at the writer.cc code, it seems to expect a flat list of
>> > columns. So, there should be a step #4 that converts a hierarchical
>> > RecordBatch to a flat RecordBatch. For example, a hierarchical schema
>> > such as
>> >
>> > struct {
>> >   struct {
>> >     int64;
>> >     list {
>> >       string;
>> >     }
>> >   }
>> >   float;
>> > }
>> >
>> > should be flattened into a flat schema consisting of three top-level
>> > fields:
>> >
>> > struct {
>> >   struct {
>> >     int64;
>> >   }
>> > },
>> > struct {
>> >   struct {
>> >     list {
>> >       string;
>> >     }
>> >   }
>> > },
>> > struct {
>> >   float;
>> > }
>> >
>> > I am curious whether we are going in the right direction. If yes, do
>> > we need to write the converter manually, or is there existing code
>> > that does that?
>> >
>> > We use master::HEAD versions of Arrow and Parquet.
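[Editor's note: the "step #4" Andrei proposes, splitting a nested schema into one single-branch schema per leaf, can be sketched in a few lines. This is a toy model, not Arrow or Parquet API: a schema is either a type-name string (leaf), ("struct", [(name, child), ...]), or ("list", child), and the field names below ("inner", "i", "lst", "f") are hypothetical.]

```python
def flatten(schema):
    """Split a nested schema into one single-branch schema per leaf,
    mirroring the one-leaf-column-per-branch layout a flat writer expects."""
    if isinstance(schema, str):  # a leaf type
        return [schema]
    kind = schema[0]
    if kind == "struct":
        # Each leaf under each struct field becomes its own branch,
        # re-wrapped in a single-child struct carrying that field's name.
        return [("struct", [(name, branch)])
                for name, child in schema[1]
                for branch in flatten(child)]
    if kind == "list":
        return [("list", branch) for branch in flatten(schema[1])]
    raise ValueError("unknown schema node: %r" % (kind,))


# The schema from the email: struct { struct { int64; list { string; } } float; }
nested = ("struct", [
    ("inner", ("struct", [
        ("i", "int64"),
        ("lst", ("list", "string")),
    ])),
    ("f", "float"),
])

for branch in flatten(nested):
    print(branch)
# Yields three single-branch schemas, matching the three top-level
# fields shown in the email above.
```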
