We would like to use a combination of Arrow and Parquet to store JSON-like
hierarchical data, but we are having trouble understanding how to serialize
it properly.
Our current workflow:
1. We create a hierarchical arrow::Schema.
2. Then we create a matching arrow::RecordBatchBuilder (with
arrow::RecordBatchBuilder::Make()), which is effectively a hierarchy of
ArrayBuilders of various types.
3. Then we serialize all our documents one by one into the
RecordBatchBuilder by walking the document and ArrayBuilder hierarchies
simultaneously.
5. Then we convert the resulting RecordBatch to a Table and try to save it
to a Parquet file with parquet::arrow::FileWriter::WriteTable(). (A
condensed sketch of these steps follows this list.)
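
For concreteness, here is a minimal sketch of steps 1-3 and 5 with a single
two-field struct column. All schema and field names (doc, id, score) are
invented for illustration, and exact signatures (Status out-parameters vs.
arrow::Result) may differ depending on the Arrow revision:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteNestedDocs() {
  // Step 1: a hierarchical schema -- one struct column with two children.
  auto doc_type = arrow::struct_({arrow::field("id", arrow::int64()),
                                  arrow::field("score", arrow::float32())});
  auto schema = arrow::schema({arrow::field("doc", doc_type)});

  // Step 2: a RecordBatchBuilder wrapping the matching builder hierarchy.
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  ARROW_RETURN_NOT_OK(arrow::RecordBatchBuilder::Make(
      schema, arrow::default_memory_pool(), &builder));

  // Step 3: append one "document" by walking the builder hierarchy.
  auto* doc_builder = builder->GetFieldAs<arrow::StructBuilder>(0);
  auto* id_builder =
      static_cast<arrow::Int64Builder*>(doc_builder->field_builder(0));
  auto* score_builder =
      static_cast<arrow::FloatBuilder*>(doc_builder->field_builder(1));
  ARROW_RETURN_NOT_OK(doc_builder->Append());
  ARROW_RETURN_NOT_OK(id_builder->Append(42));
  ARROW_RETURN_NOT_OK(score_builder->Append(0.5f));

  // Step 5: flush to a RecordBatch, wrap it in a Table, write to Parquet.
  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(builder->Flush(&batch));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(arrow::Table::FromRecordBatches({batch}, &table));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(
      arrow::io::FileOutputStream::Open("/tmp/docs.parquet", &sink));
  // This is the call that fails for us on nested struct columns.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    sink, /*chunk_size=*/1024);
}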
But at this point serialization fails with the error "Invalid: Nested
column branch had multiple children". We also tried to skip the conversion
to a Table and save the root column (a StructArray) directly with
parquet::arrow::FileWriter::WriteColumnChunk(), with the same result.
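
The WriteColumnChunk variant looked roughly like this (a sketch under the
same caveat about signatures; schema, batch, and sink would come from the
sketch above):

arrow::Status WriteRootColumnDirectly(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *schema, arrow::default_memory_pool(), sink,
      parquet::default_writer_properties(), &writer));
  ARROW_RETURN_NOT_OK(writer->NewRowGroup(batch->num_rows()));
  // Fails the same way: "Invalid: Nested column branch had multiple
  // children".
  ARROW_RETURN_NOT_OK(writer->WriteColumnChunk(*batch->column(0)));
  return writer->Close();
}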
Judging by the code in writer.cc, the writer seems to expect a flat list of
columns. So there should be a step #4 that converts the hierarchical
RecordBatch into a flat one. For example, consider this hierarchical
schema:
struct {
    struct {
        int64;
        list {
            string;
        }
    }
    float;
}
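
In Arrow C++ terms this schema would be declared roughly as follows (all
field names here are placeholders we made up; only the shape matters):

#include <arrow/api.h>

std::shared_ptr<arrow::Schema> MakeHierarchicalSchema() {
  // struct { struct { int64; list { string; } } float; }
  auto inner = arrow::struct_({
      arrow::field("num", arrow::int64()),
      arrow::field("tags", arrow::list(arrow::utf8())),
  });
  auto outer = arrow::struct_({
      arrow::field("inner", inner),
      arrow::field("val", arrow::float32()),
  });
  return arrow::schema({arrow::field("outer", outer)});
}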
It should be flattened into a schema consisting of three top-level fields,
one per leaf:
struct {
    struct {
        int64;
    }
},
struct {
    struct {
        list {
            string;
        }
    }
},
struct {
    float;
}
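
Declared explicitly, the flattened schema would look roughly like this
(same placeholder names as above; the three top-level fields share a name
because each one keeps its original struct path):

#include <arrow/api.h>

std::shared_ptr<arrow::Schema> MakeFlattenedSchema() {
  // One top-level column per leaf, each wrapped in its original path.
  auto num_branch = arrow::struct_({arrow::field(
      "inner", arrow::struct_({arrow::field("num", arrow::int64())}))});
  auto tags_branch = arrow::struct_({arrow::field(
      "inner",
      arrow::struct_({arrow::field("tags", arrow::list(arrow::utf8()))}))});
  auto val_branch = arrow::struct_({arrow::field("val", arrow::float32())});
  return arrow::schema({arrow::field("outer", num_branch),
                        arrow::field("outer", tags_branch),
                        arrow::field("outer", val_branch)});
}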
I am curious whether we are going in the right direction. If so, do we need
to write the converter manually, or is there existing code that does this?
We use the master (HEAD) versions of Arrow and Parquet.