We would like to use a combination of Arrow and Parquet to store JSON-like
hierarchical data, but we are having trouble understanding how to serialize
it properly.
Our current workflow:
1. We create a hierarchical arrow::Schema.
2. Then we create a matching arrow::RecordBatchBuilder (with
arrow::RecordBatchBuilder::Make()), which is effectively a hierarchy of
ArrayBuilders of various types.
3. Then we serialize all our documents one by one into the
RecordBatchBuilder by walking the document and ArrayBuilder hierarchies
simultaneously.
5. Then we convert the resulting RecordBatch to a Table and try to save it
to a Parquet file with parquet::arrow::FileWriter::WriteTable(). (A
condensed sketch of these steps follows this list.)
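
For concreteness, here is a minimal sketch of steps 1-3 and 5 with a single
two-field struct column. All schema and field names (doc, id, score) are
invented for illustration, and exact signatures (Status out-parameters vs.
arrow::Result) may differ depending on the Arrow revision:

#include <arrow/api.h>
#include <arrow/io/api.h>
#include <parquet/arrow/writer.h>

arrow::Status WriteNestedDocs() {
  // Step 1: a hierarchical schema -- one struct column with two children.
  auto doc_type = arrow::struct_({arrow::field("id", arrow::int64()),
                                  arrow::field("score", arrow::float32())});
  auto schema = arrow::schema({arrow::field("doc", doc_type)});

  // Step 2: a RecordBatchBuilder wrapping the matching builder hierarchy.
  std::unique_ptr<arrow::RecordBatchBuilder> builder;
  ARROW_RETURN_NOT_OK(arrow::RecordBatchBuilder::Make(
      schema, arrow::default_memory_pool(), &builder));

  // Step 3: append one "document" by walking the builder hierarchy.
  auto* doc_builder = builder->GetFieldAs<arrow::StructBuilder>(0);
  auto* id_builder =
      static_cast<arrow::Int64Builder*>(doc_builder->field_builder(0));
  auto* score_builder =
      static_cast<arrow::FloatBuilder*>(doc_builder->field_builder(1));
  ARROW_RETURN_NOT_OK(doc_builder->Append());
  ARROW_RETURN_NOT_OK(id_builder->Append(42));
  ARROW_RETURN_NOT_OK(score_builder->Append(0.5f));

  // Step 5: flush to a RecordBatch, wrap it in a Table, write to Parquet.
  std::shared_ptr<arrow::RecordBatch> batch;
  ARROW_RETURN_NOT_OK(builder->Flush(&batch));
  std::shared_ptr<arrow::Table> table;
  ARROW_RETURN_NOT_OK(arrow::Table::FromRecordBatches({batch}, &table));

  std::shared_ptr<arrow::io::FileOutputStream> sink;
  ARROW_RETURN_NOT_OK(
      arrow::io::FileOutputStream::Open("/tmp/docs.parquet", &sink));
  // This is the call that fails for us on nested struct columns.
  return parquet::arrow::WriteTable(*table, arrow::default_memory_pool(),
                                    sink, /*chunk_size=*/1024);
}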
But at this point serialization fails with the error "Invalid: Nested
column branch had multiple children". We also tried to skip the conversion
to a Table and save the root column (a StructArray) directly with
parquet::arrow::FileWriter::WriteColumnChunk(), with the same result.
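
The WriteColumnChunk variant looked roughly like this (a sketch under the
same caveat about signatures; schema, batch, and sink would come from the
sketch above):

arrow::Status WriteRootColumnDirectly(
    const std::shared_ptr<arrow::Schema>& schema,
    const std::shared_ptr<arrow::RecordBatch>& batch,
    const std::shared_ptr<arrow::io::OutputStream>& sink) {
  std::unique_ptr<parquet::arrow::FileWriter> writer;
  ARROW_RETURN_NOT_OK(parquet::arrow::FileWriter::Open(
      *schema, arrow::default_memory_pool(), sink,
      parquet::default_writer_properties(), &writer));
  ARROW_RETURN_NOT_OK(writer->NewRowGroup(batch->num_rows()));
  // Fails the same way: "Invalid: Nested column branch had multiple
  // children".
  ARROW_RETURN_NOT_OK(writer->WriteColumnChunk(*batch->column(0)));
  return writer->Close();
}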
Judging by the code in writer.cc, the writer seems to expect a flat list of
columns. So there should be a step #4 that converts the hierarchical
RecordBatch into a flat one. For example, consider this hierarchical
schema:
struct {
    struct {
        int64;
        list {
            string;
        }
    }
    float;
}
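
In Arrow C++ terms this schema would be declared roughly as follows (all
field names here are placeholders we made up; only the shape matters):

#include <arrow/api.h>

std::shared_ptr<arrow::Schema> MakeHierarchicalSchema() {
  // struct { struct { int64; list { string; } } float; }
  auto inner = arrow::struct_({
      arrow::field("num", arrow::int64()),
      arrow::field("tags", arrow::list(arrow::utf8())),
  });
  auto outer = arrow::struct_({
      arrow::field("inner", inner),
      arrow::field("val", arrow::float32()),
  });
  return arrow::schema({arrow::field("outer", outer)});
}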
It should be flattened into a schema consisting of three top-level fields,
one per leaf:
struct {
    struct {
        int64;
    }
},
struct {
    struct {
        list {
            string;
        }
    }
},
struct {
    float;
}
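
Declared explicitly, the flattened schema would look roughly like this
(same placeholder names as above; the three top-level fields share a name
because each one keeps its original struct path):

#include <arrow/api.h>

std::shared_ptr<arrow::Schema> MakeFlattenedSchema() {
  // One top-level column per leaf, each wrapped in its original path.
  auto num_branch = arrow::struct_({arrow::field(
      "inner", arrow::struct_({arrow::field("num", arrow::int64())}))});
  auto tags_branch = arrow::struct_({arrow::field(
      "inner",
      arrow::struct_({arrow::field("tags", arrow::list(arrow::utf8()))}))});
  auto val_branch = arrow::struct_({arrow::field("val", arrow::float32())});
  return arrow::schema({arrow::field("outer", num_branch),
                        arrow::field("outer", tags_branch),
                        arrow::field("outer", val_branch)});
}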
I am curious whether we are going in the right direction. If so, do we need
to write the converter manually, or is there existing code that does this?
We use the master (HEAD) versions of Arrow and Parquet.