Hi Andrei,

We need development help in the Parquet C++ project
(https://github.com/apache/parquet-cpp) to implement complete support
for reading and writing nested Arrow data. At the moment we only support
simple structs (and structs of structs) and lists (and lists of lists).
It's something I'd like to get done in 2018 if no one else gets there
first, but it isn't enough of a priority for me personally right now to
guarantee any kind of timeline.

Thanks
Wes

On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
> We would like to use a combination of Arrow and Parquet to store JSON-like
> hierarchical data. We are having trouble understanding how to serialize it
> properly.
>
> Our current workflow:
> 1. We create a hierarchical arrow::Schema
> 2. Then we create a matching arrow::RecordBatchBuilder (with
> arrow::RecordBatchBuilder::Make()), which is effectively a hierarchy of
> ArrayBuilders of various types
> 3. Then we serialize all our documents one by one into the RecordBatchBuilder
> by walking simultaneously through the document and ArrayBuilder hierarchies.
> 5. Then we convert the resulting RecordBatch to a Table and try to save it to
> a Parquet file with parquet::arrow::FileWriter::WriteTable().
>
> But at this point serialization fails with the error "Invalid: Nested column
> branch had multiple children". We also tried skipping the conversion to a
> Table and saving the root column (StructArray) directly with
> parquet::arrow::FileWriter::WriteColumnChunk, with the same result.
>
> Looking at the writer.cc code, it seems to expect a flat list of columns.
> So there should be a step #4 that converts a hierarchical RecordBatch to a
> flat RecordBatch. For example, this hierarchical schema
>
> struct {
>   struct {
>     int64;
>     list {
>       string;
>     }
>   }
>   float;
> }
>
> should be flattened into this flat schema consisting of three top-level
> fields:
>
> struct {
>   struct {
>     int64;
>   }
> },
> struct {
>   struct {
>     list {
>       string;
>     }
>   }
> },
> struct {
>   float;
> }
>
> I am curious whether we are going in the right direction. If yes, do we need
> to write the converter manually, or is there existing code that does that?
>
> We use master::HEAD versions of Arrow and Parquet.
>
>
>
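
The flattening described in the quoted message (one single-branch schema per
leaf) can be sketched roughly as below. This is only a toy model of the schema
reshaping: the Node type and the CollectBranches, Flatten, Render and
ExampleSchema helpers are hypothetical names invented for illustration, not
part of the Arrow or Parquet C++ APIs, and the sketch does not touch the array
data or say anything about what the Parquet writer accepts.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy schema node (hypothetical; NOT arrow::Field or arrow::DataType).
// A node without children is treated as a leaf (primitive) type.
struct Node {
  std::string type;            // e.g. "struct", "list", "int64"
  std::vector<Node> children;  // empty for leaves
};

// Depth-first walk. `path` holds the type names from the root down to the
// current node; each leaf emits one single-branch copy of that path.
void CollectBranches(const Node& node, std::vector<std::string>& path,
                     std::vector<Node>& out) {
  path.push_back(node.type);
  assert(!path.empty());  // the current node is always on the path here
  if (node.children.empty()) {
    // Rebuild the chain from the leaf upwards: leaf, its parent, ..., root.
    Node branch{path.back(), {}};
    for (auto it = path.rbegin() + 1; it != path.rend(); ++it) {
      Node parent{*it, {}};
      parent.children.push_back(std::move(branch));
      branch = std::move(parent);
    }
    out.push_back(std::move(branch));
  } else {
    for (const Node& child : node.children) {
      CollectBranches(child, path, out);
    }
  }
  path.pop_back();
}

// Returns one single-branch tree per leaf of `root`, in depth-first order.
std::vector<Node> Flatten(const Node& root) {
  std::vector<std::string> path;
  std::vector<Node> out;
  CollectBranches(root, path, out);
  return out;
}

// Renders a node compactly, e.g. "struct<struct<int64>>".
std::string Render(const Node& node) {
  if (node.children.empty()) return node.type;
  std::string s = node.type + "<";
  for (std::size_t i = 0; i < node.children.size(); ++i) {
    if (i > 0) s += ", ";
    s += Render(node.children[i]);
  }
  return s + ">";
}

// The hierarchical schema from the quoted message:
// struct { struct { int64; list { string; } } float; }
Node ExampleSchema() {
  Node list{"list", {Node{"string", {}}}};
  Node inner{"struct", {Node{"int64", {}}, list}};
  return Node{"struct", {inner, Node{"float", {}}}};
}
```

Applied to ExampleSchema(), Flatten returns three branches, which Render shows
as struct<struct<int64>>, struct<struct<list<string>>> and struct<float>,
matching the three top-level fields listed in the message.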
