I recently started working on a new set of value readers in Java that support hierarchical schemas. I ended up with some code that's a lot easier to read than the current Java version, and is slightly faster (at least for my Avro tests). It may be helpful for this work on the C++ side.
Here's the list reader implementation:
https://github.com/Netflix/iceberg/blob/parquet-value-readers/parquet/src/main/java/com/netflix/iceberg/parquet/ParquetValueReaders.java#L172

rb

On Wed, Jan 17, 2018 at 9:41 AM, Wes McKinney <[email protected]> wrote:
> This work would only involve the Arrow interface in src/parquet/arrow
> (converting from Arrow representation to repetition/definition level
> encoding, and back), so you wouldn't need to master the whole Parquet
> codebase, at least. I'd like to help with this work, but realistically
> I won't have bandwidth for it until February or more likely March
> sometime.
>
> - Wes
>
> On Wed, Jan 17, 2018 at 10:11 AM, Jim Pivarski <[email protected]> wrote:
> > I also have a use-case that requires lists-of-structs and encountered
> > that limitation in pyarrow. Just one level deep would enable a lot of
> > HEP data.
> >
> > I've worked out the logic of converting Parquet definition and
> > repetition levels into Arrow-style arrays:
> >
> > https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
> > https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238
> >
> > which is subtle because record nullability and list lengths are
> > intertwined. (Repetition levels, by themselves, cannot encode empty
> > lists, so they do it through an interaction with definition levels.)
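> >
> > As a minimal illustration of that interaction (a simplified, hypothetical
> > sketch, not the oamap implementation), take a single optional list<int32>
> > column under the usual three-level list encoding, where definition level
> > 0 means the list itself is null, 1 means it is present but empty, 2 means
> > a list slot exists but holds a null element, and 3 means a non-null
> > element. An empty row and a null row each produce exactly one level entry
> > with repetition level 0; only the definition level tells them apart:
> >
> >     # Hypothetical helper, simplified to a single list column.
> >     def levels_to_arrow(rep_levels, def_levels, leaf_values):
> >         offsets, list_valid, values = [0], [], []
> >         leaves = iter(leaf_values)
> >         for rep, d in zip(rep_levels, def_levels):
> >             if rep == 0:                   # this entry starts a new top-level row
> >                 offsets.append(offsets[-1])
> >                 list_valid.append(d >= 1)  # definition level 0 = the list is null
> >             if d >= 2:                     # a list slot actually exists
> >                 offsets[-1] += 1
> >                 values.append(next(leaves) if d == 3 else None)
> >         return offsets, list_valid, values
> >
> >     # Rows [1, 2], [], None, [3]:
> >     print(levels_to_arrow([0, 1, 0, 0, 0], [3, 3, 1, 0, 3], [1, 2, 3]))
> >     # -> ([0, 2, 2, 2, 3], [True, True, False, True], [1, 2, 3])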
> >
> > I also have a suite of artificial samples that test combinations of
> > these features:
> >
> > https://github.com/diana-hep/oamap/tree/master/tests/samples
> >
> > It's hard for me to imagine diving into a new codebase (Parquet C++) and
> > adding this feature on my own, but I'd be willing to work with someone
> > who is familiar with it, knows which regions of the code need to be
> > changed, and can work in parallel with me remotely. The translation from
> > intertwined definition and repetition levels to Arrow's separate arrays
> > for each level of structure was not easy, and I'd like to spread this
> > knowledge now that my implementation seems to work.
> >
> > Anyone interested in teaming up?
> > -- Jim
> >
> > On Wed, Jan 10, 2018 at 7:36 PM, Wes McKinney <[email protected]> wrote:
> > > hi Andrei,
> > >
> > > We are in need of development assistance in the Parquet C++ project
> > > (https://github.com/apache/parquet-cpp) implementing complete support
> > > for reading and writing nested Arrow data. We only support simple
> > > structs (and structs of structs) and lists (and lists of lists) at the
> > > moment. It's something I'd like to get done in 2018 if no one else
> > > gets there first, but it isn't enough of a priority for me personally
> > > right now to guarantee any kind of timeline.
> > >
> > > Thanks
> > > Wes
> > >
> > > On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
> > > > We would like to use a combination of Arrow and Parquet to store
> > > > JSON-like hierarchical data. We are having trouble understanding how
> > > > to serialize it properly.
> > > >
> > > > Our current workflow:
> > > > 1. We create a hierarchical arrow::Schema.
> > > > 2. Then we create a matching arrow::RecordBatchBuilder (with
> > > >    arrow::RecordBatchBuilder::Make()), which is effectively a
> > > >    hierarchy of ArrayBuilders of various types.
> > > > 3. Then we serialize all our documents one by one into the
> > > >    RecordBatchBuilder by walking simultaneously through the document
> > > >    and the ArrayBuilder hierarchies.
> > > > 5. Then we convert the resulting RecordBatch to a Table and try to
> > > >    save it to a Parquet file with
> > > >    parquet::arrow::FileWriter::WriteTable().
> > > >
> > > > At this point serialization fails with the error "Invalid: Nested
> > > > column branch had multiple children". We also tried to skip the Table
> > > > conversion and save the root column (a StructArray) directly with
> > > > parquet::arrow::FileWriter::WriteColumnChunk, with the same result.
> > > >
> > > > Looking at the writer.cc code, it seems to expect a flat list of
> > > > columns. So there should be a step #4 that converts a hierarchical
> > > > RecordBatch into a flat RecordBatch. For example, a hierarchical
> > > > schema such as
> > > >
> > > >     struct {
> > > >       struct {
> > > >         int64;
> > > >         list {
> > > >           string;
> > > >         }
> > > >       }
> > > >       float;
> > > >     }
> > > >
> > > > should be flattened into a flat schema consisting of three top-level
> > > > fields:
> > > >
> > > >     struct {
> > > >       struct {
> > > >         int64;
> > > >       }
> > > >     },
> > > >     struct {
> > > >       struct {
> > > >         list {
> > > >           string;
> > > >         }
> > > >       }
> > > >     },
> > > >     struct {
> > > >       float;
> > > >     }
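> > > >
> > > > At the schema level, such a step #4 could look roughly like the
> > > > sketch below (illustrative only; the `flatten` helper is hypothetical
> > > > and written against a recent pyarrow rather than the C++ API, and the
> > > > hard part, rewriting the data itself with the right
> > > > definition/repetition levels, is not shown). It walks the nested type
> > > > and rebuilds one single-leaf nested field per leaf column:
> > > >
> > > >     import pyarrow as pa
> > > >
> > > >     # Hypothetical sketch: enumerate the leaf columns of a nested field
> > > >     # and rebuild one single-leaf nested field per leaf.
> > > >     def flatten(field):
> > > >         t = field.type
> > > >         if pa.types.is_struct(t):
> > > >             children = [t.field(i) for i in range(t.num_fields)]
> > > >             return [pa.field(field.name, pa.struct([leaf]))
> > > >                     for child in children
> > > >                     for leaf in flatten(child)]
> > > >         if pa.types.is_list(t):
> > > >             return [pa.field(field.name, pa.list_(leaf))
> > > >                     for leaf in flatten(t.value_field)]
> > > >         return [field]
> > > >
> > > >     nested = pa.field("root", pa.struct([
> > > >         pa.field("inner", pa.struct([
> > > >             pa.field("a", pa.int64()),
> > > >             pa.field("b", pa.list_(pa.string())),
> > > >         ])),
> > > >         pa.field("c", pa.float32()),
> > > >     ]))
> > > >
> > > >     for leaf_column in flatten(nested):
> > > >         print(leaf_column)  # three single-leaf struct columns, as above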
> > > >
> > > > I am curious whether we are going in the right direction. If yes, do
> > > > we need to write the converter manually, or is there any existing
> > > > code that does that?
> > > >
> > > > We use master::HEAD versions of Arrow and Parquet.

--
Ryan Blue
Software Engineer
Netflix