This work would only involve the Arrow interface in src/parquet/arrow
(converting from Arrow representation to repetition/definition level
encoding, and back), so you wouldn't need to master the whole Parquet
codebase, at least. I'd like to help with this work, but realistically
I won't have bandwidth for it until February or more likely March
sometime.

- Wes
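[For anyone following along, the forward half of that round trip — Arrow-style nested values down to definition/repetition levels — can be sketched in a few lines of Python. This is a simplified illustration for a single nullable list column of non-null integers, not parquet-cpp's actual encoder; the function name and level assignments are chosen for the example only.]

```python
def encode_levels(rows):
    """Encode a column of nullable lists of non-null values into
    Dremel-style definition/repetition levels plus a flat value stream.

    Definition levels: 0 = list is null, 1 = list is empty,
                       2 = a value is present.
    Repetition levels: 0 = this entry starts a new row,
                       1 = it continues the current list.
    """
    def_levels, rep_levels, values = [], [], []
    for row in rows:
        if row is None:
            def_levels.append(0)
            rep_levels.append(0)
        elif len(row) == 0:
            def_levels.append(1)
            rep_levels.append(0)
        else:
            for i, value in enumerate(row):
                def_levels.append(2)
                rep_levels.append(0 if i == 0 else 1)
                values.append(value)
    return def_levels, rep_levels, values

# encode_levels([[1, 2], None, [], [3]])
# -> def [2, 2, 0, 1, 2], rep [0, 1, 0, 0, 0], values [1, 2, 3]
```

[Note that a null list and an empty list produce the same repetition level and are told apart only by the definition level — exactly the interaction Jim describes below.]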

On Wed, Jan 17, 2018 at 10:11 AM, Jim Pivarski <[email protected]> wrote:
> I also have a use-case that requires lists-of-structs and encountered that
> limitation in pyarrow. Just one level deep would enable a lot of HEP data.
>
> I've worked out the logic of converting Parquet definition and repetition
> levels into Arrow-style arrays:
>
> https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L604
> https://github.com/diana-hep/oamap/blob/master/oamap/source/parquet.py#L238
>
>
> which is subtle because record nullability and list lengths are
> intertwined. (Repetition levels, by themselves, cannot encode empty lists,
> so they do it through an interaction with definition levels.) I also have a
> suite of artificial samples that test combinations of these features:
>
> https://github.com/diana-hep/oamap/tree/master/tests/samples
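[The inverse direction Jim describes — levels back into Arrow's offsets-plus-validity representation — can be sketched as follows. This is a minimal illustration for one level of list nesting with non-null elements, not the oamap code linked above; all names are invented for the example.]

```python
def decode_levels(def_levels, rep_levels, values):
    """Decode Dremel-style levels for a nullable list column into
    Arrow-style pieces: a per-row validity list, a list-offsets array,
    and the flat values (which pass through unchanged).

    Definition levels: 0 = null list, 1 = empty list, 2 = value present.
    Repetition level 0 starts a new row; 1 continues the current list.
    """
    validity, offsets = [], [0]
    for d, r in zip(def_levels, rep_levels):
        if r == 0:
            # A new row begins: it is valid unless the list itself is
            # null, and it contributes one value only if d says so.
            validity.append(d != 0)
            offsets.append(offsets[-1] + (1 if d == 2 else 0))
        else:
            # Continuation of the current list: grow the last offset.
            offsets[-1] += 1
    return validity, offsets, values

# def [2, 2, 0, 1, 2], rep [0, 1, 0, 0, 0], values [1, 2, 3]
# -> validity [True, False, True, True], offsets [0, 2, 2, 2, 3]
#    i.e. the rows [[1, 2], None, [], [3]]
```

[An empty list and a null list yield identical repetition levels and are distinguished only by their definition levels — the intertwining described above.]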
>
>
> It's hard for me to imagine diving into a new codebase (Parquet C++) and
> adding this feature on my own, but I'd be willing to work with someone who
> is familiar with it, knows which regions of the code need to be changed,
> and can work in parallel with me remotely. The translation from intertwined
> definition and repetition levels to Arrow's separate arrays for each level
> of structure was not easy, and I'd like to spread this knowledge now that
> my implementation seems to work.
>
> Anyone interested in teaming up?
> -- Jim
>
>
>
> On Wed, Jan 10, 2018 at 7:36 PM, Wes McKinney <[email protected]> wrote:
>
>> hi Andrei,
>>
>> We are in need of development assistance in the Parquet C++ project
>> (https://github.com/apache/parquet-cpp) implementing complete support
>> for reading and writing nested Arrow data. We only support simple
>> structs (and structs of structs) and lists (and lists of lists) at the
>> moment. It's something I'd like to get done in 2018 if no one else
>> gets there first, but it isn't enough of a priority for me personally
>> right now to guarantee any kind of timeline.
>>
>> Thanks
>> Wes
>>
>> On Wed, Jan 3, 2018 at 4:04 AM, Andrei Gudkov <[email protected]> wrote:
>> > We would like to use a combination of Arrow and Parquet to store
>> > JSON-like hierarchical data, and we are having trouble understanding
>> > how to serialize it properly.
>> >
>> > Our current workflow:
>> > 1. We create a hierarchical arrow::Schema.
>> > 2. Then we create a matching arrow::RecordBatchBuilder (with
>> > arrow::RecordBatchBuilder::Make()), which is effectively a hierarchy
>> > of ArrayBuilders of various types.
>> > 3. Then we serialize all our documents one by one into the
>> > RecordBatchBuilder by walking simultaneously through the document and
>> > ArrayBuilder hierarchies.
>> > 5. Then we convert the resulting RecordBatch to a Table and try to
>> > save it to a Parquet file with parquet::arrow::FileWriter::WriteTable().
>> >
>> > But at this point serialization fails with the error "Invalid: Nested
>> > column branch had multiple children". We also tried skipping the Table
>> > conversion and writing the root column (a StructArray) directly with
>> > parquet::arrow::FileWriter::WriteColumnChunk(), with the same result.
>> >
>> > Looking at the writer.cc code, it seems to expect a flat list of
>> > columns. So there should be a step #4 that converts the hierarchical
>> > RecordBatch into a flat RecordBatch. For example, a hierarchical
>> > schema like this one
>> >
>> > struct {
>> >   struct {
>> >     int64;
>> >     list {
>> >       string;
>> >     }
>> >   }
>> >   float;
>> > }
>> >
>> > should be flattened into a flat schema consisting of three top-level
>> > fields:
>> >
>> > struct {
>> >   struct {
>> >     int64;
>> >   }
>> > },
>> > struct {
>> >   struct {
>> >     list {
>> >       string;
>> >     }
>> >   }
>> > },
>> > struct {
>> >   float;
>> > }
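[Schematically, the flattening described above produces one single-branch nested column per leaf. A toy sketch of that transformation, using nested tuples rather than the real arrow::Schema/DataType API — all names here are illustrative, not part of any Arrow interface:]

```python
def leaf_paths(fields, prefix=()):
    """Enumerate root-to-leaf paths of a nested schema.

    A field is (name, type); a type is a primitive name, a
    ('struct', fields) pair, or a ('list', element_type) pair.
    Lists are treated as leaves because their repetition structure
    travels with the leaf column in Parquet.
    """
    for name, typ in fields:
        if isinstance(typ, tuple) and typ[0] == 'struct':
            yield from leaf_paths(typ[1], prefix + (name,))
        else:
            yield prefix + (name,), typ

def split_into_single_leaf_columns(fields):
    """Rebuild one single-branch nested column per leaf, mirroring
    the flattening the Parquet writer expects."""
    columns = []
    for path, typ in leaf_paths(fields):
        col = (path[-1], typ)
        for name in reversed(path[:-1]):
            col = (name, ('struct', [col]))  # re-wrap in its ancestors
        columns.append(col)
    return columns

# The schema from the example above:
schema = [('a', ('struct', [('x', 'int64'),
                            ('y', ('list', 'string'))])),
          ('b', 'float')]
# split_into_single_leaf_columns(schema) yields three top-level fields:
#   ('a', ('struct', [('x', 'int64')]))
#   ('a', ('struct', [('y', ('list', 'string'))]))
#   ('b', 'float')
```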
>> >
>> > I am curious whether we are going in the right direction. If so, do
>> > we need to write the converter manually, or is there existing code
>> > that does this?
>> >
>> > We are using the master (HEAD) versions of Arrow and Parquet.
>> >
