Thanks for the feedback, Wes.  I'll open a JIRA and see if I can get to
this after my work on Arrow to/from Parquet.

On Mon, Mar 9, 2020 at 3:10 PM Wes McKinney <[email protected]> wrote:

> hi Micah,
>
> This sounds like a good idea to me. Not sure why we didn't think of it
> before, but better to do this late than never.
>
> - Wes
>
> On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield <[email protected]>
> wrote:
> >
> > The current API for writing repetition and definition levels takes arrays
> > of int16_t values for the levels.  This seems inefficient when there are
> > "runs" of the same level.  When there are runs, writers that don't already
> > have the data in array form need to explode it out to an array, which
> > ultimately gets run-length encoded again via RLEEncoder [1]. A similar
> > inefficiency exists when reading.
> >
> > I was wondering if the community has considered adding APIs to the
> > column_writer/reader [2][3] that expose the encoder/decoder directly or a
> > facade around them?
> >
> > At least for encoding it seems like an extra wrapper for RLEEncoder would
> > be needed since the required memory size wouldn't be known up front.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
> > [2] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
> > [3] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
>
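
To make the inefficiency described in the thread concrete, here is a minimal,
standalone C++ sketch.  It is not the Parquet C++ API: LevelRun, ExplodeRuns
and RunLevelSink are hypothetical names invented for illustration.  ExplodeRuns
mimics the current shape, where a caller that already knows its runs still has
to materialize one int16_t per value.  RunLevelSink shows what a run-oriented
facade might look like, growing its own storage so the caller does not need to
know the required size up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only; not part of Arrow or Parquet C++.
struct LevelRun {
  int16_t level;  // repetition or definition level
  int64_t count;  // number of consecutive values sharing that level
};

// Mimics the array-based path: every run is expanded to one int16_t per
// value, only for a run-length encoder to compress it again afterwards.
std::vector<int16_t> ExplodeRuns(const std::vector<LevelRun>& runs) {
  std::vector<int16_t> levels;
  for (const LevelRun& run : runs) {
    levels.insert(levels.end(), static_cast<std::size_t>(run.count), run.level);
  }
  return levels;
}

// Hypothetical facade that accepts runs directly.  Here it only buffers and
// coalesces them; a real implementation would presumably forward each run to
// the underlying encoder and manage the output buffer itself.
class RunLevelSink {
 public:
  void PutRun(int16_t level, int64_t count) {
    assert(count > 0);
    if (!runs_.empty() && runs_.back().level == level) {
      runs_.back().count += count;  // extend the current run
    } else {
      runs_.push_back({level, count});  // start a new run
    }
    num_levels_ += count;
  }

  int64_t num_levels() const { return num_levels_; }
  const std::vector<LevelRun>& runs() const { return runs_; }

 private:
  std::vector<LevelRun> runs_;
  int64_t num_levels_ = 0;
};
```

With a facade of this kind, the memory a caller touches would scale with the
number of runs rather than the number of values.  How such a wrapper would
actually sit on top of RLEEncoder is left open in the thread.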
