Hi Micah,

This sounds like a good idea to me. Not sure why we didn't think of it
before, but better to do this late than never.

- Wes

On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield <[email protected]> wrote:
>
> The current API for writing repetition and definition levels takes arrays
> of int16_t values for the levels.  This seems inefficient when there are
> "runs" of the same level.  When there are runs, writers that don't already
> have the data in array form need to expand it into an array, which
> ultimately gets run-length encoded again via RLEEncoder [1]. A similar
> inefficiency exists when reading.
>
> I was wondering if the community has considered adding APIs to the
> column_writer/reader [2][3] that expose the encoder/decoder directly or a
> facade around them?
>
> At least for encoding it seems like an extra wrapper for RLEEncoder would
> be needed since the required memory size wouldn't be known up front.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
> [2]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
> [3]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
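
For illustration only, the kind of run-oriented facade asked about in the quoted message might look roughly like the sketch below. It is not an existing Arrow or Parquet API: LevelRun, LevelRunBuffer, AppendRun, and Expand are hypothetical names, and the sketch stores runs in a growable vector instead of calling the real encoder. It only shows how a caller could hand over (level, count) pairs without first materializing an int16_t array, and how a wrapper can grow on demand when the required size is not known up front.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: one run of identical repetition or definition levels.
struct LevelRun {
  int16_t level;
  int64_t count;
};

// Hypothetical facade (not part of Arrow/Parquet): accepts levels as runs.
class LevelRunBuffer {
 public:
  // Append `count` copies of `level`, merging with the previous run when
  // the level is unchanged.
  void AppendRun(int16_t level, int64_t count) {
    assert(count >= 0);
    if (count == 0) return;
    if (!runs_.empty() && runs_.back().level == level) {
      runs_.back().count += count;
    } else {
      runs_.push_back({level, count});
    }
    num_levels_ += count;
  }

  // Expand into the flat int16_t array form that the current API takes.
  std::vector<int16_t> Expand() const {
    std::vector<int16_t> out;
    out.reserve(static_cast<std::size_t>(num_levels_));
    for (const LevelRun& run : runs_) {
      out.insert(out.end(), static_cast<std::size_t>(run.count), run.level);
    }
    return out;
  }

  int64_t num_levels() const { return num_levels_; }
  const std::vector<LevelRun>& runs() const { return runs_; }

 private:
  std::vector<LevelRun> runs_;  // grows on demand; no up-front size needed
  int64_t num_levels_ = 0;
};
```

A writer that knows it has 1000 values at definition level 1 would call AppendRun(1, 1000) once, where today it has to fill a 1000-element array. Expand() is included only to show the equivalence with the array form; a real implementation would presumably feed the runs to the encoder directly.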
