[C++] Adding RLEEncoder/Decoder to parquet API column writer/reader API

Micah Kornfield Sat, 07 Mar 2020 14:55:07 -0800

The current API for writing repetition and definition levels takes arrays
of int16_t values for the levels.  This seems inefficient when there are
"runs" of the same level.  In that case, writers that don't already have
the data in array form need to explode it out to an array, which
ultimately gets run-length encoded again via RLEEncoder [1]. A similar
inefficiency exists when reading: runs are decoded into an array of levels
even if the caller only cares about the runs.
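
To make the round trip concrete, here is a minimal standalone sketch. It
does not use the Arrow or Parquet headers; LevelRun, ExpandRuns and
CollapseRuns are hypothetical names used only for illustration, and
CollapseRuns is a plain run-length pass rather than the actual logic in
rle_encoding.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type: a run of identical definition or repetition levels.
struct LevelRun {
  int16_t level;
  int64_t count;
};

// What a run-oriented producer has to do with an array-based API:
// materialize one int16_t per value before handing levels to the writer.
std::vector<int16_t> ExpandRuns(const std::vector<LevelRun>& runs) {
  std::vector<int16_t> levels;
  for (const LevelRun& run : runs) {
    assert(run.count >= 0);
    levels.insert(levels.end(), static_cast<std::size_t>(run.count), run.level);
  }
  return levels;
}

// Simplified stand-in for what a run-length encoder then does:
// rediscover the runs the caller already knew about.
std::vector<LevelRun> CollapseRuns(const std::vector<int16_t>& levels) {
  std::vector<LevelRun> runs;
  for (int16_t level : levels) {
    if (!runs.empty() && runs.back().level == level) {
      ++runs.back().count;
    } else {
      runs.push_back({level, 1});
    }
  }
  return runs;
}
```

With two runs (1000 values at level 1, then 3 at level 0), ExpandRuns
allocates 1003 int16_t values only for CollapseRuns to reduce them back to
the same two runs.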


I was wondering if the community has considered adding APIs to the
column_writer/reader [2][3] that expose the encoder/decoder directly or a
facade around them?

At least for encoding it seems like an extra wrapper for RLEEncoder would
be needed since the required memory size wouldn't be known up front.
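
As a rough illustration of the kind of wrapper I mean, here is a sketch of
a run-oriented encoder that owns a growable buffer, so the caller never has
to size the output up front. GrowableRunEncoder and PutRun are hypothetical
names, not existing Arrow or Parquet APIs. The byte layout (a varint header
holding count << 1, followed by the level in whole bytes) is only loosely
modeled on the RLE runs of Parquet's hybrid encoding; it omits bit-packed
runs and does not merge adjacent runs of the same level.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical wrapper: accepts (level, count) runs and appends them to an
// internal buffer that grows on demand.
class GrowableRunEncoder {
 public:
  explicit GrowableRunEncoder(int bit_width)
      : bit_width_(bit_width), value_bytes_((bit_width + 7) / 8) {
    assert(bit_width >= 0 && bit_width <= 16);
  }

  // Append `count` repetitions of `level` as a single run.
  void PutRun(uint16_t level, uint64_t count) {
    if (count == 0) return;
    assert(count < (uint64_t{1} << 63));   // count << 1 must not overflow
    assert(level < (1 << bit_width_));     // level must fit in bit_width bits

    // Header: (count << 1) as a little-endian base-128 varint.
    uint64_t header = count << 1;
    while (header >= 0x80) {
      buffer_.push_back(static_cast<uint8_t>(header | 0x80));
      header >>= 7;
    }
    buffer_.push_back(static_cast<uint8_t>(header));

    // Repeated value: the level, little-endian, in whole bytes.
    for (int i = 0; i < value_bytes_; ++i) {
      buffer_.push_back(static_cast<uint8_t>(level >> (8 * i)));
    }
  }

  const std::vector<uint8_t>& data() const { return buffer_; }

 private:
  int bit_width_;
  int value_bytes_;
  std::vector<uint8_t> buffer_;
};
```

A writer holding 1000 values at level 1 would call PutRun(1, 1000) on an
encoder constructed with bit_width 1, and the buffer would grow by three
bytes instead of the caller allocating 1000 int16_t values first.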

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
[2]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
[3]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151

[C++] Adding RLEEncoder/Decoder to parquet API column writer/reader API
