The current API for writing repetition and definition levels takes arrays
of int16_t values for the levels. This seems inefficient when there are
"runs" of the same level. Writers that don't already have the data in
array form need to expand each run into an array, which ultimately gets
run-length encoded again via RleEncoder [1]. A similar inefficiency exists
when reading.
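
To make the extra step concrete, here is a minimal sketch of the expansion a
caller has to do today when it holds levels as runs. LevelRun and ExplodeRuns
are hypothetical names for illustration only; neither is part of Arrow or
parquet-cpp.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration: `count` consecutive values share `level`.
struct LevelRun {
  int16_t level;
  int64_t count;
};

// Expands runs into one int16_t per value, which is the shape an
// array-based level API expects. The cost is O(total values) in both
// time and memory, even when there are only a handful of runs.
std::vector<int16_t> ExplodeRuns(const std::vector<LevelRun>& runs) {
  int64_t total = 0;
  for (const LevelRun& run : runs) {
    assert(run.count >= 0);
    total += run.count;
  }
  std::vector<int16_t> levels;
  levels.reserve(static_cast<std::size_t>(total));
  for (const LevelRun& run : runs) {
    // Appends `run.count` copies of `run.level`.
    levels.insert(levels.end(), static_cast<std::size_t>(run.count), run.level);
  }
  return levels;
}
```

The resulting array is then handed to the writer, which run-length encodes
it again, so the run structure is thrown away and rediscovered.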

I was wondering if the community has considered adding APIs to the
column_writer/reader [2][3] that expose the encoder/decoder, either
directly or through a facade around them?

At least for encoding, it seems like an extra wrapper around RleEncoder
would be needed, since the required memory size wouldn't be known up front.
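
As a rough sketch of what I have in mind, a facade could accept runs and
grow its own storage as they arrive. LevelRunWriter and its methods below
are hypothetical and not an existing Arrow or parquet-cpp API. The sketch
also stores plain (level, count) pairs in a std::vector rather than
producing the real RLE/bit-packed bytes, so it only illustrates the
interface shape and the growable buffer.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical facade, for illustration only.
class LevelRunWriter {
 public:
  struct Run {
    int16_t level;
    int64_t count;
  };

  // Appends `count` repetitions of `level`. Adjacent runs with the same
  // level are merged, so callers can write in whatever chunks they have.
  void WriteRun(int16_t level, int64_t count) {
    assert(count >= 0);
    if (count == 0) return;
    if (!runs_.empty() && runs_.back().level == level) {
      runs_.back().count += count;
    } else {
      // std::vector grows on demand, so no size is needed up front.
      runs_.push_back({level, count});
    }
    num_values_ += count;
  }

  int64_t num_values() const { return num_values_; }
  const std::vector<Run>& runs() const { return runs_; }

 private:
  std::vector<Run> runs_;
  int64_t num_values_ = 0;
};
```

A real version would presumably forward each run to the underlying encoder
and resize a byte buffer instead of holding the pairs, but the caller-facing
API could look much the same.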

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
[2]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
[3]
https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
