Hi Micah,

This sounds like a good idea to me. Not sure why we didn't think of it before, but better late than never.

- Wes

On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield <[email protected]> wrote:
>
> The current API for writing repetition and definition levels takes arrays
> of int16_t values for the levels. This seems inefficient when there are
> "runs" of the same level. When there are runs, writers that don't already
> have the data in array form, need to explode it out to an array which
> ultimately gets run-length encoded again via RLEEncoder [1]. A similar
> inefficiency exists when reading.
>
> I was wondering if the community has considered adding APIs to the
> column_writer/reader [2][3] that expose the encoder/decoder directly or a
> facade around them?
>
> At least for encoding it seems like an extra wrapper for RLEEncoder would
> be needed since the required memory size wouldn't be known up front.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
> [2]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
> [3]
> https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
