Thanks for the feedback, Wes. I'll open a JIRA and see if I can get to this after my work on Arrow to/from Parquet.

On Mon, Mar 9, 2020 at 3:10 PM Wes McKinney <[email protected]> wrote:
> hi Micah,
>
> This sounds like a good idea to me. Not sure why we didn't think of it
> before, but better to do this late than never.
>
> - Wes
>
> On Sat, Mar 7, 2020 at 4:54 PM Micah Kornfield <[email protected]> wrote:
> >
> > The current API for writing repetition and definition levels takes arrays
> > of int16_t values for the levels. This seems inefficient when there are
> > "runs" of the same level. When there are runs, writers that don't already
> > have the data in array form, need to explode it out to an array which
> > ultimately gets run-length encoded again via RLEEncoder [1]. A similar
> > inefficiency exists when reading.
> >
> > I was wondering if the community has considered adding APIs to the
> > column_writer/reader [2][3] that expose the encoder/decoder directly or a
> > facade around them?
> >
> > At least for encoding it seems like an extra wrapper for RLEEncoder would
> > be needed since the required memory size wouldn't be known up front.
> >
> > Thanks,
> > Micah
> >
> > [1] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/arrow/util/rle_encoding.h
> > [2] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_writer.h#L158
> > [3] https://github.com/apache/arrow/blob/8f61c7997e55612940bff4be7b043f0ee61bf238/cpp/src/parquet/column_reader.h#L151
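
To make the inefficiency described in the quoted message concrete, here is a rough, standalone sketch of the round trip it describes: a writer that already knows its levels as runs has to explode them into a flat int16_t array, only for the runs to be rediscovered later. The names LevelRun, ExplodeRuns and CollapseLevels are hypothetical and exist only for this illustration. They are not part of the Parquet C++ API, and CollapseLevels is a plain run detector standing in for the encoding step, not RLEEncoder itself.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical type for illustration only: one run of identical
// repetition or definition levels.
struct LevelRun {
  int16_t level;
  int64_t count;
};

// What a writer holding runs has to do for an array-based API:
// materialize every level individually.
std::vector<int16_t> ExplodeRuns(const std::vector<LevelRun>& runs) {
  std::vector<int16_t> levels;
  for (const LevelRun& run : runs) {
    levels.insert(levels.end(), static_cast<std::size_t>(run.count), run.level);
  }
  return levels;
}

// Stand-in for the later encoding step: scan the flat array and
// rediscover the same runs (a simplification, not the real encoder).
std::vector<LevelRun> CollapseLevels(const std::vector<int16_t>& levels) {
  std::vector<LevelRun> runs;
  for (int16_t level : levels) {
    if (!runs.empty() && runs.back().level == level) {
      ++runs.back().count;  // extend the current run
    } else {
      runs.push_back({level, 1});  // start a new run
    }
  }
  return runs;
}
```

A run-based API would presumably let the writer hand something like the LevelRun values over directly and skip the intermediate array, though what the actual interface should look like is exactly the open question in this thread.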
