RLE would probably have some benefits that are worth evaluating. I would personally start with a minimal benchmarking suite for some of the cases where we expect to see the most benefit (e.g. filtering), so we can discuss with real numbers.
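As a rough illustration of the kind of benchmark meant here, a filtering comparison could look something like the sketch below. It is pure Python and does not use any Arrow API; every function name in it (rle_encode, filter_rle, make_data, bench, ...) is made up for this example, and the data shape (long runs over ten distinct values) is an assumption. Absolute timings from pure Python will not reflect what C++ kernels would do; the point is only the shape of the comparison.

```python
import timeit
from itertools import groupby


def rle_encode(values):
    """Encode a sequence as two separate lists: run lengths and run values."""
    run_lengths, run_values = [], []
    for value, group in groupby(values):
        run_values.append(value)
        run_lengths.append(sum(1 for _ in group))
    return run_lengths, run_values


def rle_decode(run_lengths, run_values):
    """Expand run lengths and run values back into a plain list."""
    out = []
    for length, value in zip(run_lengths, run_values):
        out.extend([value] * length)
    return out


def filter_plain(values, predicate):
    """Filter a plain list: the predicate is evaluated once per element."""
    return [v for v in values if predicate(v)]


def filter_rle(run_lengths, run_values, predicate):
    """Filter RLE data: the predicate is evaluated once per run.

    Adjacent surviving runs with equal values are not merged, so the
    result is valid RLE but not necessarily the shortest encoding.
    """
    kept_lengths, kept_values = [], []
    for length, value in zip(run_lengths, run_values):
        if predicate(value):
            kept_lengths.append(length)
            kept_values.append(value)
    return kept_lengths, kept_values


def make_data(n_runs, run_length):
    """Hypothetical test data: n_runs runs, cycling through 10 values."""
    data = []
    for i in range(n_runs):
        data.extend([i % 10] * run_length)
    return data


def bench(n_runs=1_000, run_length=100, repeats=20):
    """Time both filters on the same logical data; returns seconds per variant."""
    data = make_data(n_runs, run_length)
    lengths, values = rle_encode(data)

    def is_even(v):
        return v % 2 == 0

    return {
        "plain": timeit.timeit(lambda: filter_plain(data, is_even), number=repeats),
        "rle": timeit.timeit(lambda: filter_rle(lengths, values, is_even), number=repeats),
    }


if __name__ == "__main__":
    for name, seconds in bench().items():
        print(f"{name}: {seconds:.4f}s")
```

Varying run_length in such a sketch would show where the encoding stops paying off, which is probably the number most useful for the discussion.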
Also, the currently proposed format stores run lengths and values separately. A format where run lengths and values are interleaved in the same buffer might allow more optimisations in the context of vectorised operations, even though it might be harder to work with for types that are not fixed width.

On Tue, Jun 7, 2022 at 7:56 PM Tobias Zagorni <tob...@zagorni.eu.invalid> wrote:

> I created a Jira for adding RLE as ARROW-16771, and draft PRs:
>
> - https://github.com/apache/arrow/pull/13330
>   Encode/Decode functions for (currently fixed width types only)
>
> - https://github.com/apache/arrow/pull/13333
>   For updating docs
>
> Best,
> Tobias
>
> On Tuesday, 31.05.2022 at 17:13 -0500, Wes McKinney wrote:
> > I haven't had a chance to look at the branch in detail, but if you can
> > provide a pointer to a specification or other details about the
> > proposed memory format for RLE (basically: what would be added to the
> > columnar documentation as well as the Flatbuffers schema files), it
> > would be helpful so it can be circulated to some other interested
> > parties working primarily outside of Arrow (e.g. DuckDB) who might
> > like to converge on a standard especially given that it would be
> > exported across the C data interface. Thanks!