I'm going to bump this because it would be good to get feedback. In
particular it would be nice to get feedback on the suggested format
change[1]. We are currently moving forward on coming up with an IPC
format proposal which we will share when ready.
The two interesting points that jump out to me are:
* Should we encode "run lengths" or "run ends"?
For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"
3,
2, 4 or "run ends" 3, 5, 9. In the proposal the latter is preferred
as that leads to O(log(N)) random access (instead of O(N)) and it's
not clear there are any disadvantages.
* Should the run ends be encoded as a buffer or a child array?
The values are definitely a child array. However, encoding the run
ends as a buffer (similar to list array for example) makes it
difficult to calculate offsets. Translating an array offset to a
buffer offset takes O(log(N)) time. If the run ends are encoded as
a
child array (so the RLE array has no buffers and two child arrays)
then this problem goes away.
[1] <https://github.com/apache/arrow/pull/13333/files>
On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
<tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>>
wrote:
>
> Hello Everyone,
>
> Recently, I have implemented support for run-length encoding in
Arrow
> C++. So far my implementation is split into different subtasks of
> ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
>
> I have (draft) PRs available for:
> - general handling of RLE in arrow C++, Type, Arrow, Builder
> subclasses, etc.
> (subtasks 1-9)
> - encode, decode kernels (fixed size only):
> (<https://issues.apache.org/jira/browse/ARROW-16772>)
> - filter kernel (fixed size only):
> (<https://issues.apache.org/jira/browse/ARROW-16774>)
> - simple benchmark for the RLE kernels
> (<https://issues.apache.org/jira/browse/ARROW-17026>)
> - adding RLE to Arrow Columnar format document
> (<https://issues.apache.org/jira/browse/ARROW-16773>)
>
> What is not yet implemented:
> - converting RLE to formats like Parquet, JSON, IPC.
> - caching of physical offsets when working with sliced arrays -
finding
> these offsets is an O(log(n)) binary search which could be
avoided in
> a lot of cases
>
> I'm interested in any feedback on the code and I'm wondering what
would
> be the best way to get this merged.
>
> A lot of the PRs depend on earlier ones. I ordered the subtasks
in a
> way they could be merged. The first would be:
> 1. Handling of array-only types using VisitTypeInline:
> <https://issues.apache.org/jira/browse/ARROW-17258>
> 2. Adding RLE type / array class (only builds on #1):
> <https://issues.apache.org/jira/browse/ARROW-17261>
> - also, since it has no dependencies: adding RLE to Arrow
Columnar
> format document
> <https://issues.apache.org/jira/browse/ARROW-16773>
>
> Best,
> Tobias