Just wanted to chime in here that I also have several draft PRs for implementing the RLE arrays in Go as the second implementation (since we use two implementations as a requirement to vote on changes/additions to the format).

They can be found here:

<https://github.com/apache/arrow/pull/14111>
<https://github.com/apache/arrow/pull/14114>
<https://github.com/apache/arrow/pull/14126>

--Matt

On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield <emkornfi...@gmail.com> wrote:

  * Should we encode "run lengths" or "run ends"?


I think the project has leaned towards sublinear access, so run ends make sense. The downside is that we run into similar issues with List/LargeList where the total number of elements is limited by bit-width (which can also cause space wastage, e.g. with run ends it might be reasonable to limit
bit-width to 16).

The values are definitely a child array.  However, encoding the run
 ends as a buffer (similar to list array for example) makes it
 difficult to calculate offsets.  Translating an array offset to a
buffer offset takes O(log(N)) time. If the run ends are encoded as a
 child array (so the RLE array has no buffers and two child arrays)
 then this problem goes away.


I'm not sure I understand this, could you provide an example of the problem
that the child array solves?




On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <weston.p...@gmail.com <mailto:weston.p...@gmail.com>> wrote:

 I'm going to bump this because it would be good to get feedback.  In
 particular it would be nice to get feedback on the suggested format
 change[1].  We are currently moving forward on coming up with an IPC
 format proposal which we will share when ready.

 The two interesting points that jump out to me are:

  * Should we encode "run lengths" or "run ends"?

For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths" 3,
 2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
 as that leads to O(log(N)) random access (instead of O(N)) and it's
 not clear there are any disadvantages.

  * Should the run ends be encoded as a buffer or a child array?

 The values are definitely a child array.  However, encoding the run
 ends as a buffer (similar to list array for example) makes it
 difficult to calculate offsets.  Translating an array offset to a
buffer offset takes O(log(N)) time. If the run ends are encoded as a
 child array (so the RLE array has no buffers and two child arrays)
 then this problem goes away.

 [1] <https://github.com/apache/arrow/pull/13333/files>

 On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni
<tob...@zagorni.eu.invalid <mailto:tob...@zagorni.eu.invalid>> wrote:
 >
 > Hello Everyone,
 >
> Recently, I have implemented support for run-length encoding in Arrow
 > C++. So far my implementation is split into different subtasks of
 > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
 >
 > I have (draft) PRs available for:
 > - general handling of RLE in arrow C++, Type, Arrow, Builder
 > subclasses, etc.
 >   (subtasks 1-9)
 > - encode, decode kernels (fixed size only):
 >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
 > - filter kernel (fixed size only):
 >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
 > - simple benchmark for the RLE kernels
 >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
 > - adding RLE to Arrow Columnar format document
 >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
 >
 > What is not yet implemented:
 > - converting RLE to formats like Parquet, JSON, IPC.
> - caching of physical offsets when working with sliced arrays - finding > these offsets is an O(log(n)) binary search which could be avoided in
 > a lot of cases
 >
> I'm interested in any feedback on the code and I'm wondering what would
 > be the best way to get this merged.
 >
> A lot of the PRs depend on earlier ones. I ordered the subtasks in a
 > way they could be merged. The first would be:
 > 1. Handling of array-only types using VisitTypeInline:
 >    <https://issues.apache.org/jira/browse/ARROW-17258>
 > 2. Adding RLE type / array class (only builds on #1):
 >    <https://issues.apache.org/jira/browse/ARROW-17261>
> - also, since it has no dependencies: adding RLE to Arrow Columnar
 > format document
 >    <https://issues.apache.org/jira/browse/ARROW-16773>
 >
 > Best,
 > Tobias


Reply via email to