Re: PRs for RLE support

Matthew Topol Wed, 14 Sep 2022 10:04:45 -0700

Just wanted to chime in here that I also have several draft PRs forimplementing the RLE arrays in Go as the second implementation (sincewe use two implementations as a requirement to vote onchanges/additions to the format).


They can be found here:


<https://github.com/apache/arrow/pull/14111>
<https://github.com/apache/arrow/pull/14114>
<https://github.com/apache/arrow/pull/14126>

--Matt

On Wed, Sep 14 2022 at 09:44:15 AM -0700, Micah Kornfield<[email protected]> wrote:


  * Should we encode "run lengths" or "run ends"?

I think the project has leaned towards sublinear access, so run endsmakesense. The downside is that we run into similar issues withList/LargeListwhere the total number of elements is limited by bit-width (which canalsocause space wastage, e.g. with run ends it might be reasonable tolimit

bit-width to 16).

The values are definitely a child array.  However, encoding the run

 ends as a buffer (similar to list array for example) makes it
 difficult to calculate offsets.  Translating an array offset to a

buffer offset takes O(log(N)) time. If the run ends are encoded asa

 child array (so the RLE array has no buffers and two child arrays)
 then this problem goes away.

I'm not sure I understand this, could you provide an example of theproblem

that the child array solves?

On Wed, Sep 14, 2022 at 9:36 AM Weston Pace <[email protected]<mailto:[email protected]>> wrote:

 I'm going to bump this because it would be good to get feedback.  In
 particular it would be nice to get feedback on the suggested format
 change[1].  We are currently moving forward on coming up with an IPC
 format proposal which we will share when ready.

 The two interesting points that jump out to me are:

  * Should we encode "run lengths" or "run ends"?

For example, should 5,5,5,6,6,7,7,7,7 be encoded with "run lengths"3,

 2, 4 or "run ends" 3, 5, 9.  In the proposal the latter is preferred
 as that leads to O(log(N)) random access (instead of O(N)) and it's
 not clear there are any disadvantages.

  * Should the run ends be encoded as a buffer or a child array?

 The values are definitely a child array.  However, encoding the run
 ends as a buffer (similar to list array for example) makes it
 difficult to calculate offsets.  Translating an array offset to a

buffer offset takes O(log(N)) time. If the run ends are encoded asa

 child array (so the RLE array has no buffers and two child arrays)
 then this problem goes away.

 [1] <https://github.com/apache/arrow/pull/13333/files>

 On Thu, Aug 25, 2022 at 10:35 AM Tobias Zagorni

<[email protected] <mailto:[email protected]>>wrote:

 >
 > Hello Everyone,
 >

> Recently, I have implemented support for run-length encoding inArrow

 > C++. So far my implementation is split into different subtasks of
 > ARROW-16771 (<https://issues.apache.org/jira/browse/ARROW-16771>).
 >
 > I have (draft) PRs available for:
 > - general handling of RLE in arrow C++, Type, Arrow, Builder
 > subclasses, etc.
 >   (subtasks 1-9)
 > - encode, decode kernels (fixed size only):
 >   (<https://issues.apache.org/jira/browse/ARROW-16772>)
 > - filter kernel (fixed size only):
 >   (<https://issues.apache.org/jira/browse/ARROW-16774>)
 > - simple benchmark for the RLE kernels
 >   (<https://issues.apache.org/jira/browse/ARROW-17026>)
 > - adding RLE to Arrow Columnar format document
 >   (<https://issues.apache.org/jira/browse/ARROW-16773>)
 >
 > What is not yet implemented:
 > - converting RLE to formats like Parquet, JSON, IPC.

> - caching of physical offsets when working with sliced arrays -finding> these offsets is an O(log(n)) binary search which could beavoided in

 > a lot of cases
 >

> I'm interested in any feedback on the code and I'm wondering whatwould

 > be the best way to get this merged.
 >

> A lot of the PRs depend on earlier ones. I ordered the subtasksin a

 > way they could be merged. The first would be:
 > 1. Handling of array-only types using VisitTypeInline:
 >    <https://issues.apache.org/jira/browse/ARROW-17258>
 > 2. Adding RLE type / array class (only builds on #1):
 >    <https://issues.apache.org/jira/browse/ARROW-17261>

> - also, since it has no dependencies: adding RLE to ArrowColumnar

 > format document
 >    <https://issues.apache.org/jira/browse/ARROW-16773>
 >
 > Best,
 > Tobias

Re: PRs for RLE support

Reply via email to