Thank you Matt, Tobias, and others for the great work on this.

I am -0.5 on this proposal in its current form because (pardon the
pedantry) what we have implemented here is not run-length encoding; it
is run-end encoding. Based on community input, the choice was made to
store run ends instead of run lengths because this enables O(log(N))
random access as opposed to O(N). This is a sensible choice, but it
comes with some trade-offs including limitations in array length
(which maybe not really a problem in practice) and lack of bit-for-bit
equivalence with RLE encodings that use run lengths like Velox's
SequenceVector encoding (which I think is a more serious problem in
practice).

I believe that we should either:
(a) rename this to "run-end encoding"
(b) change this to a parameterized type called "run encoding" that
takes a Boolean parameter specifying whether run lengths or run ends
are stored.

Ian

On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <zotthewiz...@gmail.com> wrote:
>
> Hello,
>
> I'd like to propose adding the RLE type based on earlier discussions[1][2]
> to the Arrow format:
> - Columnar Format description:
> https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60
> - Flatbuffers changes:
> https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07
>
> There is a proposed implementation available in both C++ (written by Tobias
> Zagorni) and Go[3][4]. Both implementations have mostly the same tests
> implemented and were tested to be compatible over IPC with an archery test.
> In both cases, the implementations are split out among several Draft PRs so
> that they can be easily reviewed piecemeal if the vote is approved, with
> each Draft PR including the changes of the one before it. The links
> provided are the Draft PRs with the entirety of the changes included.
>
> The vote will be open for at least 72 hours.
>
> [ ] +1 add the proposed RLE type to the Apache Arrow format
> [ ] -1 do not add the proposed RLE type to the Apache Arrow format
> because...
>
> Thanks much, and please let me know if any more information or links are
> needed (I've never proposed a vote before on here!)
>
> --Matt
>
> [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29
> [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq
> [3] https://github.com/apache/arrow/pull/14179
> [4] https://github.com/apache/arrow/pull/14223

Reply via email to