Thank you Matt, Tobias, and others for the great work on this. I am -0.5 on this proposal in its current form because (pardon the pedantry) what we have implemented here is not run-length encoding; it is run-end encoding. Based on community input, the choice was made to store run ends instead of run lengths because this enables O(log(N)) random access as opposed to O(N). This is a sensible choice, but it comes with some trade-offs including limitations in array length (which maybe not really a problem in practice) and lack of bit-for-bit equivalence with RLE encodings that use run lengths like Velox's SequenceVector encoding (which I think is a more serious problem in practice).
I believe that we should either: (a) rename this to "run-end encoding" (b) change this to a parameterized type called "run encoding" that takes a Boolean parameter specifying whether run lengths or run ends are stored. Ian On Wed, Dec 14, 2022 at 11:27 AM Matt Topol <zotthewiz...@gmail.com> wrote: > > Hello, > > I'd like to propose adding the RLE type based on earlier discussions[1][2] > to the Arrow format: > - Columnar Format description: > https://github.com/apache/arrow/pull/13333/files#diff-8b68cf6859e881f2357f5df64bb073135d7ff6eeb51f116418660b3856564c60 > - Flatbuffers changes: > https://github.com/apache/arrow/pull/14176/files#diff-e54b4f5d2d279acc5d1df5df9a7636f0142a8041fe02f07034e0d8be48444b07 > > There is a proposed implementation available in both C++ (written by Tobias > Zagorni) and Go[3][4]. Both implementations have mostly the same tests > implemented and were tested to be compatible over IPC with an archery test. > In both cases, the implementations are split out among several Draft PRs so > that they can be easily reviewed piecemeal if the vote is approved, with > each Draft PR including the changes of the one before it. The links > provided are the Draft PRs with the entirety of the changes included. > > The vote will be open for at least 72 hours. > > [ ] +1 add the proposed RLE type to the Apache Arrow format > [ ] -1 do not add the proposed RLE type to the Apache Arrow format > because... > > Thanks much, and please let me know if any more information or links are > needed (I've never proposed a vote before on here!) > > --Matt > > [1] https://lists.apache.org/thread/bfz3m5nyf7flq7n6q9b1bx3jhcn4wq29 > [2] https://lists.apache.org/thread/xb7c723csrtwt0md3m4p56bt0193n7jq > [3] https://github.com/apache/arrow/pull/14179 > [4] https://github.com/apache/arrow/pull/14223