[GitHub] [arrow] westonpace commented on a diff in pull request #14176: ARROW-16782: [Format] Add REE definitions to FlatBuffers

GitBox Tue, 10 Jan 2023 07:06:39 -0800


westonpace commented on code in PR #14176:
URL: https://github.com/apache/arrow/pull/14176#discussion_r1065897695



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +773,87 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+----------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first child array, called the run 
ends array,
+holds either 16, 32, or 64-bit signed integers. The actual values of each run
+are held in the second child array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the accumulated length of all 
runs 
+from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.) 
+
+.. note::
+   Because the ``run_ends`` child array cannot have nulls, it's reasonable
+   to consider why the ``run_ends`` are a child array instead of just a
+   buffer, like the offsets for a :ref:`variable-size-list-layout`. This
+   layout was considered, but it was decided to use the child arrays. 
+
+   The full discussion can be found in the `mailing list REE layout 
discussion`_
+   but the gist of it is the following:
+
+   While maintaining two child arrays makes it easier and more efficient to
+   handle slicing run-end encoded arrays, the decision was more to help
+   distinguish between the logical length (there are no buffers with the
+   logical length, but there is an array) and the physical length (all of
+   the buffers are in these child arrays which are sized to the physical 
length).

Review Comment:
   I don't know if we want to justify our decision quite so much on the spec 
page.  Maybe something simpler like:
   
   ```suggestion
      Because the ``run_ends`` child array cannot have nulls, it's reasonable
      to consider why the ``run_ends`` are a child array instead of just a
      buffer, like the offsets for a :ref:`variable-size-list-layout`. This
      layout was considered, but it was decided to use the child arrays. 
   
      Child arrays allow us to keep the "logical length" (the decoded length)
      associated with the parent array and the "physical length" (the number
      of run ends) associated with the child arrays.  If ``run_ends`` was a
      buffer in the parent array then the size of the buffer would be unrelated
      to the length of the array and this would be confusing.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #14176: ARROW-16782: [Format] Add REE definitions to FlatBuffers

Reply via email to