[GitHub] [arrow] pitrou commented on a diff in pull request #14176: ARROW-16782: [Format] Add REE definitions to FlatBuffers

GitBox Mon, 09 Jan 2023 07:55:57 -0800


pitrou commented on code in PR #14176:
URL: https://github.com/apache/arrow/pull/14176#discussion_r1064785727



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------

Review Comment:
   You need to ensure that the decoration is exactly aligned to the title:
   ```suggestion
   Run-End Encoded Layout
   ----------------------
   ```



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.)
+
+A run must have have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot 
be
+null.
+
+As an example, you could have the following data: ::
+
+    type: Float32
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-end-encoded form, this could appear as:
+
+::
+
+    * Length: 7, Null count: 2
+    * Children arrays:
+
+      * run_ends (Int32):
+        * Length: 3, Null count: 0
+        * Validity bitmap buffer: Not required
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes  6-63           |

Review Comment:
   ```suggestion
             | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
   ```



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.

Review Comment:
   There seems to be some redundancy here
   ```suggestion
   by itself, but has two child arrays. The first child array, called the run 
ends array,
   holds either 16, 32, or 64-bit signed integers. The actual values of each run
   are held in the second child array.
   ```



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.)
+
+A run must have have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot 
be
+null.
+
+As an example, you could have the following data: ::
+
+    type: Float32
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-end-encoded form, this could appear as:
+
+::
+
+    * Length: 7, Null count: 2
+    * Children arrays:
+
+      * run_ends (Int32):
+        * Length: 3, Null count: 0
+        * Validity bitmap buffer: Not required
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes  6-63           |
+          |-------------|-------------|-------------|-----------------------|
+          | 4           | 6           | 7           | unspecified (padding) |
+
+    * values (Float32):
+        * Length: 3, Null count: 1
+        * Validity bitmap buffer:
+
+          | Byte 0 (validity bitmap) | Bytes 1-63            |
+          |--------------------------|-----------------------|
+          | 00000101                 | 0 (padding)           |
+
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes  6-63           |

Review Comment:
   ```suggestion
             | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes 12-63           |
   ```



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.)
+
+A run must have have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot 
be
+null.
+
+As an example, you could have the following data: ::
+
+    type: Float32
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-end-encoded form, this could appear as:
+
+::
+
+    * Length: 7, Null count: 2
+    * Children arrays:

Review Comment:
   Is it "children arrays" or "child arrays"?



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.)
+
+A run must have have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot 
be

Review Comment:
   ```suggestion
   run ends array all are positive and in strictly ascending order. A run end 
cannot be
   ```



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values. (Contrast this with run-length encoding, in
+which the lengths of the runs are represented directly, and in which random
+access is less efficient.)
+
+A run must have have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot 
be
+null.
+
+As an example, you could have the following data: ::
+
+    type: Float32
+    [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-end-encoded form, this could appear as:
+
+::
+
+    * Length: 7, Null count: 2
+    * Children arrays:
+
+      * run_ends (Int32):
+        * Length: 3, Null count: 0
+        * Validity bitmap buffer: Not required
+        * Values buffer
+
+          | Bytes 0-3   | Bytes 4-7   | Bytes 8-11  | Bytes  6-63           |
+          |-------------|-------------|-------------|-----------------------|
+          | 4           | 6           | 7           | unspecified (padding) |
+
+    * values (Float32):

Review Comment:
   This should indented the same as "run_ends" above.



##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,72 @@ application.
 We discuss dictionary encoding as it relates to serialization further
 below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-end encoding (REE) is a variation of run-length encoding (RLE). These
+encodings are well-suited for representing data containing sequences of the
+same value, called runs. In run-end encoding, each run is represented as a
+value and an integer giving the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held in the second child 
array.
+For the purposes of determining field names and schemas, these child arrays
+are prescribed the standard names of **run_ends** and **values** respectively.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical

Review Comment:
   This is confusing. Why start by saying they represent the length, and then 
explain that they don't hold the actual length?
   
   



##########
format/Schema.fbs:
##########
@@ -178,6 +178,9 @@ table FixedSizeBinary {
 table Bool {
 }
 
+table RunEndEncoded {

Review Comment:
   Can you add a short comment above to briefly explain this?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] pitrou commented on a diff in pull request #14176: ARROW-16782: [Format] Add REE definitions to FlatBuffers

Reply via email to