alamb commented on code in PR #13333:
URL: https://github.com/apache/arrow/pull/13333#discussion_r971268286
##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,66 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.
+.. _run-length-encoded-layout:
+
+Run-Length Encoded Layout
+-------------------------
+
+Run-Length is a data representation that represents data as sequences of the
+same value, called runs. Each run is represented as a value, and an integer
+describing how often this value is repeated.
+
+Any array can be run-length encoded. A run-length encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed 32-bit integer
Review Comment:
```suggestion
by itself, but has two child arrays. The first one holds a signed 32-bit integer called a "run end"
```
##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,66 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.
+.. _run-length-encoded-layout:
+
+Run-Length Encoded Layout
+-------------------------
+
+Run-Length is a data representation that represents data as sequences of the
+same value, called runs. Each run is represented as a value, and an integer
+describing how often this value is repeated.
+
+Any array can be run-length encoded. A run-length encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed 32-bit integer
+for each run. The actual values of each run are held the second child array.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values.
+
+A run has to have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot be
+null.
+
+As an example, you could have the following data: ::
+
+ type: Float32
+ [1.0, 1.0, 1.0, 1.0, null, null, 2.0]
+
+In Run-length-encoded form, this could appear as:
+
+::
+
+ * Length: 7, Null count: 2
+ * Children arrays:
+
+ * run ends (Int32):
+ * Length: 3, Null count: 0
+ * Validity bitmap buffer: Not required
Review Comment:
```suggestion
* Validity bitmap buffer: Not present (not allowed)
```
See comment above
##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,66 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.
+.. _run-length-encoded-layout:
+
+Run-Length Encoded Layout
+-------------------------
+
+Run-Length is a data representation that represents data as sequences of the
+same value, called runs. Each run is represented as a value, and an integer
+describing how often this value is repeated.
+
+Any array can be run-length encoded. A run-length encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed 32-bit integer
+for each run. The actual values of each run are held the second child array.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values.
+
+A run has to have a length of at least 1. This means the values in the
+run ends array all positive and in strictly ascending order. A run end cannot be
+null.
Review Comment:
I recommend adding something to this section saying "RLE Arrays must not
have a validity bitmap", which would also imply that a run end cannot be
null. As worded, I think it is unclear whether a validity bitmap is allowed
but must be all `valid`, or whether the bitmap is not allowed at all.
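
To make the constraint concrete, here is a rough sketch of the checks the quoted text implies for the run ends child: no nulls, all positive, strictly ascending. The function name `check_run_ends` and the plain-list representation are hypothetical, chosen only for illustration; this is not the Arrow API.

```python
def check_run_ends(run_ends):
    """Raise ValueError if run_ends breaks the rules in the quoted text.

    Hypothetical helper: run_ends is a plain Python list standing in for
    the Int32 run ends child array.
    """
    previous = 0
    for position, end in enumerate(run_ends):
        # A run end cannot be null.
        if end is None:
            raise ValueError(f"run end at position {position} is null")
        # Each run has length >= 1, so every run end must exceed the
        # previous one (and the first must exceed 0).
        if end <= previous:
            raise ValueError(
                f"run end {end} at position {position} is not greater than {previous}"
            )
        previous = end


# Run ends derived from the example data in the hunk above:
# four 1.0 values, two nulls, one 2.0.
check_run_ends([4, 6, 7])
```

Note that the nulls in the example live in the values child, not in the run ends, which is why forbidding a validity bitmap on the run ends loses nothing.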
##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,66 @@ application.
We discuss dictionary encoding as it relates to serialization further
below.
+.. _run-length-encoded-layout:
+
+Run-Length Encoded Layout
+-------------------------
+
+Run-Length is a data representation that represents data as sequences of the
+same value, called runs. Each run is represented as a value, and an integer
+describing how often this value is repeated.
+
+Any array can be run-length encoded. A run-length encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed 32-bit
integer
+for each run. The actual values of each run are held the second child array.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values.
+
+A run has to have a length of at least 1. This means the values in the
Review Comment:
```suggestion
A run must have a length of at least 1. This means the values in the
```
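
As a side note on the random-access paragraph in this hunk, the binary search it describes could look roughly like the sketch below. The name `rle_get` and the list-based arrays are assumptions made for illustration, not part of the PR; the run ends `[4, 6, 7]` are derived from the example data `[1.0, 1.0, 1.0, 1.0, null, null, 2.0]`.

```python
from bisect import bisect_right


def rle_get(run_ends, values, index):
    """Return the logical element at `index` (hypothetical helper).

    run_ends holds accumulated run lengths; values holds one entry per run.
    """
    length = run_ends[-1] if run_ends else 0
    if not 0 <= index < length:
        raise IndexError(index)
    # The run containing `index` is the first one whose end is > index.
    return values[bisect_right(run_ends, index)]


run_ends = [4, 6, 7]
values = [1.0, None, 2.0]

# Decoding every logical index recovers the original data.
decoded = [rle_get(run_ends, values, i) for i in range(run_ends[-1])]

# Individual run lengths come from subtracting adjacent run ends.
run_lengths = [end - start for start, end in zip([0] + run_ends, run_ends)]
```

Storing accumulated ends rather than raw lengths is what keeps the lookup logarithmic in the number of runs; with raw lengths it would need a linear scan.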