ianmcook commented on code in PR #14176: URL: https://github.com/apache/arrow/pull/14176#discussion_r1060929748
##########
docs/source/format/Columnar.rst:
##########
@@ -765,6 +765,68 @@ application.
 We discuss dictionary encoding as it relates to serialization further below.
 
+.. _run-end-encoded-layout:
+
+Run-End Encoded Layout
+-------------------------
+
+Run-End is a data representation that represents data as sequences of the
+same value, called runs. Each run is represented as a value, and an integer
+describing the index in the array where the run ends.
+
+Any array can be run-end encoded. A run-end encoded array has no buffers
+by itself, but has two child arrays. The first one holds a signed integer
+called a "run end" for each run. The run ends array can hold either 16, 32, or
+64-bit integers. The actual values of each run are held
+the second child array.
+
+The values in the first child array represent the length of each run. They do
+not hold the length of the respective run directly, but the accumulated length
+of all runs from the first to the current one, i.e. the logical index where the
+current run ends. This allows relatively efficient random access from a logical
+index using binary search. The length of an individual run can be determined by
+subtracting two adjacent values.

Review Comment:
```suggestion
index using binary search. The length of an individual run can be determined by
subtracting two adjacent values. (Contrast this with run-length encoding, in
which the lengths of the runs are represented directly, and in which random
access is less efficient.)
```
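To make the contrast in the suggestion concrete, here is a minimal sketch in plain Python of the layout described in the hunk. It is not Arrow's implementation and does not use any Arrow API; the helper names `run_end_encode`, `value_at`, and `run_lengths` are hypothetical and exist only for this illustration. It assumes the run ends are stored as accumulated lengths, as the quoted text describes, so a logical index can be resolved with a binary search.

```python
from bisect import bisect_right


def run_end_encode(values):
    """Encode a list as (run_ends, run_values).

    run_ends[i] is the logical index one past the last element of run i,
    i.e. the accumulated length of runs 0..i.
    """
    run_ends, run_values = [], []
    for i, v in enumerate(values):
        if run_values and run_values[-1] == v:
            run_ends[-1] = i + 1  # extend the current run
        else:
            run_values.append(v)  # start a new run
            run_ends.append(i + 1)
    return run_ends, run_values


def value_at(run_ends, run_values, logical_index):
    """Look up a logical index with a binary search over run_ends."""
    length = run_ends[-1] if run_ends else 0
    if not 0 <= logical_index < length:
        raise IndexError(logical_index)
    # First run whose end is strictly greater than the logical index.
    physical_index = bisect_right(run_ends, logical_index)
    return run_values[physical_index]


def run_lengths(run_ends):
    """Recover individual run lengths by subtracting adjacent run ends."""
    starts = [0] + run_ends[:-1]
    return [end - start for start, end in zip(starts, run_ends)]


run_ends, run_values = run_end_encode([1, 1, 1, 2, 2, 3])
print(run_ends, run_values)               # [3, 5, 6] [1, 2, 3]
print(value_at(run_ends, run_values, 3))  # 2
print(run_lengths(run_ends))              # [3, 2, 1]
```

With run-length encoding the stored values would be `[3, 2, 1]` instead, and resolving logical index 3 would require summing lengths from the start, which is the "less efficient" random access the suggestion refers to.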
