This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new a6e90de ARROW-9259: [Format] Add language indicating that unsigned
dictionary indices are supported but that signed integers are preferred
a6e90de is described below
commit a6e90de29254d34d1e64b0edb804a0587e93daad
Author: Wes McKinney <[email protected]>
AuthorDate: Thu Jul 2 17:48:44 2020 -0500
ARROW-9259: [Format] Add language indicating that unsigned dictionary
indices are supported but that signed integers are preferred
This does not alter the format metadata in any way but has implications for
the reference implementations (e.g. C++ currently rejects unsigned integer
indices).
Closes #7567 from wesm/format-unsigned-dict-indices
Authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
---
docs/source/format/Columnar.rst | 19 ++++++++++++-------
format/Schema.fbs | 7 +++++--
2 files changed, 17 insertions(+), 9 deletions(-)
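For readers of this notification, the recommendation in the patch below can be pictured with a minimal sketch. It is an illustration only, not code from this commit or from any Arrow implementation. The helper names `dictionary_encode` and `smallest_signed_index_type` are hypothetical, and storing a placeholder index of 0 in null slots is an assumption made for this example. The sketch encodes values as non-negative indices into a dictionary and then picks the narrowest signed integer width that can hold them.

```python
# Illustrative sketch only; helper names are hypothetical, not Arrow APIs.

# Signed index types with the largest index each one can represent.
SIGNED_INDEX_TYPES = [
    ("int8", 2**7 - 1),
    ("int16", 2**15 - 1),
    ("int32", 2**31 - 1),
    ("int64", 2**63 - 1),
]


def smallest_signed_index_type(dictionary_size):
    """Return the narrowest signed type able to index the dictionary."""
    max_index = dictionary_size - 1
    for name, max_value in SIGNED_INDEX_TYPES:
        if max_index <= max_value:
            return name
    raise ValueError("dictionary too large for a signed 64-bit index")


def dictionary_encode(values):
    """Encode values as (dictionary, indices, validity).

    Assumption for this sketch: a null (None) gets a placeholder index of 0
    and a False validity entry, so every stored index is non-negative.
    """
    dictionary, positions, indices, validity = [], {}, [], []
    for value in values:
        if value is None:
            indices.append(0)
            validity.append(False)
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
        validity.append(True)
    return dictionary, indices, validity


dictionary, indices, validity = dictionary_encode(
    ["foo", "bar", "foo", None, "baz", "bar"]
)
index_type = smallest_signed_index_type(len(dictionary))
```

Because the indices are never negative, a signed type loses only one bit of range compared with its unsigned counterpart, which is rarely a limitation in practice for dictionary sizes.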
diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 9aeaf95..2232afa 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -706,13 +706,12 @@ values by integers referencing a **dictionary** usually consisting of
unique values. It can be effective when you have data with many
repeated values.
-Any array can be dictionary-encoded. The dictionary is stored as an
-optional property of an array. When a field is dictionary encoded, the
-values are represented by an array of signed integers representing the
-index of the value in the dictionary. The memory layout for a
-dictionary-encoded array is the same as that of a primitive signed
-integer layout. The dictionary is handled as a separate columnar array
-with its own respective layout.
+Any array can be dictionary-encoded. The dictionary is stored as an optional
+property of an array. When a field is dictionary encoded, the values are
+represented by an array of non-negative integers representing the index of the
+value in the dictionary. The memory layout for a dictionary-encoded array is
+the same as that of a primitive integer layout. The dictionary is handled as a
+separate columnar array with its own respective layout.
As an example, you could have the following data: ::
@@ -748,6 +747,12 @@ nulls:
The null count of such arrays is dictated only by the validity bitmap
of its indices, irrespective of any null values in the dictionary.
+Since unsigned integers can be more difficult to work with in some cases
+(e.g. in the JVM), we recommend preferring signed integers over unsigned
+integers for representing dictionary indices. Additionally, we recommend
+avoiding using 64-bit unsigned integer indices unless they are required by an
+application.
+
We discuss dictionary encoding as it relates to serialization further
below.
diff --git a/format/Schema.fbs b/format/Schema.fbs
index d834f90..aa364bb 100644
--- a/format/Schema.fbs
+++ b/format/Schema.fbs
@@ -288,8 +288,11 @@ table DictionaryEncoding {
/// DictionaryBatch messages
id: long;
- /// The dictionary indices are constrained to be positive integers. If this
- /// field is null, the indices must be signed int32
+ /// The dictionary indices are constrained to be non-negative integers. If
+ /// this field is null, the indices must be signed int32. To maximize
+ /// cross-language compatibility and performance, implementations are
+ /// recommended to prefer signed integer types over unsigned integer types
+ /// and to avoid uint64 indices unless they are required by an application.
indexType: Int;
/// By default, dictionaries are not ordered, or the order does not have