This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new a6e90de  ARROW-9259: [Format] Add language indicating that unsigned dictionary indices are supported but that signed integers are preferred
a6e90de is described below

commit a6e90de29254d34d1e64b0edb804a0587e93daad
Author: Wes McKinney <[email protected]>
AuthorDate: Thu Jul 2 17:48:44 2020 -0500

    ARROW-9259: [Format] Add language indicating that unsigned dictionary indices are supported but that signed integers are preferred
    
    This does not alter the format metadata in any way but has implications for the reference implementations (e.g. C++ currently rejects unsigned integer indices).
    
    Closes #7567 from wesm/format-unsigned-dict-indices
    
    Authored-by: Wes McKinney <[email protected]>
    Signed-off-by: Wes McKinney <[email protected]>
---
 docs/source/format/Columnar.rst | 19 ++++++++++++-------
 format/Schema.fbs               |  7 +++++--
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 9aeaf95..2232afa 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -706,13 +706,12 @@ values by integers referencing a **dictionary** usually consisting of
 unique values. It can be effective when you have data with many
 repeated values.
 
-Any array can be dictionary-encoded. The dictionary is stored as an
-optional property of an array. When a field is dictionary encoded, the
-values are represented by an array of signed integers representing the
-index of the value in the dictionary. The memory layout for a
-dictionary-encoded array is the same as that of a primitive signed
-integer layout. The dictionary is handled as a separate columnar array
-with its own respective layout.
+Any array can be dictionary-encoded. The dictionary is stored as an optional
+property of an array. When a field is dictionary encoded, the values are
+represented by an array of non-negative integers representing the index of the
+value in the dictionary. The memory layout for a dictionary-encoded array is
+the same as that of a primitive integer layout. The dictionary is handled as a
+separate columnar array with its own respective layout.
 
 As an example, you could have the following data: ::
 
@@ -748,6 +747,12 @@ nulls:
 The null count of such arrays is dictated only by the validity bitmap
 of its indices, irrespective of any null values in the dictionary.
 
+Since unsigned integers can be more difficult to work with in some cases
+(e.g. in the JVM), we recommend preferring signed integers over unsigned
+integers for representing dictionary indices. Additionally, we recommend
+avoiding using 64-bit unsigned integer indices unless they are required by an
+application.
+
 We discuss dictionary encoding as it relates to serialization further
 below.
 
diff --git a/format/Schema.fbs b/format/Schema.fbs
index d834f90..aa364bb 100644
--- a/format/Schema.fbs
+++ b/format/Schema.fbs
@@ -288,8 +288,11 @@ table DictionaryEncoding {
   /// DictionaryBatch messages
   id: long;
 
-  /// The dictionary indices are constrained to be positive integers. If this
-  /// field is null, the indices must be signed int32
+  /// The dictionary indices are constrained to be non-negative integers. If
+  /// this field is null, the indices must be signed int32. To maximize
+  /// cross-language compatibility and performance, implementations are
+  /// recommended to prefer signed integer types over unsigned integer types
+  /// and to avoid uint64 indices unless they are required by an application.
   indexType: Int;
 
   /// By default, dictionaries are not ordered, or the order does not have
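
Illustration (not part of the commit): the recommendation above can be sketched in plain Python. The helper names dictionary_encode and smallest_signed_index_type are hypothetical and do not belong to any Arrow implementation. The sketch assumes that nulls are carried by the indices (shown here as None) and that an implementation picks the narrowest signed integer type able to hold the largest index.

```python
# Hypothetical helpers for illustration only; not an Arrow API.

# Signed integer widths and the largest index each can hold.
_SIGNED_TYPES = [
    ("int8", 2**7 - 1),
    ("int16", 2**15 - 1),
    ("int32", 2**31 - 1),
    ("int64", 2**63 - 1),
]


def smallest_signed_index_type(dictionary_size):
    """Return the narrowest signed type whose range covers every index."""
    if dictionary_size < 0:
        raise ValueError("dictionary_size must be non-negative")
    max_index = max(dictionary_size - 1, 0)
    for name, limit in _SIGNED_TYPES:
        if max_index <= limit:
            return name
    raise OverflowError("dictionary too large for a signed 64-bit index")


def dictionary_encode(values):
    """Encode values as (indices, dictionary); None marks a null index."""
    dictionary = []   # unique values in order of first appearance
    positions = {}    # value -> index into dictionary
    indices = []
    for value in values:
        if value is None:
            indices.append(None)  # null lives in the indices, not the dictionary
            continue
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        indices.append(positions[value])
    return indices, dictionary


indices, dictionary = dictionary_encode(["a", "b", "a", None, "c", "b"])
index_type = smallest_signed_index_type(len(dictionary))
```

All indices produced this way are non-negative, so a signed type loses only one bit of range compared with its unsigned counterpart while remaining easy to handle on platforms without native unsigned integers.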
