This is an automated email from the ASF dual-hosted git repository.

wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git


The following commit(s) were added to refs/heads/master by this push:
     new 6df8620  ARROW-9222: [Format] Columnar.rst changes for removing validity bitmap from union types
6df8620 is described below

commit 6df862096c796f438c1b6cf054f51e2e2228b368
Author: Wes McKinney <[email protected]>
AuthorDate: Thu Jul 2 17:38:19 2020 -0500

    ARROW-9222: [Format] Columnar.rst changes for removing validity bitmap from union types
    
    See mailing list discussion.
    
    Closes #7535 from wesm/union-no-validity
    
    Authored-by: Wes McKinney <[email protected]>
    Signed-off-by: Wes McKinney <[email protected]>
---
 docs/source/format/Columnar.rst | 55 +++++++++++++++++++----------------------
 1 file changed, 26 insertions(+), 29 deletions(-)

diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 814c789..9aeaf95 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -180,11 +180,13 @@ large as the array length.
 Validity bitmaps
 ----------------
 
-Any type can have null value slots, whether primitive or nested type.
+Any value in an array may be semantically null, whether primitive or nested
+type.
 
-An array with nulls must have a contiguous memory buffer, known as the
-validity (or "null") bitmap, large enough to have at least 1 bit for
-each array slot.
+All array types, with the exception of union types (more on these later),
+utilize a dedicated memory buffer, known as the validity (or "null") bitmap, to
+encode the nullness or non-nullness of each value slot. The validity bitmap
+must be large enough to have at least 1 bit for each array slot.
 
 Whether any array slot is valid (non-null) is encoded in the respective bits of
 this bitmap. A 1 (set bit) for index ``j`` indicates that the value is not null,
@@ -208,8 +210,8 @@ bitmap. Implementations may choose to always allocate one anyway as a
 matter of convenience, but this should be noted when memory is being
 shared.
 
-Nested type arrays have their own validity bitmap and null count
-regardless of the null count and valid bits of their child arrays.
+Nested type arrays except for union types have their own validity bitmap and
+null count regardless of the null count and valid bits of their child arrays.
 
 Array slots which are null are not required to have a particular
 value; any "masked" memory can have any value and need not be zeroed,
@@ -535,6 +537,10 @@ A union is defined by an ordered sequence of types; each slot in the
 union can have a value chosen from these types. The types are named
 like a struct's fields, and the names are part of the type metadata.
 
+Unlike other data types, unions do not have their own validity bitmap. Instead,
+the nullness of each slot is determined exclusively by the child arrays which
+are composed to create the union.
+
 We define two distinct union types, "dense" and "sparse", that are
 optimized for different use cases.
 
@@ -565,38 +571,33 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
 
 ::
 
-    * Length: 4, Null count: 1
-    * Validity bitmap buffer:
-      |Byte 0 (validity bitmap) | Bytes 1-63            |
-      |-------------------------|-----------------------|
-      |00001101                 | 0 (padding)           |
-
+    * Length: 4, Null count: 0
     * Types buffer:
 
       |Byte 0   | Byte 1      | Byte 2   | Byte 3   | Bytes 4-63  |
       |---------|-------------|----------|----------|-------------|
-      | 0       | unspecified | 0        | 1        | unspecified |
+      | 0       | 0           | 0        | 1        | unspecified |
 
     * Offset buffer:
 
       |Bytes 0-3 | Bytes 4-7   | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
       |----------|-------------|------------|-------------|-------------|
-      | 0        | unspecified | 1          | 0           | unspecified |
+      | 0        | 1           | 2          | 0           | unspecified |
 
     * Children arrays:
       * Field-0 array (f: float):
-        * Length: 2, nulls: 0
-        * Validity bitmap buffer: Not required
+        * Length: 2, Null count: 1
+        * Validity bitmap buffer: 00000101
 
         * Value Buffer:
 
-          | Bytes 0-7 | Bytes 8-63  |
-          |-----------|-------------|
-          | 1.2, 3.4  | unspecified |
+          | Bytes 0-11     | Bytes 12-63  |
+          |----------------|-------------|
+          | 1.2, null, 3.4 | unspecified |
 
 
       * Field-1 array (i: int32):
-        * Length: 1, nulls: 0
+        * Length: 1, Null count: 0
         * Validity bitmap buffer: Not required
 
         * Value Buffer:
@@ -628,8 +629,6 @@ For the union array: ::
 will have the following layout: ::
 
     * Length: 6, Null count: 0
-    * Validity bitmap buffer: Not required
-
     * Types buffer:
 
      | Byte 0     | Byte 1      | Byte 2      | Byte 3      | Byte 4      | Byte 5       | Bytes  6-63           |
@@ -688,11 +687,9 @@ will have the following layout: ::
             |------------|-----------------------|
             | joemark    | unspecified (padding) |
 
-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null.  Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Only the slot in the array corresponding to the type index is considered. All
+"unselected" values are ignored and could be any semantically correct array
+value.
 
 Null Layout
 -----------
@@ -769,8 +766,8 @@ of memory buffers for each layout.
    "List",validity,offsets,
    "Fixed-size List",validity,,
    "Struct",validity,,
-   "Sparse Union",validity,type ids,
-   "Dense Union",validity,type ids,offsets
+   "Sparse Union",type ids,,
+   "Dense Union",type ids,offsets,
    "Null",,,
    "Dictionary-encoded",validity,data (indices),
 

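Illustrative note (not part of the commit): the dense union example in the
diff above can be read back with a short sketch showing how nullness is taken
from the child arrays alone. This is plain Python, not the Arrow library API.
The helper names `bit_is_set` and `dense_union_to_list` are hypothetical. The
sketch assumes type ids map directly to child indices, that the float child
holds three slots as its value buffer suggests, and that validity bitmaps use
least-significant-bit numbering as in the rest of the columnar format.

```python
def bit_is_set(bitmap: bytes, j: int) -> bool:
    """Return True if bit j is set, using least-significant-bit numbering."""
    return bool((bitmap[j // 8] >> (j % 8)) & 1)


def dense_union_to_list(type_ids, offsets, children):
    """Materialize a dense union as a Python list.

    `children` is a list of (validity_bitmap_or_None, values) pairs.
    Assumption: type id k selects children[k]. The union itself has no
    validity bitmap, so a slot is null only if the selected child slot is.
    """
    out = []
    for type_id, offset in zip(type_ids, offsets):
        validity, values = children[type_id]
        if validity is not None and not bit_is_set(validity, offset):
            out.append(None)  # null comes from the child, not the union
        else:
            out.append(values[offset])
    return out


# Buffers from the example: [{f=1.2}, null, {f=3.4}, {i=5}]
float_child = (bytes([0b00000101]), [1.2, 0.0, 3.4])  # middle slot is masked
int_child = (None, [5])                               # no bitmap: no nulls

result = dense_union_to_list(
    type_ids=[0, 0, 0, 1],
    offsets=[0, 1, 2, 0],
    children=[float_child, int_child],
)
print(result)
```

The value stored in the masked float slot (0.0 here) is arbitrary, since
null slots are not required to hold any particular value.
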