This is an automated email from the ASF dual-hosted git repository.
wesm pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/arrow.git
The following commit(s) were added to refs/heads/master by this push:
new 6df8620 ARROW-9222: [Format] Columnar.rst changes for removing validity bitmap from union types
6df8620 is described below
commit 6df862096c796f438c1b6cf054f51e2e2228b368
Author: Wes McKinney <[email protected]>
AuthorDate: Thu Jul 2 17:38:19 2020 -0500
ARROW-9222: [Format] Columnar.rst changes for removing validity bitmap from union types
See mailing list discussion.
Closes #7535 from wesm/union-no-validity
Authored-by: Wes McKinney <[email protected]>
Signed-off-by: Wes McKinney <[email protected]>
---
docs/source/format/Columnar.rst | 55 +++++++++++++++++++----------------------
1 file changed, 26 insertions(+), 29 deletions(-)
diff --git a/docs/source/format/Columnar.rst b/docs/source/format/Columnar.rst
index 814c789..9aeaf95 100644
--- a/docs/source/format/Columnar.rst
+++ b/docs/source/format/Columnar.rst
@@ -180,11 +180,13 @@ large as the array length.
Validity bitmaps
----------------
-Any type can have null value slots, whether primitive or nested type.
+Any value in an array may be semantically null, whether primitive or nested
+type.
-An array with nulls must have a contiguous memory buffer, known as the
-validity (or "null") bitmap, large enough to have at least 1 bit for
-each array slot.
+All array types, with the exception of union types (more on these later),
+utilize a dedicated memory buffer, known as the validity (or "null") bitmap, to
+encode the nullness or non-nullness of each value slot. The validity bitmap
+must be large enough to have at least 1 bit for each array slot.
Whether any array slot is valid (non-null) is encoded in the respective bits of
this bitmap. A 1 (set bit) for index ``j`` indicates that the value is not null,
@@ -208,8 +210,8 @@ bitmap. Implementations may choose to always allocate one anyway as a
matter of convenience, but this should be noted when memory is being
shared.
-Nested type arrays have their own validity bitmap and null count
-regardless of the null count and valid bits of their child arrays.
+Nested type arrays except for union types have their own validity bitmap and
+null count regardless of the null count and valid bits of their child arrays.
Array slots which are null are not required to have a particular
value; any "masked" memory can have any value and need not be zeroed,
@@ -535,6 +537,10 @@ A union is defined by an ordered sequence of types; each slot in the
union can have a value chosen from these types. The types are named
like a struct's fields, and the names are part of the type metadata.
+Unlike other data types, unions do not have their own validity bitmap. Instead,
+the nullness of each slot is determined exclusively by the child arrays which
+are composed to create the union.
+
We define two distinct union types, "dense" and "sparse", that are
optimized for different use cases.
@@ -565,38 +571,33 @@ having the values: ``[{f=1.2}, null, {f=3.4}, {i=5}]``
::
- * Length: 4, Null count: 1
- * Validity bitmap buffer:
- |Byte 0 (validity bitmap) | Bytes 1-63 |
- |-------------------------|-----------------------|
- |00001101 | 0 (padding) |
-
+ * Length: 4, Null count: 0
* Types buffer:
|Byte 0 | Byte 1 | Byte 2 | Byte 3 | Bytes 4-63 |
|---------|-------------|----------|----------|-------------|
- | 0 | unspecified | 0 | 1 | unspecified |
+ | 0 | 0 | 0 | 1 | unspecified |
* Offset buffer:
|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-63 |
|----------|-------------|------------|-------------|-------------|
- | 0 | unspecified | 1 | 0 | unspecified |
+ | 0 | 1 | 2 | 0 | unspecified |
* Children arrays:
* Field-0 array (f: float):
- * Length: 2, nulls: 0
- * Validity bitmap buffer: Not required
+ * Length: 2, Null count: 1
+ * Validity bitmap buffer: 00000101
* Value Buffer:
- | Bytes 0-7 | Bytes 8-63 |
- |-----------|-------------|
- | 1.2, 3.4 | unspecified |
+ | Bytes 0-11 | Bytes 12-63 |
+ |----------------|-------------|
+ | 1.2, null, 3.4 | unspecified |
* Field-1 array (i: int32):
- * Length: 1, nulls: 0
+ * Length: 1, Null count: 0
* Validity bitmap buffer: Not required
* Value Buffer:
@@ -628,8 +629,6 @@ For the union array: ::
will have the following layout: ::
* Length: 6, Null count: 0
- * Validity bitmap buffer: Not required
-
* Types buffer:
| Byte 0 | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Bytes 6-63 |
@@ -688,11 +687,9 @@ will have the following layout: ::
|------------|-----------------------|
| joemark | unspecified (padding) |
-Similar to structs, a particular child array may have a non-null slot
-even if the validity bitmap of the parent union array indicates the
-slot is null. Additionally, a child array may have a non-null slot
-even if the types array indicates that a slot contains a different
-type at the index.
+Only the slot in the array corresponding to the type index is considered. All
+"unselected" values are ignored and could be any semantically correct array
+value.
Null Layout
-----------
@@ -769,8 +766,8 @@ of memory buffers for each layout.
"List",validity,offsets,
"Fixed-size List",validity,,
"Struct",validity,,
- "Sparse Union",validity,type ids,
- "Dense Union",validity,type ids,offsets
+ "Sparse Union",type ids,,
+ "Dense Union",type ids,offsets,
"Null",,,
"Dictionary-encoded",validity,data (indices),
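
For illustration only (this is not part of the commit), the dense union example in the diff above can be resolved slot by slot as follows. This is a minimal sketch under stated assumptions: the helper names bit_is_set and dense_union_to_list are hypothetical, type ids are assumed to equal child indices, buffers are modeled as plain Python lists and bytes rather than Arrow memory, and the f child is taken to hold the three slots 1.2, null, 3.4 shown in its value buffer. It shows that, with no validity bitmap on the union itself, nullness comes only from the selected child's own bitmap.

```python
from typing import List, Optional, Sequence, Tuple

# A child array modeled as (values, validity bitmap or None).
# None means "validity bitmap not required", i.e. no nulls in that child.
Child = Tuple[Sequence[object], Optional[bytes]]


def bit_is_set(bitmap: bytes, j: int) -> bool:
    """Least-significant-bit numbering: slot j lives in byte j // 8, bit j % 8."""
    return (bitmap[j // 8] >> (j % 8)) & 1 == 1


def dense_union_to_list(
    type_ids: Sequence[int],
    offsets: Sequence[int],
    children: Sequence[Child],
) -> List[object]:
    """Resolve each union slot through its types and offsets buffers.

    Assumption: each type id is used directly as an index into `children`.
    """
    out: List[object] = []
    for type_id, offset in zip(type_ids, offsets):
        values, validity = children[type_id]
        # The union has no bitmap of its own; only the child decides nullness.
        if validity is not None and not bit_is_set(validity, offset):
            out.append(None)
        else:
            out.append(values[offset])
    return out


# Buffers from the dense union example: [{f=1.2}, null, {f=3.4}, {i=5}]
type_ids = [0, 0, 0, 1]
offsets = [0, 1, 2, 0]
f_child: Child = ([1.2, 0.0, 3.4], bytes([0b00000101]))  # slot 1 is masked
i_child: Child = ([5], None)

resolved = dense_union_to_list(type_ids, offsets, [f_child, i_child])
print(resolved)
```

The masked slot in the f child holds 0.0 here only as a placeholder; per the text above, masked memory may hold any value.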