This is an automated email from the ASF dual-hosted git repository.
dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/main by this push:
new 25fb75550 ORC-1409: [Docs] Add stream order description in ORC spec
25fb75550 is described below
commit 25fb75550eed7998698e795c184d6eb883ba7729
Author: deshanxiao <[email protected]>
AuthorDate: Tue May 16 13:19:24 2023 -0700
ORC-1409: [Docs] Add stream order description in ORC spec
### What changes were proposed in this pull request?
This PR aims to add more description about stream order to the ORC spec.
### Why are the changes needed?
Many users are misled by the order of the streams in the documentation tables;
in fact, the streams have no fixed order.
#1450
### How was this patch tested?
Closes #1465 from deshanxiao/add-order-description.
Authored-by: deshanxiao <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
---
site/specification/ORCv0.md | 28 +++++++++++++++++++++++++++-
site/specification/ORCv1.md | 27 ++++++++++++++++++++++++++-
site/specification/ORCv2.md | 27 ++++++++++++++++++++++++++-
3 files changed, 79 insertions(+), 3 deletions(-)
diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 3ca477212..de3e4b512 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -501,6 +501,24 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
of each value. The details of each type will be presented in the
following subsections.
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+ placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+ To get the precise information (a.k.a stream kind, column id and location) of
+ a stream within a stripe, the streams field in the StripeFooter described below
+ is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
## Stripe Footer
The stripe footer contains the encoding of each column and the
@@ -566,7 +584,7 @@ message ColumnEncoding {
}
```
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
## SmallInt, Int, and BigInt Columns
@@ -581,6 +599,8 @@ Encoding | Stream Kind | Optional | Contents
DIRECT | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v1
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
@@ -789,3 +809,9 @@ indexes error-prone.
Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index fd18ae0b8..cb99f6081 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -895,6 +895,24 @@ The layout of each stripe looks like:
* encryption variant 1..N
* stripe footer
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+ placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+ To get the precise information (a.k.a stream kind, column id and location) of
+ a stream within a stripe, the streams field in the StripeFooter described below
+ is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
## Stripe Footer
The stripe footer contains the encoding of each column and the
@@ -993,7 +1011,7 @@ message ColumnEncoding {
}
```
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
## SmallInt, Int, and BigInt Columns
@@ -1010,6 +1028,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
@@ -1241,6 +1261,11 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
## Bloom Filter Index
Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 73d89cde4..6d82e9e96 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -914,6 +914,24 @@ The layout of each stripe looks like:
* encryption variant 1..N
* stripe footer
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+ placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+ To get the precise information (a.k.a stream kind, column id and location) of
+ a stream within a stripe, the streams field in the StripeFooter described below
+ is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
## Stripe Footer
The stripe footer contains the encoding of each column and the
@@ -1012,7 +1030,7 @@ message ColumnEncoding {
}
```
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
## SmallInt, Int, and BigInt Columns
@@ -1029,6 +1047,8 @@ DIRECT | PRESENT | Yes | Boolean RLE
DIRECT_V2 | PRESENT | Yes | Boolean RLE
| DATA | No | Signed Integer RLE v2
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
## Float and Double Columns
Floating point types are stored using IEEE 754 floating point bit
@@ -1257,6 +1277,11 @@ Because dictionaries are accessed randomly, there is not a position to
record for the dictionary and the entire dictionary must be read even
if only part of a stripe is being read.
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
## Bloom Filter Index
Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
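
The rule this commit documents (the streams field in the StripeFooter is the single source of truth for stream placement) can be illustrated with a minimal Python sketch. It is not taken from any ORC implementation: `StreamInfo` and `locate_streams` are hypothetical names, and the sketch assumes that each footer entry carries a kind, a column id, and a length, and that a stream's start is found by accumulating the lengths of the entries before it, beginning at the stripe's offset.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class StreamInfo:
    """Hypothetical stand-in for one entry of the StripeFooter streams field."""
    kind: str     # e.g. "ROW_INDEX", "PRESENT", "DATA"
    column: int   # column id
    length: int   # stream length in bytes


def locate_streams(
    stripe_offset: int, streams: List[StreamInfo]
) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Map (column, kind) to (start offset, length).

    Assumption: streams are laid out back to back in footer order, so the
    start of each stream is the stripe offset plus the lengths before it.
    """
    locations: Dict[Tuple[int, str], Tuple[int, int]] = {}
    offset = stripe_offset
    for stream in streams:
        locations[(stream.column, stream.kind)] = (offset, stream.length)
        offset += stream.length
    return locations


# A writer that happens to place DATA before PRESENT for column 1.
footer_streams = [
    StreamInfo("ROW_INDEX", 1, 20),
    StreamInfo("DATA", 1, 100),
    StreamInfo("PRESENT", 1, 10),
]
locations = locate_streams(1000, footer_streams)
```

Because the lookup is keyed by column id and stream kind rather than by position, the reader works the same whether the writer placed PRESENT before or after DATA.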