This is an automated email from the ASF dual-hosted git repository.

dongjoon pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git


The following commit(s) were added to refs/heads/main by this push:
     new 25fb75550 ORC-1409: [Docs] Add stream order description in ORC spec
25fb75550 is described below

commit 25fb75550eed7998698e795c184d6eb883ba7729
Author: deshanxiao <[email protected]>
AuthorDate: Tue May 16 13:19:24 2023 -0700

    ORC-1409: [Docs] Add stream order description in ORC spec
    
    ### What changes were proposed in this pull request?
    This PR aims to add more description of the stream order to the ORC spec.
    
    ### Why are the changes needed?
    Many users are misled by the order of the streams in the document tables; in fact, the streams have no fixed order.
    
    #1450
    
    ### How was this patch tested?
    
    Closes #1465 from deshanxiao/add-order-description.
    
    Authored-by: deshanxiao <[email protected]>
    Signed-off-by: Dongjoon Hyun <[email protected]>
---
 site/specification/ORCv0.md | 28 +++++++++++++++++++++++++++-
 site/specification/ORCv1.md | 27 ++++++++++++++++++++++++++-
 site/specification/ORCv2.md | 27 ++++++++++++++++++++++++++-
 3 files changed, 79 insertions(+), 3 deletions(-)
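
For illustration only (not part of the commit): the rule documented by this patch, that stream placement is read from the StripeFooter rather than inferred from the order of the spec tables, can be sketched as below. This is a minimal Python sketch, assuming the footer's stream list records streams in their physical order, so that each stream's offset is the running sum of the lengths before it. The `Stream` dataclass and `locate_streams` function are hypothetical names chosen for this example, not part of any ORC library.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Stream:
    """Hypothetical stand-in for one entry of the StripeFooter streams field."""
    kind: str    # e.g. "ROW_INDEX", "PRESENT", "DATA"
    column: int  # column id
    length: int  # number of bytes occupied by the stream


def locate_streams(
    stripe_offset: int, streams: List[Stream]
) -> Dict[Tuple[int, str], Tuple[int, int]]:
    """Map (column id, stream kind) to (absolute offset, length).

    Assumption: entries are listed in physical order, so the offset of
    each stream is the stripe offset plus the lengths of all earlier entries.
    """
    locations: Dict[Tuple[int, str], Tuple[int, int]] = {}
    offset = stripe_offset
    for stream in streams:
        locations[(stream.column, stream.kind)] = (offset, stream.length)
        offset += stream.length
    return locations


if __name__ == "__main__":
    # A writer chose to place DATA before PRESENT for column 1.
    footer_streams = [
        Stream("ROW_INDEX", 1, 10),
        Stream("DATA", 1, 100),
        Stream("PRESENT", 1, 20),
    ]
    locations = locate_streams(1000, footer_streams)
    print(locations[(1, "PRESENT")])  # (1110, 20)
```

A reader that assumed PRESENT always precedes DATA would look for the PRESENT stream at offset 1010 here, which is why the lookup goes through the footer entries instead.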

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index 3ca477212..de3e4b512 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -501,6 +501,24 @@ uses three streams PRESENT, DATA, and LENGTH, which stores the length
 of each value. The details of each type will be presented in the
 following subsections.
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -566,7 +584,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -581,6 +599,8 @@ Encoding  | Stream Kind | Optional | Contents
 DIRECT    | PRESENT     | Yes      | Boolean RLE
           | DATA        | No       | Signed Integer RLE v1
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -789,3 +809,9 @@ indexes error-prone.
 Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
+
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index fd18ae0b8..cb99f6081 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -895,6 +895,24 @@ The layout of each stripe looks like:
    * encryption variant 1..N
 * stripe footer
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -993,7 +1011,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -1010,6 +1028,8 @@ DIRECT    | PRESENT     | Yes      | Boolean RLE
 DIRECT_V2 | PRESENT     | Yes      | Boolean RLE
           | DATA        | No       | Signed Integer RLE v2
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -1241,6 +1261,11 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
 ## Bloom Filter Index
 
 Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 73d89cde4..6d82e9e96 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -914,6 +914,24 @@ The layout of each stripe looks like:
    * encryption variant 1..N
 * stripe footer
 
+There is a general order for index and data streams:
+* Index streams are always placed together in the beginning of the stripe.
+* Data streams are placed together after index streams (if any).
+* Inside index streams or data streams, the unencrypted streams should be
+  placed first and then followed by streams grouped by each encryption variant.
+
+There is no fixed order within each unencrypted or encryption variant in the
+index and data streams:
+* Different stream kinds of the same column can be placed in any order.
+* Streams from different columns can even be placed in any order.
+  To get the precise information (a.k.a stream kind, column id and location) of
+  a stream within a stripe, the streams field in the StripeFooter described below
+  is the single source of truth.
+
+In the example of the integer column mentioned above, the order of the
+PRESENT stream and the DATA stream cannot be determined in advance.
+We need to get the precise information by **StripeFooter**.
+
 ## Stripe Footer
 
 The stripe footer contains the encoding of each column and the
@@ -1012,7 +1030,7 @@ message ColumnEncoding {
 }
 ```
 
-# Column Encodings
+# <a id="column-encoding-section">Column Encodings</a>
 
 ## SmallInt, Int, and BigInt Columns
 
@@ -1029,6 +1047,8 @@ DIRECT    | PRESENT     | Yes      | Boolean RLE
 DIRECT_V2 | PRESENT     | Yes      | Boolean RLE
           | DATA        | No       | Signed Integer RLE v2
 
+> Note that the order of the Stream is not fixed. It also applies to other Column types.
+
 ## Float and Double Columns
 
 Floating point types are stored using IEEE 754 floating point bit
@@ -1257,6 +1277,11 @@ Because dictionaries are accessed randomly, there is not a position to
 record for the dictionary and the entire dictionary must be read even
 if only part of a stripe is being read.
 
+Note that for columns with multiple streams, the order of stream
+positions in the RowIndex is **fixed**, which may be different to
+the actual data stream placement, and it is the same as
+[Column Encodings](#column-encoding-section) section we described above.
+
 ## Bloom Filter Index
 
 Bloom Filters are added to ORC indexes from Hive 1.2.0 onwards.
