This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git


The following commit(s) were added to refs/heads/main by this push:
     new fb93ed903 ORC-1460: Update ORC spec to clarify how dictionary entries are sorted
fb93ed903 is described below

commit fb93ed9038f5345b94cacaeb2cb602ee7221737e
Author: Valentin Lorentz <[email protected]>
AuthorDate: Fri Jul 7 13:03:06 2023 +0800

    ORC-1460: Update ORC spec to clarify how dictionary entries are sorted
    
    ### What changes were proposed in this pull request?
    
    The spec is updated to clarify that dictionary entries are sorted based on the UTF-8 encoding, not on any of the Unicode Collation algorithms.
    
    ### Why are the changes needed?
    
    Not strictly needed, but I had a moment of doubt when reading the spec, and had to check the implementation.
    
    ### How was this patch tested?
    
    This matches the C++ implementation:
    
    
    https://github.com/apache/orc/blob/294a5e28f7f0420eb1fdc76dffc33608692c1b20/c%2B%2B/src/ColumnWriter.cc#L913-L923
    
    Closes #1561 from progval/spec-sort.
    
    Authored-by: Valentin Lorentz <[email protected]>
    Signed-off-by: Gang Wu <[email protected]>
---
 site/specification/ORCv0.md | 3 ++-
 site/specification/ORCv1.md | 3 ++-
 site/specification/ORCv2.md | 3 ++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index de3e4b512..f0840f859 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -626,7 +626,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index cb99f6081..28347642e 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -1055,7 +1055,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 6d82e9e96..010de73c9 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -1074,7 +1074,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
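
To illustrate the ordering the patched paragraph describes, here is a minimal Python sketch. It is not taken from the ORC code base: `encode_dictionary` is a hypothetical helper, the sample values are made up, and the result is returned as plain Python lists, so the run-length encoding a real writer applies to the LENGTH and DATA streams is left out. It relies only on the fact that Python compares `bytes` objects lexicographically by unsigned byte value.

```python
def encode_dictionary(values):
    """Hypothetical helper, not an ORC API.

    Returns (dictionary_data, lengths, references) for a list of strings,
    with the dictionary sorted by the bytes of each value's UTF-8 encoding.
    """
    encoded = [v.encode("utf-8") for v in values]
    # bytes objects compare lexicographically by unsigned byte value,
    # so sorting them gives the byte-wise order described in the spec text.
    dictionary = sorted(set(encoded))
    index = {entry: i for i, entry in enumerate(dictionary)}
    dictionary_data = b"".join(dictionary)          # DICTIONARY_DATA contents
    lengths = [len(entry) for entry in dictionary]  # LENGTH values, in bytes
    references = [index[e] for e in encoded]        # DATA: one reference per row
    return dictionary_data, lengths, references


# Made-up sample values, extending the ["Nevada", "California"] example.
values = ["Nevada", "California", "nevada", "Éire", "Nevada"]
dictionary_data, lengths, references = encode_dictionary(values)

print(dictionary_data)  # b'CaliforniaNevadanevada\xc3\x89ire'
print(lengths)          # [10, 6, 6, 5]
print(references)       # [1, 0, 2, 3, 1]
```

Under byte-wise ordering "nevada" sorts after "Nevada" (0x6E > 0x4E) and "Éire" sorts last because its first UTF-8 byte is 0xC3, whereas a locale-aware collation would typically place "Éire" near the other E entries. Note also that the lengths count bytes, not characters: "Éire" has four characters but occupies five bytes.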
