[orc] branch main updated: ORC-1460: Update ORC spec to clarify how dictionary entries are sorted

gangwu Thu, 06 Jul 2023 22:03:23 -0700

This is an automated email from the ASF dual-hosted git repository.

gangwu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git



The following commit(s) were added to refs/heads/main by this push:
     new fb93ed903 ORC-1460: Update ORC spec to clarify how dictionary entries are sorted
fb93ed903 is described below

commit fb93ed9038f5345b94cacaeb2cb602ee7221737e
Author: Valentin Lorentz <[email protected]>
AuthorDate: Fri Jul 7 13:03:06 2023 +0800

    ORC-1460: Update ORC spec to clarify how dictionary entries are sorted
    
    ### What changes were proposed in this pull request?
    
    The spec is updated to clarify that dictionary entries are sorted based on
    the UTF-8 encoding, not on any of the Unicode Collation algorithms.
    
    ### Why are the changes needed?
    
    Not strictly needed, but I had a moment of doubt when reading the spec, and
    had to check the implementation.
    
    ### How was this patch tested?
    
    This matches the C++ implementation:
    
    https://github.com/apache/orc/blob/294a5e28f7f0420eb1fdc76dffc33608692c1b20/c%2B%2B/src/ColumnWriter.cc#L913-L923
    
    Closes #1561 from progval/spec-sort.
    
    Authored-by: Valentin Lorentz <[email protected]>
    Signed-off-by: Gang Wu <[email protected]>
---
 site/specification/ORCv0.md | 3 ++-
 site/specification/ORCv1.md | 3 ++-
 site/specification/ORCv2.md | 3 ++-
 3 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index de3e4b512..f0840f859 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -626,7 +626,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index cb99f6081..28347642e 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -1055,7 +1055,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 6d82e9e96..010de73c9 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -1074,7 +1074,8 @@ the length of each value is written into the LENGTH stream. In direct
 encoding, if the values were ["Nevada", "California"]; the DATA
 would be "NevadaCalifornia" and the LENGTH would be [6, 10].
 
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
 each unique value are placed into DICTIONARY_DATA. The length of each
 item in the dictionary is put into the LENGTH stream. The DATA stream
 consists of the sequence of references to the dictionary elements.
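
As an illustration of the clarified wording, here is a minimal Python sketch of a dictionary sorted in lexicographical order of UTF-8 bytes. The function names `sort_dictionary` and `encode_dictionary` are hypothetical and are not part of any ORC library. The sketch also ignores the integer and compression encodings that real ORC streams apply on top of these values.

```python
def sort_dictionary(values):
    """Return the unique values, ordered by the bytes of their UTF-8 encoding."""
    return sorted(set(values), key=lambda s: s.encode("utf-8"))


def encode_dictionary(values):
    """Build simplified DICTIONARY_DATA, LENGTH and DATA contents (illustrative only)."""
    entries = sort_dictionary(values)
    index = {entry: i for i, entry in enumerate(entries)}
    encoded = [entry.encode("utf-8") for entry in entries]
    dictionary_data = b"".join(encoded)   # concatenated UTF-8 bytes of each unique value
    lengths = [len(b) for b in encoded]   # byte length of each dictionary item
    data = [index[v] for v in values]     # one dictionary reference per row
    return dictionary_data, lengths, data


# Byte order is not a language-aware collation: uppercase ASCII sorts before
# lowercase ASCII, and any multi-byte character sorts after all ASCII.
words = ["zebra", "Zebra", "éclair", "eclair"]
assert sort_dictionary(words) == ["Zebra", "eclair", "zebra", "éclair"]

dictionary_data, lengths, data = encode_dictionary(["Nevada", "California", "Nevada"])
assert dictionary_data == b"CaliforniaNevada"
assert lengths == [10, 6]
assert data == [1, 0, 1]
```

A locale-aware collation would typically place "eclair" and "éclair" next to each other; under byte order they end up at opposite sides of "zebra", which is the distinction the updated wording makes explicit.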
