This is an automated email from the ASF dual-hosted git repository.
gangwu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/orc.git
The following commit(s) were added to refs/heads/main by this push:
new fb93ed903 ORC-1460: Update ORC spec to clarify how dictionary entries
are sorted
fb93ed903 is described below
commit fb93ed9038f5345b94cacaeb2cb602ee7221737e
Author: Valentin Lorentz <[email protected]>
AuthorDate: Fri Jul 7 13:03:06 2023 +0800
ORC-1460: Update ORC spec to clarify how dictionary entries are sorted
### What changes were proposed in this pull request?
The spec is updated to clarify that dictionary entries are sorted based on
the UTF-8 encoding, not on any of the Unicode Collation algorithms.
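As an illustration of the ordering described above, here is a minimal Python sketch. It assumes a hypothetical helper named `build_dictionary_streams` that is not part of any ORC library, and it leaves out the integer encoding that a real writer would apply to the LENGTH and DATA streams. It sorts the unique values by their raw UTF-8 bytes and derives the three streams from that order.

```python
def build_dictionary_streams(values):
    """Hypothetical helper: derive dictionary-encoded streams from strings.

    Returns (dictionary_data, lengths, data) as plain Python objects; a real
    ORC writer would additionally encode the integer streams.
    """
    encoded = [v.encode("utf-8") for v in values]

    # Python compares bytes objects lexicographically by unsigned byte value,
    # which is the ordering the clarified spec text describes.
    dictionary = sorted(set(encoded))

    # Map each dictionary entry to its position in the sorted dictionary.
    index = {entry: i for i, entry in enumerate(dictionary)}

    dictionary_data = b"".join(dictionary)        # DICTIONARY_DATA
    lengths = [len(entry) for entry in dictionary]  # LENGTH (byte lengths)
    data = [index[entry] for entry in encoded]      # DATA (dictionary references)
    return dictionary_data, lengths, data


if __name__ == "__main__":
    print(build_dictionary_streams(["Nevada", "California", "Nevada"]))
    print(build_dictionary_streams(["apple", "Éclair", "Zebra", "apple"]))
```

In the second call, byte order gives "Zebra", "apple", "Éclair", because 0x5A < 0x61 < 0xC3. A collation-based sort would typically order these differently, for example by placing "apple" before "Zebra".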
### Why are the changes needed?
Not strictly needed, but I had a moment of doubt when reading the spec, and
had to check the implementation.
### How was this patch tested?
This matches the C++ implementation:
https://github.com/apache/orc/blob/294a5e28f7f0420eb1fdc76dffc33608692c1b20/c%2B%2B/src/ColumnWriter.cc#L913-L923
Closes #1561 from progval/spec-sort.
Authored-by: Valentin Lorentz <[email protected]>
Signed-off-by: Gang Wu <[email protected]>
---
site/specification/ORCv0.md | 3 ++-
site/specification/ORCv1.md | 3 ++-
site/specification/ORCv2.md | 3 ++-
3 files changed, 6 insertions(+), 3 deletions(-)
diff --git a/site/specification/ORCv0.md b/site/specification/ORCv0.md
index de3e4b512..f0840f859 100644
--- a/site/specification/ORCv0.md
+++ b/site/specification/ORCv0.md
@@ -626,7 +626,8 @@ the length of each value is written into the LENGTH stream.
In direct
encoding, if the values were ["Nevada", "California"]; the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv1.md b/site/specification/ORCv1.md
index cb99f6081..28347642e 100644
--- a/site/specification/ORCv1.md
+++ b/site/specification/ORCv1.md
@@ -1055,7 +1055,8 @@ the length of each value is written into the LENGTH
stream. In direct
encoding, if the values were ["Nevada", "California"]; the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.
diff --git a/site/specification/ORCv2.md b/site/specification/ORCv2.md
index 6d82e9e96..010de73c9 100644
--- a/site/specification/ORCv2.md
+++ b/site/specification/ORCv2.md
@@ -1074,7 +1074,8 @@ the length of each value is written into the LENGTH
stream. In direct
encoding, if the values were ["Nevada", "California"]; the DATA
would be "NevadaCalifornia" and the LENGTH would be [6, 10].
-For dictionary encodings the dictionary is sorted and UTF-8 bytes of
+For dictionary encodings the dictionary is sorted (in lexicographical
+order of bytes in the UTF-8 encodings) and UTF-8 bytes of
each unique value are placed into DICTIONARY_DATA. The length of each
item in the dictionary is put into the LENGTH stream. The DATA stream
consists of the sequence of references to the dictionary elements.