Hi all,

Quick question about the ORC spec. In the character types encodings section (https://orc.apache.org/docs/encodings.html), it says:
"For dictionary encodings the dictionary is sorted and UTF-8 bytes of each unique value are placed into DICTIONARY_DATA."

Is it a requirement that the dictionary be sorted, or only a suggestion? I don’t see any code that takes advantage of the ordering, and I believe the sorting is only an effort to improve compression of the dictionary.

If it is a requirement, the collation order should be documented. I believe the current implementation uses Java String natural ordering, which compares UTF-16 code units (equivalent to UTF-16 big-endian byte order). That is a bit confusing, since the dictionary itself is UTF-8 encoded.

As a side note, I think this should also be documented in the statistics section, which also uses UTF-16 BE ordering. That is at least consistent, but still annoying for every implementation other than Java.

Thanks,
-dain
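
P.S. To make the ordering concern concrete, here is a minimal Java sketch (not taken from the ORC code base; the class and method names are hypothetical) comparing the same two strings under Java's natural String ordering and under unsigned byte order of their UTF-8 encodings. The two orders only disagree when a supplementary character (a surrogate pair in UTF-16) is compared against a BMP character at or above U+E000.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only; not part of ORC.
final class DictionaryOrderDemo {

    // Java's natural String ordering: compares UTF-16 code units.
    static int compareUtf16(String a, String b) {
        return a.compareTo(b);
    }

    // Lexicographic comparison of the UTF-8 encodings as unsigned bytes.
    static int compareUtf8Bytes(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        // All shared bytes are equal, so the shorter encoding sorts first.
        return x.length - y.length;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";                 // U+FF5E, a single UTF-16 code unit
        String supplementary = "\uD83D\uDE00"; // U+1F600, a surrogate pair

        // UTF-16 code unit order: 0xFF5E > 0xD83D, so bmp sorts after.
        System.out.println(Integer.signum(compareUtf16(bmp, supplementary)));     // prints 1
        // UTF-8 byte order: 0xEF < 0xF0, so bmp sorts before.
        System.out.println(Integer.signum(compareUtf8Bytes(bmp, supplementary))); // prints -1
    }
}
```

So a non-Java writer that sorts the dictionary by raw UTF-8 bytes would produce a different order than the Java writer for inputs like these, which is why I think the spec should state the collation explicitly if sorting is required.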