Hi all,

Quick question about the ORC spec.  In the character types encodings section 
(https://orc.apache.org/docs/encodings.html), it says:

  For dictionary encodings the dictionary is sorted and UTF-8 bytes of each 
unique value are placed into DICTIONARY_DATA.

Is it a requirement that the dictionary be sorted or a suggestion?  

I don’t see any code that takes advantage of the sort order, and I believe it
is only there to improve compression of the dictionary.  If it is a
requirement, the collation order should be documented.  I believe the current
implementation uses Java String natural ordering, which compares UTF-16 code
units (the same order as UTF-16 big-endian bytes).  That is a bit confusing
since the dictionary is UTF-8 encoded, and the two orders can disagree once
supplementary characters are involved.
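
To make the mismatch concrete, here is a minimal Java sketch.  The class and
method names are hypothetical and not taken from the ORC code base; it only
assumes that "natural ordering" means String.compareTo and that a UTF-8 order
would compare the encoded bytes as unsigned values.  It compares U+FF5E (a BMP
character above the surrogate range) with U+1F600 (a supplementary character):

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class DictionaryOrderSketch {
    // Compare two strings by their UTF-8 bytes, treating each byte as unsigned.
    // This is equivalent to comparing by Unicode code point.
    static int compareUtf8(String a, String b) {
        byte[] x = a.getBytes(StandardCharsets.UTF_8);
        byte[] y = b.getBytes(StandardCharsets.UTF_8);
        int n = Math.min(x.length, y.length);
        for (int i = 0; i < n; i++) {
            int diff = (x[i] & 0xFF) - (y[i] & 0xFF);
            if (diff != 0) {
                return diff;
            }
        }
        return x.length - y.length;
    }

    // Sort with String.compareTo, i.e. by UTF-16 code unit.
    static String[] sortedNatural(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy);
        return copy;
    }

    // Sort by unsigned UTF-8 byte order.
    static String[] sortedUtf8(String... values) {
        String[] copy = values.clone();
        Arrays.sort(copy, DictionaryOrderSketch::compareUtf8);
        return copy;
    }

    public static void main(String[] args) {
        String bmp = "\uFF5E";         // U+FF5E, UTF-8 bytes EF BD 9E
        String supp = "\uD83D\uDE00";  // U+1F600, UTF-8 bytes F0 9F 98 80

        // UTF-16 code units: 0xFF5E > 0xD83D, so bmp sorts after supp.
        System.out.println(bmp.compareTo(supp) > 0);    // true
        // UTF-8 bytes: 0xEF < 0xF0, so bmp sorts before supp.
        System.out.println(compareUtf8(bmp, supp) < 0); // true
    }
}
```

So a reader written in another language that sorts or binary-searches the
dictionary by raw UTF-8 bytes would see a different order for this pair than a
writer that sorted with String.compareTo.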

As a side note, I think the ordering should also be documented in the
statistics section, which also uses UTF-16 BE.  That is at least consistent,
but still annoying for everything other than Java.

Thanks,

-dain
