Csaba Ringhofer created IMPALA-9939:
---------------------------------------

             Summary: Fix Hive interop for HLL with STRING types
                 Key: IMPALA-9939
                 URL: https://issues.apache.org/jira/browse/IMPALA-9939
             Project: IMPALA
          Issue Type: Bug
          Components: Backend
            Reporter: Csaba Ringhofer


It turned out that Impala hashes STRINGs differently than Hive.

Impala's implementation simply hashes the original byte array (e.g. a UTF-8 
encoded string), while Hive hashes the UTF-16 encoded char array behind Java 
strings. If the STRING is cast to BINARY in Hive (e.g. 
ds_hll_sketch(cast(s as binary))), then the resulting sketch is interoperable 
with Impala's current implementation.
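
To illustrate why the two sides disagree, here is a minimal Python sketch of 
the byte sequences each side would feed to the hash. The function names are 
hypothetical, SHA-256 is only a stand-in for the real sketch hash, and the 
little-endian byte order for the UTF-16 code units is an assumption made for 
the example.

```python
import hashlib


def impala_style_input(s: str) -> bytes:
    """Bytes hashed when the raw UTF-8 buffer is used (as described for Impala)."""
    return s.encode("utf-8")


def hive_style_input(s: str) -> bytes:
    """Bytes behind a UTF-16 char array (as described for Hive).

    Assumption: little-endian code units, chosen only for illustration.
    """
    return s.encode("utf-16-le")


def stand_in_hash(data: bytes) -> str:
    """Stand-in for the sketch's hash; NOT the hash used by the HLL library."""
    return hashlib.sha256(data).hexdigest()


for value in ("impala", "caf\u00e9"):
    a = impala_style_input(value)
    b = hive_style_input(value)
    # Different input bytes mean different hashes, even for pure ASCII strings.
    print(value, a.hex(), b.hex(), stand_in_hash(a) == stand_in_hash(b))
```

Even for ASCII-only values the inputs differ, because every UTF-16 code unit 
takes two bytes. Sketches built from the same STRING column therefore do not 
line up unless both sides hash the same bytes.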

I am not sure how to proceed. We could UTF-16 encode the strings in Impala 
before hashing, but this would be pretty slow. I also think that Hive could 
actually be faster if it hashed the UTF-8 bytes: STRINGs are stored as 
org.apache.hadoop.io.Text, so they are currently UTF-8 decoded to a Java 
string first, and they could instead be hashed directly without any 
conversion. This would break compatibility with existing Hive-produced 
sketches, though.
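
The first option (transcoding on the Impala side) could look roughly like the 
Python sketch below. The function name is hypothetical, the little-endian 
byte order is an assumption, and handling of invalid UTF-8 input is left out. 
It is only meant to show the extra decode and encode pass that each value 
would need before hashing.

```python
def transcode_for_hive_compat(raw_utf8: bytes) -> bytes:
    """Turn a raw UTF-8 buffer into UTF-16 code units before hashing.

    Assumptions: the input is valid UTF-8 and the code units are laid out
    little-endian. Invalid byte sequences raise UnicodeDecodeError here.
    """
    text = raw_utf8.decode("utf-8")   # extra pass 1: decode the stored bytes
    return text.encode("utf-16-le")   # extra pass 2: re-encode, new allocation


raw = "caf\u00e9".encode("utf-8")     # 5 bytes as stored
converted = transcode_for_hive_compat(raw)
print(len(raw), len(converted))       # the converted buffer is larger
```

Both passes touch every byte and allocate a new buffer, which is the per-value 
cost the direct UTF-8 approach avoids.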





--
This message was sent by Atlassian Jira
(v8.3.4#803005)
