[ https://issues.apache.org/jira/browse/IMPALA-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Kaszab closed IMPALA-9939.
--------------------------------
Fix Version/s: Not Applicable
Resolution: Won't Fix
> Fix Hive interop for HLL with STRING types
> ------------------------------------------
>
> Key: IMPALA-9939
> URL: https://issues.apache.org/jira/browse/IMPALA-9939
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Major
> Fix For: Not Applicable
>
>
> It turned out that Impala hashes STRINGs differently than Hive.
> Impala's implementation simply hashes the original byte array (e.g. a UTF-8
> encoded string), while Hive hashes the UTF-16 encoded char array behind Java
> strings. If the STRING is cast to BINARY in Hive (e.g.
> ds_hll_sketch(cast(s as binary))), then the resulting sketch is interoperable
> with Impala's current implementation.
> I am not sure how to proceed. We could UTF-16 encode the strings in Impala
> before hashing, but this would be pretty slow. I think Hive could actually
> be faster as well if it hashed the UTF-8 arrays: since STRINGs are stored as
> org.apache.hadoop.io.Text, they are currently UTF-8 decoded to a Java String
> first, and could instead be hashed directly without any conversion. This
> would break compatibility with existing Hive-produced sketches, though.
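
The mismatch described above can be illustrated with a minimal sketch. It only shows that the same STRING yields different byte sequences under UTF-8 and UTF-16, so any byte-oriented hash will disagree. SHA-256 is used purely as a stand-in and is not the hash used by either engine's HLL code; the little-endian byte order for the UTF-16 code units and the helper names are assumptions made for illustration.

```python
import hashlib


def utf8_input(s: str) -> bytes:
    # Bytes as Impala is described as hashing them: the raw UTF-8 encoding.
    return s.encode("utf-8")


def utf16_input(s: str) -> bytes:
    # 16-bit code units, as in the char array behind a Java String.
    # Little-endian byte order is an assumption for this illustration.
    return s.encode("utf-16-le")


def stand_in_hash(data: bytes) -> str:
    # Stand-in byte hash (SHA-256), NOT the hash function used by the HLL sketch.
    return hashlib.sha256(data).hexdigest()


if __name__ == "__main__":
    for s in ["abc", "héllo"]:
        a, b = utf8_input(s), utf16_input(s)
        print(s, len(a), len(b), stand_in_hash(a) == stand_in_hash(b))
```

Even for a pure ASCII value such as "abc" the two inputs differ in length (3 bytes versus 6), which is why sketches built from the same column do not line up unless both sides hash the same byte representation.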