[
https://issues.apache.org/jira/browse/IMPALA-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154792#comment-17154792
]
Tim Armstrong commented on IMPALA-9939:
---------------------------------------
+1 on adopting normalised UTF-8 as the standard representation. Most file
formats (Parquet included) use UTF-8 as the internal representation, so hashing
anything else would limit how much it could be optimised in future.
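A note on why "normalised" matters here, with an illustrative Java sketch (not Hive or Impala code): even when both engines hash UTF-8 bytes, canonically equivalent strings can have different byte sequences, so a standard representation also needs a normalisation form such as NFC. The example below uses the standard-library `java.text.Normalizer`:

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;
import java.util.Arrays;

// Illustrative only: "é" can be one code point (U+00E9) or two
// (U+0065 + combining U+0301). The raw UTF-8 bytes differ, so a
// byte-oriented hash would disagree, but NFC collapses both to the
// same form.
public class NfcDemo {
    public static void main(String[] args) {
        String composed = "\u00e9";     // é as a single code point
        String decomposed = "e\u0301";  // e + combining acute accent

        // Raw UTF-8 encodings differ:
        System.out.println("raw bytes equal: " + Arrays.equals(
            composed.getBytes(StandardCharsets.UTF_8),
            decomposed.getBytes(StandardCharsets.UTF_8)));  // false

        // After NFC normalisation both yield the same string/bytes:
        String n1 = Normalizer.normalize(composed, Normalizer.Form.NFC);
        String n2 = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println("NFC equal: " + n1.equals(n2));  // true
    }
}
```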
> Fix Hive interop for HLL with STRING types
> ------------------------------------------
>
> Key: IMPALA-9939
> URL: https://issues.apache.org/jira/browse/IMPALA-9939
> Project: IMPALA
> Issue Type: Bug
> Components: Backend
> Reporter: Csaba Ringhofer
> Priority: Major
>
> It turned out that Impala hashes STRINGs differently than Hive.
> Impala's implementation simply hashes the original byte array (e.g. a UTF-8
> encoded string), while Hive hashes the UTF-16 encoded char array behind Java
> strings. If the STRING is cast to BINARY in Hive (e.g.
> ds_hll_sketch(cast(s as binary))), then it is interoperable with Impala's
> current implementation.
> I am not sure how to proceed - we could UTF-16 encode the strings in Impala
> before hashing, but this would be pretty slow. I also think Hive could
> actually be faster if it hashed UTF-8 arrays: since STRINGs are stored as
> org.apache.hadoop.io.Text, they are currently UTF-8 decoded to a Java string
> first, but could be hashed directly without any conversion. This would break
> compatibility with existing Hive-produced sketches, though.
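A minimal illustration of the mismatch described above (not Hive's or Impala's actual code path): the UTF-8 bytes that Impala hashes and the UTF-16 code units behind a Java string are different byte sequences, even for pure ASCII, so any byte-oriented hash fed into the HLL sketch produces different values:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only. Impala hashes the raw UTF-8 bytes of a STRING;
// Hive hashes the UTF-16 char array behind java.lang.String. UTF-16
// uses two bytes per code unit, so even "hello" encodes differently,
// and a byte-oriented hash (e.g. MurmurHash3) disagrees on the two.
public class EncodingMismatchDemo {
    public static void main(String[] args) {
        String s = "hello";  // plain ASCII is enough to show the mismatch

        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);     // what Impala hashes
        byte[] utf16 = s.getBytes(StandardCharsets.UTF_16LE); // ~ Java's char[] layout

        System.out.println("utf8 length:  " + utf8.length);   // 5
        System.out.println("utf16 length: " + utf16.length);  // 10
        System.out.println("same bytes:   " + Arrays.equals(utf8, utf16)); // false
    }
}
```

This is also why the `cast(s as binary)` workaround quoted above restores interoperability: it makes Hive hash the raw UTF-8 bytes instead of the decoded char array.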
--
This message was sent by Atlassian Jira
(v8.3.4#803005)