[ 
https://issues.apache.org/jira/browse/IMPALA-9939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164122#comment-17164122
 ] 

ASF subversion and git services commented on IMPALA-9939:
---------------------------------------------------------

Commit 9c542ef5891f984300f9e5f45406caf145039e75 in impala's branch 
refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9c542ef ]

IMPALA-9633: Implement ds_hll_union()

This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.

An example usage is to create a sketch for each partition of a table,
write these sketches to a separate table and then, depending on which
partitions the user is interested in, union the relevant sketches
together to get an estimate. E.g.:
  SELECT
      ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;
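
The merge step above can be illustrated with a toy model. The Python
sketch below is not the Apache DataSketches implementation behind
ds_hll_union(); it assumes a simplified HyperLogLog in which a sketch is
a plain list of registers and a union is the register-wise maximum. All
names (new_sketch, update, union) and the blake2b stand-in hash are
hypothetical and chosen only for illustration.

```python
import hashlib

P = 10          # number of index bits (assumed value for this toy)
M = 1 << P      # number of registers


def _hash64(value: str) -> int:
    # Stand-in 64-bit hash; the real sketch library uses its own hash.
    digest = hashlib.blake2b(value.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def new_sketch() -> list:
    return [0] * M


def update(sketch: list, value: str) -> None:
    h = _hash64(value)
    idx = h >> (64 - P)                      # top P bits select a register
    rest = h & ((1 << (64 - P)) - 1)         # remaining 64 - P bits
    # 1-based position of the leftmost 1-bit in the remaining bits.
    rank = (64 - P) - rest.bit_length() + 1
    sketch[idx] = max(sketch[idx], rank)


def union(sketches) -> list:
    # Merging is a register-wise maximum, so repeated values add nothing.
    merged = new_sketch()
    for s in sketches:
        merged = [max(a, b) for a, b in zip(merged, s)]
    return merged


# One sketch per partition, as in the sketch_tbl example above.
partitions = {1: ["a", "b", "c"], 5: ["c", "d"], 7: ["e"]}
per_partition = {}
for key, values in partitions.items():
    per_partition[key] = new_sketch()
    for v in values:
        update(per_partition[key], v)

# Union only the partitions of interest (partition_col = 1 or 5).
merged = union([per_partition[1], per_partition[5]])

# Reference: a single sketch built directly over the same rows.
direct = new_sketch()
for v in partitions[1] + partitions[5]:
    update(direct, v)

print(merged == direct)
```

In this toy model the merged sketch is identical to one built directly
over the selected rows, which is why a value present in several
partitions ("c" here) is not counted twice, provided every input sketch
hashes it the same way.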

Note: there is currently a known limitation when unioning sketches of
string types where some input sketches come from Impala and some from
Hive. In this case, if the input data used by Impala overlaps with the
input data used by Hive, the overlapping data is still counted twice
due to a string representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939

Testing:
  - Apart from the automated tests added in this patch, I also tested
    ds_hll_union() on a bigger dataset to check that the
    serialization, deserialization and merging steps work well. I
    took TPCH25.lineitem, created a number of sketches grouped by
    l_shipdate and called ds_hll_union() on those sketches.

Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Reviewed-on: http://gerrit.cloudera.org:8080/16095
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Fix Hive interop for HLL with STRING types
> ------------------------------------------
>
>                 Key: IMPALA-9939
>                 URL: https://issues.apache.org/jira/browse/IMPALA-9939
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> It turned out that Impala hashes STRINGs differently than Hive.
> Impala's implementation simply hashes the original byte array (e.g. a UTF-8
> encoded string), while Hive hashes the UTF-16 encoded char array behind Java
> strings. If the STRING is cast to BINARY in Hive (e.g.
> ds_hll_sketch(cast(s as binary))), then it is interoperable with Impala's
> current implementation.
> I am not sure how to proceed. We could UTF-16 encode the strings in Impala
> before hashing, but this would be pretty slow. I think Hive could actually
> be faster as well if it hashed UTF-8 arrays: STRINGs are stored as
> org.apache.hadoop.io.Text, so they are currently UTF-8 decoded to a Java
> String first, whereas they could be hashed directly without any conversion.
> This would break compatibility with existing Hive-produced sketches, though.
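
The mismatch described in the issue can be reproduced in miniature. The
Python sketch below is only an illustration under stated assumptions:
blake2b stands in for the sketch library's real hash function, the
UTF-16 byte order (little-endian) is assumed, and the helper names
impala_style_key and hive_style_key are hypothetical. It shows that the
same string yields different hash keys depending on whether its UTF-8
bytes or its UTF-16 code units are hashed.

```python
import hashlib


def stand_in_hash(data: bytes) -> int:
    # Stand-in 64-bit hash; NOT the hash used by Impala or Hive.
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little")


def impala_style_key(s: str) -> int:
    # Hash the raw UTF-8 byte array, as the issue describes for Impala.
    return stand_in_hash(s.encode("utf-8"))


def hive_style_key(s: str) -> int:
    # Hash the UTF-16 code units, as the issue describes for Hive.
    # Little-endian byte order is an assumption made for this example.
    return stand_in_hash(s.encode("utf-16-le"))


values = ["apple", "banana", "cherry"]
impala_keys = {impala_style_key(v) for v in values}
hive_keys = {hive_style_key(v) for v in values}

# A union only sees hash keys, so keys that differ look like new values.
combined = impala_keys | hive_keys
print(len(values), len(combined))
```

With three distinct input strings, the combined key set is expected to
hold six entries rather than three, which mirrors the double counting
seen when Impala-produced and Hive-produced sketches are unioned.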



