[
https://issues.apache.org/jira/browse/IMPALA-9633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164121#comment-17164121
]
ASF subversion and git services commented on IMPALA-9633:
---------------------------------------------------------
Commit 9c542ef5891f984300f9e5f45406caf145039e75 in impala's branch
refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=9c542ef ]
IMPALA-9633: Implement ds_hll_union()
This function receives a set of sketches produced by ds_hll_sketch()
and merges them into a single sketch.
An example usage is to create a sketch for each partition of a table
and write these sketches to a separate table. The sketches of whichever
partitions the user is interested in can then be unioned together to
get an estimate. E.g.:
  SELECT
      ds_hll_estimate(ds_hll_union(sketch_col))
  FROM sketch_tbl
  WHERE partition_col=1 OR partition_col=5;
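
To make the merge step concrete, here is a toy HyperLogLog written in
Python. It is only a sketch of the general technique, not Impala's
implementation (which uses the Apache DataSketches library and a
different register layout and hash). The names hll_sketch, hll_union and
hll_estimate, the register count and the hash function are all
assumptions made for this illustration. The point it shows is that the
union is a register-wise maximum, so values that appear in more than one
input sketch are not counted twice.

```python
import hashlib
import math

P = 12          # assumed precision: 2**12 registers (illustrative choice)
M = 1 << P
TAIL_BITS = 64 - P


def _hash64(value):
    """Hypothetical 64-bit hash; real implementations use a different one."""
    digest = hashlib.sha256(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hll_sketch(values):
    """Build a toy sketch: one small integer register per bucket."""
    regs = [0] * M
    for v in values:
        h = _hash64(v)
        idx = h >> TAIL_BITS                      # top P bits pick the register
        tail = h & ((1 << TAIL_BITS) - 1)         # remaining bits
        rank = TAIL_BITS - tail.bit_length() + 1  # position of first 1 bit
        regs[idx] = max(regs[idx], rank)
    return regs


def hll_union(sketches):
    """Merge sketches by taking the register-wise maximum."""
    return [max(column) for column in zip(*sketches)]


def hll_estimate(regs):
    """Classic HLL estimate with the small-range (linear counting) correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    if raw <= 2.5 * M and zeros:
        return M * math.log(M / zeros)
    return raw


# Two "partitions" whose values overlap on 6..10.
part_a = list(range(1, 11))
part_b = list(range(6, 16))
sketch_a = hll_sketch(part_a)
sketch_b = hll_sketch(part_b)
merged = hll_union([sketch_a, sketch_b])
print(hll_estimate(merged))  # close to 15, the true distinct count
```

Because the merge is a register-wise maximum, unioning the two
per-partition sketches gives exactly the sketch that would have been
built from both partitions' data in one pass.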
Note: there is currently a known limitation when unioning sketches of
string types where some input sketches come from Impala and some from
Hive. In this case, if the input data used by Impala and by Hive
overlaps, the overlapping data is still counted twice because of a
string representation difference between Impala and Hive.
For more details see:
https://issues.apache.org/jira/browse/IMPALA-9939
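
The general mechanism behind this kind of double counting can be shown
in a few lines of Python. This is a hedged illustration only: exact sets
of hashes stand in for real sketches, and the two byte encodings below
are hypothetical stand-ins chosen to make the effect visible. The actual
difference between Impala and Hive is the one described in IMPALA-9939
and is not reproduced here.

```python
import hashlib


def item_key(raw: bytes) -> str:
    """Hash the raw bytes that a (hypothetical) engine feeds to its sketch."""
    return hashlib.sha256(raw).hexdigest()


def toy_sketch(values, encode):
    """Exact set of hashed items; a stand-in for a real HLL sketch."""
    return {item_key(encode(v)) for v in values}


def encode_a(s: str) -> bytes:
    # Assumed representation used by engine A (illustrative only).
    return s.encode("utf-8")


def encode_b(s: str) -> bytes:
    # Assumed, different representation used by engine B (illustrative only).
    return s.encode("utf-16-le")


data_a = ["apple", "banana", "cherry"]
data_b = ["cherry", "date"]          # "cherry" overlaps with data_a

# Same representation on both sides: the overlap is counted once.
same = toy_sketch(data_a, encode_a) | toy_sketch(data_b, encode_a)

# Different representations: "cherry" hashes to two different items.
mixed = toy_sketch(data_a, encode_a) | toy_sketch(data_b, encode_b)

print(len(same), len(mixed))
```

With matching representations the union holds 4 distinct items; with
mismatched ones it holds 5, because the shared value is hashed from
different bytes on each side.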
Testing:
- Apart from the automated tests I added to this patch, I also
tested ds_hll_union() on a bigger dataset to check that the
serialization, deserialization and merging steps work well. I
took TPCH25.lineitem, created a number of sketches grouped
by l_shipdate, and called ds_hll_union() on those sketches.
Change-Id: I67cdbf6f3ebdb1296fea38465a15642bc9612d09
Reviewed-on: http://gerrit.cloudera.org:8080/16095
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Implement ds_hll_union() builtin function
> -----------------------------------------
>
> Key: IMPALA-9633
> URL: https://issues.apache.org/jira/browse/IMPALA-9633
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend, Frontend
> Reporter: Gabor Kaszab
> Assignee: Gabor Kaszab
> Priority: Major
>
> ds_hll_union() is an aggregate function that accepts sketches and produces
> a single sketch that is the combination of the received sketches.
> Example from Hive:
> {code:java}
> create temporary table sketch_intermediate (category char(1), sketch binary);
> insert into sketch_intermediate select category, ds_hll_sketch(id) from
> sketch_input group by category;
> select ds_hll_estimate(ds_hll_union(sketch)) from sketch_intermediate;
> {code}
> Some test data for the example:
> {code:java}
> create temporary table sketch_input (id int, category char(1));
> insert into table sketch_input values
> (1, 'a'), (2, 'a'), (3, 'a'), (4, 'a'), (5, 'a'),
> (6, 'a'), (7, 'a'), (8, 'a'), (9, 'a'), (10, 'a'),
> (6, 'b'), (7, 'b'), (8, 'b'), (9, 'b'), (10, 'b'),
> (11, 'b'), (12, 'b'), (13, 'b'), (14, 'b'), (15, 'b');
> {code}
> Approximate result:
> {code:java}
> 15.000000521540663
> {code}
> Hive change that introduced the same:
> https://issues.apache.org/jira/browse/HIVE-22940