[jira] [Created] (IMPALA-9942) DataSketches HLL shouldn't take empty strings as distinct values

Gabor Kaszab (Jira) Fri, 10 Jul 2020 01:34:11 -0700

Gabor Kaszab created IMPALA-9942:
------------------------------------

             Summary: DataSketches HLL shouldn't take empty strings as distinct values
                 Key: IMPALA-9942
                 URL: https://issues.apache.org/jira/browse/IMPALA-9942
             Project: IMPALA
          Issue Type: Improvement
          Components: Backend
    Affects Versions: Impala 4.0
            Reporter: Gabor Kaszab
            Assignee: Gabor Kaszab



Consider a table that has string, char and varchar columns, where some of 
the values in these columns are empty strings.
{code:java}
select * from strings;
+-----+------------+-----+
| s   | c          | v   |
+-----+------------+-----+
|     |            |     |
| abc | abc        | abc |
|     |            |     |
+-----+------------+-----+
{code}
If I query the number of distinct values with DataSketches HLL, the empty 
string adds 1 to the end result.
{code:java}
select ds_hll_estimate(ds_hll_sketch(s)) as hll_string,
       ds_hll_estimate(ds_hll_sketch(c)) as hll_char,
       ds_hll_estimate(ds_hll_sketch(v)) as hll_varchar
from strings;
+------------+----------+-------------+
| hll_string | hll_char | hll_varchar |
+------------+----------+-------------+
| 2          | 2        | 2           |
+------------+----------+-------------+
{code}

However, Hive's implementation omits empty strings, so for the example above 
Hive would return 1 for each column.

I assume Hive omits empty strings because of this line:
https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/BaseHllSketch.java#L351
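
To make the difference concrete, here is a toy model of the two behaviours. It is only a sketch under stated assumptions: an exact Python set stands in for the HLL sketch, the function name estimate_distinct and its skip_empty flag are hypothetical, and skipping NULLs in both modes is an assumption rather than something verified against either engine.

```python
def estimate_distinct(values, skip_empty):
    """Toy stand-in for an HLL sketch: an exact set, not a probabilistic structure."""
    seen = set()
    for value in values:
        if value is None:
            # Assumption: NULLs are never fed into the sketch.
            continue
        if skip_empty and value == "":
            # Models the behaviour attributed to Hive above: empty strings are dropped.
            continue
        seen.add(value)
    return len(seen)


# The three rows of one column from the example table.
rows = ["", "abc", ""]

counting_empty = estimate_distinct(rows, skip_empty=False)  # what Impala reports above
skipping_empty = estimate_distinct(rows, skip_empty=True)   # what Hive would report

print(counting_empty, skipping_empty)
```

With the example rows, the first mode yields 2 and the second yields 1, matching the numbers quoted for Impala and Hive respectively.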

The first step of this task is to decide which approach is correct. The 
second step is to adjust Impala accordingly, if we decide that empty strings 
should be omitted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

