[ https://issues.apache.org/jira/browse/DATASKETCHES-8?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Adam Tamas updated DATASKETCHES-8:
----------------------------------
    Description: 
When using the ds_hll functions, Hive does not count empty strings as distinct
values for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:

{code:java}
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+
{code}


select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), 
ds_hll_estimate(ds_hll_sketch(v)) from t;


{code:java}
+--------------------+--------------------+--------------------+
|        _c0         |        _c1         |        _c2         |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+
{code}
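
For comparison, the exact distinct counts for the six rows above can be worked out with a short Python sketch. It uses plain sets rather than an HLL sketch, and the helper name distinct_count and its skip_empty flag are hypothetical, introduced only to illustrate how dropping empty strings would give 3 instead of 4. That char(1) pads the empty value to a single space is an assumption; if it holds, it would be consistent with the c column reporting about 4.

```python
def distinct_count(values, skip_empty=False):
    """Exact distinct count; optionally ignore empty strings (hypothetical helper)."""
    seen = set()
    for v in values:
        if skip_empty and v == "":
            continue  # emulate an aggregator that drops empty strings
        seen.add(v)
    return len(seen)


# Values of the string column s from the example table.
s_values = ["", "a", "", "a", "s", "d"]

# Assumption: char(1) pads the empty value to one space, so it is no longer empty.
c_values = [v.ljust(1) for v in s_values]

exact = distinct_count(s_values)                     # empty string counted
dropped = distinct_count(s_values, skip_empty=True)  # empty string ignored
padded = distinct_count(c_values, skip_empty=True)   # padded value survives

print(exact, dropped, padded)
```

Under these assumptions the sketch yields 4, 3 and 4, so the expected estimate for all three columns would be about 4.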


  was:
When using the ds_hll functions, Hive does not count empty strings as distinct
values for string and varchar columns.

Example:
Given a table t with the following (string, char(1), varchar(1)) values:
+------+------+------+
| t.s  | t.c  | t.v  |
+------+------+------+
|      |      |      |
| a    | a    | a    |
|      |      |      |
| a    | a    | a    |
| s    | s    | s    |
| d    | d    | d    |
+------+------+------+

select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), 
ds_hll_estimate(ds_hll_sketch(v)) from t;

+--------------------+--------------------+--------------------+
|        _c0         |        _c1         |        _c2         |
+--------------------+--------------------+--------------------+
| 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
+--------------------+--------------------+--------------------+


> HLL doesn't take empty strings as distinct values
> -------------------------------------------------
>
>                 Key: DATASKETCHES-8
>                 URL: https://issues.apache.org/jira/browse/DATASKETCHES-8
>             Project: Apache Datasketches
>          Issue Type: Bug
>            Reporter: Adam Tamas
>            Priority: Major
>
> When using the ds_hll functions, Hive does not count empty strings as distinct
> values for string and varchar columns.
> Example:
> Given a table t with the following (string, char(1), varchar(1)) values:
> {code:java}
> +------+------+------+
> | t.s  | t.c  | t.v  |
> +------+------+------+
> |      |      |      |
> | a    | a    | a    |
> |      |      |      |
> | a    | a    | a    |
> | s    | s    | s    |
> | d    | d    | d    |
> +------+------+------+
> {code}
> select ds_hll_estimate(ds_hll_sketch(s)), ds_hll_estimate(ds_hll_sketch(c)), 
> ds_hll_estimate(ds_hll_sketch(v)) from t;
> {code:java}
> +--------------------+--------------------+--------------------+
> |        _c0         |        _c1         |        _c2         |
> +--------------------+--------------------+--------------------+
> | 3.000000014901161  | 4.000000029802323  | 3.000000014901161  |
> +--------------------+--------------------+--------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@datasketches.apache.org
For additional commands, e-mail: dev-h...@datasketches.apache.org
