[ 
https://issues.apache.org/jira/browse/IMPALA-10019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17185346#comment-17185346
 ] 

ASF subversion and git services commented on IMPALA-10019:
----------------------------------------------------------

Commit a8a35edbc407e02187279ac8d090a345d026c57a in impala's branch 
refs/heads/master from Gabor Kaszab
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=a8a35ed ]

IMPALA-10019: Implement ds_kll_pmf_as_string() function

This adds support for the Probability Mass Function (PMF) from the
Apache DataSketches KLL algorithm collection. The function receives a
serialized KLL sketch and one or more float values that act as split
points dividing the sketched values into ranges.
E.g. [1, 5, 10] defines the following ranges:
(-inf, 1), [1, 5), [5, 10), [10, +inf)
It returns a comma-separated string in which each value is a number in
the range [0, 1] giving the fraction of the data that falls into the
corresponding range.
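
The range semantics can be illustrated with a minimal Python sketch.
It is not the Impala or DataSketches implementation: it counts exact
values instead of querying a KLL sketch, so it only shows how split
points map to half-open ranges. The names exact_pmf and pmf_as_string
are hypothetical, and the "%g"-style number formatting is an
assumption rather than Impala's documented behavior.

```python
import bisect


def exact_pmf(values, split_points):
    """Return the fraction of values in each range defined by split_points.

    With n split points there are n + 1 ranges:
    (-inf, s0), [s0, s1), ..., [s(n-1), +inf).
    This is an exact count, not a KLL sketch estimate.
    """
    values = list(values)
    split_points = list(split_points)
    if not values:
        raise ValueError("values must not be empty")
    if split_points != sorted(set(split_points)):
        raise ValueError("split points must be unique and increasing")

    counts = [0] * (len(split_points) + 1)
    for v in values:
        # bisect_right puts a value equal to a split point into the range
        # that starts at that split point (inclusive lower bound).
        counts[bisect.bisect_right(split_points, v)] += 1

    total = len(values)
    return [c / total for c in counts]


def pmf_as_string(fractions):
    """Join the fractions into a comma-separated string (assumed format)."""
    return ",".join(format(f, "g") for f in fractions)


if __name__ == "__main__":
    fractions = exact_pmf([1, 2, 3, 4, 5], [1, 3, 4, 5, 10])
    print(fractions)
    print(pmf_as_string(fractions))
```

For the input values 1, 2, 3, 4, 5 and split points 1, 3, 4, 5, 10 this
gives the same six fractions as the Hive example in the issue
description. A real sketch returns approximations of these values.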

Note that ds_kll_pmf() should return an array of doubles, but that has
to wait for complex type support. Until then, we provide
ds_kll_pmf_as_string(), which can be deprecated once arrays can be
returned. The tracking Jira for returning complex types from functions
is IMPALA-9520.

Example:
select ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10)
from alltypes;
+----------------------------------------------------------+
| ds_kll_pmf_as_string(ds_kll_sketch(float_col), 2, 4, 10) |
+----------------------------------------------------------+
| 0.202192,0.199452,0.598356,0                             |
+----------------------------------------------------------+
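
Because the result is a string rather than an array, a client has to
split it before using the individual values. The following is a
minimal client-side sketch in Python, assuming the comma-separated
format shown above. The helper name parse_pmf_string is hypothetical
and not part of Impala.

```python
import math


def parse_pmf_string(pmf_string, split_points):
    """Pair each fraction in the string with the range it describes.

    Returns a list of ((lower, upper), fraction) tuples, where lower is
    inclusive and upper is exclusive.
    """
    fractions = [float(item) for item in pmf_string.split(",")]
    if len(fractions) != len(split_points) + 1:
        raise ValueError("expected one more fraction than split points")

    # n split points define n + 1 ranges bounded by -inf and +inf.
    bounds = [-math.inf, *split_points, math.inf]
    return [
        ((bounds[i], bounds[i + 1]), fractions[i])
        for i in range(len(fractions))
    ]


if __name__ == "__main__":
    result = "0.202192,0.199452,0.598356,0"
    for (lower, upper), fraction in parse_pmf_string(result, [2, 4, 10]):
        print(f"[{lower}, {upper}): {fraction}")
```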

Change-Id: I222402f2dce2f49ab2b3f6e81a709da5539293ba
Reviewed-on: http://gerrit.cloudera.org:8080/16336
Reviewed-by: Gabor Kaszab <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Implement ds_kll_pmf() function
> -------------------------------
>
>                 Key: IMPALA-10019
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10019
>             Project: IMPALA
>          Issue Type: New Feature
>            Reporter: Gabor Kaszab
>            Assignee: Gabor Kaszab
>            Priority: Major
>             Fix For: Impala 4.0
>
>
> Requirements for ds_kll_pmf() (Probability Mass Function):
>  - Receives a serialized KLL sketch in BINARY type (in Impala it can be 
> STRING until BINARY is supported) as the first parameter.
>  - Receives one or more float values that split the sketched data into ranges.
>  - In Hive the return type is an array of doubles. However, Impala can't 
> return complex types from functions at this point, so we have to find an 
> alternative approach to implement this function. Follow whatever solution 
> comes out of https://issues.apache.org/jira/browse/IMPALA-9962
> An example:
> {code:java}
> select ds_kll_pmf(sketch_col, 1, 2, 3, 4) from sketches_table;
> {code}
> This will generate the following ranges: (-inf, 1), [1,2), [2,3), [3,4), 
> [4,+inf)
> In Hive, the result would be an array of 5 doubles for the 5 ranges, where 
> each number is a probability in [0,1] that an item falls into the 
> particular range. In other words, it is the ratio of items belonging to 
> that range.
> For example, with the input values 1, 2, 3, 4, 5:
> {code:java}
> select ds_kll_pmf(f, 1, 3, 4, 5, 10) from kll_sketches;
> +----------------------------+
> |            _c0             |
> +----------------------------+
> | [0.0,0.4,0.2,0.2,0.2,0.0]  |
> +----------------------------+
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
