[jira] [Commented] (HIVE-12094) nDV of aggregate columns tend to be log scale - not unique

Pengcheng Xiong (JIRA) Mon, 12 Oct 2015 10:08:34 -0700

    [ https://issues.apache.org/jira/browse/HIVE-12094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14953387#comment-14953387 ]


Pengcheng Xiong commented on HIVE-12094:
----------------------------------------

[~gopalv], yes, the stats estimation for aggregation is very rough right now.

> nDV of aggregate columns tend to be log scale - not unique
> ----------------------------------------------------------
>
>                 Key: HIVE-12094
>                 URL: https://issues.apache.org/jira/browse/HIVE-12094
>             Project: Hive
>          Issue Type: Improvement
>          Components: Statistics
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Gopal V
>
> Stats for aggregate columns are not processed properly if declared as a
> simple nDV. For example,
> {code}
> select count(distinct l_suppkey)
> from lineitem
> group by l_orderkey
> having count(distinct l_suppkey) = 1
> {code}
> will mis-estimate the cardinality of the output by a significant margin.
> The log-scale of the nDV in general skews towards a very low number, which is 
> not accounted for in the StatsRulesProcFactory.
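
To make the skew concrete, here is a small Python sketch (not Hive code) on toy data. The data shape is hypothetical, and the rows / nDV formula for an equality filter is the usual textbook estimate, assumed here for illustration rather than taken from StatsRulesProcFactory. It compares the estimate obtained when the aggregate column is assumed unique with the estimate obtained from its real, much smaller nDV.

```python
from collections import defaultdict


def count_distinct_per_group(rows):
    """Map each group key to the number of distinct values seen for it."""
    seen = defaultdict(set)
    for key, value in rows:
        seen[key].add(value)
    return {key: len(values) for key, values in seen.items()}


# Hypothetical toy data: 1000 orders, order k has (k % 7) + 1 distinct suppliers.
rows = [(order, supp) for order in range(1000) for supp in range(order % 7 + 1)]

agg = count_distinct_per_group(rows)        # one output row per l_orderkey
num_groups = len(agg)
ndv_of_aggregate = len(set(agg.values()))   # distinct values of the aggregate column

# Rows that survive "having count(distinct l_suppkey) = 1".
actual_rows = sum(1 for n in agg.values() if n == 1)

# Assumed textbook estimate for an equality filter: rows / nDV.
estimate_if_unique = num_groups / num_groups           # nDV taken to equal the row count
estimate_with_real_ndv = num_groups / ndv_of_aggregate

print(num_groups, ndv_of_aggregate, actual_rows)
print(estimate_if_unique, round(estimate_with_real_ndv, 1))
```

On this toy data the aggregate column has only 7 distinct values across 1000 groups, so treating it as unique underestimates the filter output by roughly two orders of magnitude.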



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
