[ 
https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644840#comment-13644840
 ] 

Shreepadma Venugopalan commented on HIVE-4435:
----------------------------------------------

The fix is to use pairwise-independent hash functions. For background on 
pairwise independence and families of hash functions, see:
http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf
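
As a rough illustration (this is not the code in the attached patch), one
standard pairwise-independent family is h(x) = (a*x + b) mod p, with p prime
and a, b drawn uniformly at random from {0, ..., p-1}. The sketch below is in
Python for brevity; the names P, make_hash and trailing_zeros, and the choice
of p = 2^61 - 1, are assumptions made for the example only. It hashes a
sequential key range and records the trailing-zero counts that a
Flajolet-Martin style estimator would look at.

```python
import random

# Mersenne prime 2^61 - 1 (assumed to exceed every key that will be hashed).
P = (1 << 61) - 1


def make_hash(rng):
    """Draw one function h(x) = (a*x + b) mod P from the family."""
    a = rng.randrange(P)  # uniform in [0, P)
    b = rng.randrange(P)  # uniform in [0, P)
    return lambda x: (a * x + b) % P


def trailing_zeros(v, width=61):
    """Number of trailing zero bits in v; returns width when v == 0."""
    if v == 0:
        return width
    # v & -v isolates the lowest set bit of v.
    return (v & -v).bit_length() - 1


rng = random.Random(42)  # fixed seed only so the sketch is repeatable
h = make_hash(rng)

# Sequential keys, similar to a monotonically increasing primary key column.
max_tz = max(trailing_zeros(h(k)) for k in range(1, 100001))
```

Drawing a fresh (a, b) pair for each bitmap gives independent hash functions
from the same family, so sequential inputs no longer map to correlated bit
patterns across bitmaps.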
                
> Column stats: Distinct value estimator should use hash functions that are 
> pairwise independent
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-4435
>                 URL: https://issues.apache.org/jira/browse/HIVE-4435
>             Project: Hive
>          Issue Type: Bug
>          Components: Statistics
>    Affects Versions: 0.10.0
>            Reporter: Shreepadma Venugopalan
>            Assignee: Shreepadma Venugopalan
>         Attachments: HIVE-4435.1.patch
>
>
> The current implementation of the Flajolet-Martin estimator for the number 
> of distinct values doesn't use pairwise-independent hash functions. This is 
> problematic because the input values are not uniformly distributed. On 
> large TPC-H data sets, this leads to a huge discrepancy between the 
> estimated and actual distinct counts for primary key columns, which are 
> typically monotonically increasing sequences.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
