[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-06-07 Thread Hudson (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13677850#comment-13677850 ]

Hudson commented on HIVE-4435:
--

Integrated in Hive-trunk-h0.21 #2132 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2132/])
HIVE-4435 : Column stats: Distinct value estimator should use hash functions that are pairwise independent (Shreepadma Venugopalan via Ashutosh Chauhan) (Revision 1490323)

Result = FAILURE
hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1490323
Files : 
* /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java
* /hive/trunk/ql/src/test/results/clientpositive/compute_stats_double.q.out
* /hive/trunk/ql/src/test/results/clientpositive/compute_stats_long.q.out
* /hive/trunk/ql/src/test/results/clientpositive/compute_stats_string.q.out


 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0, 0.11.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Fix For: 0.12.0

 Attachments: chart_1(1).png, HIVE-4435.1.patch, HIVE-4435.2.patch


 The current implementation of the Flajolet-Martin estimator for the number 
 of distinct values doesn't use hash functions that are pairwise 
 independent. This is problematic because the input values are not 
 uniformly distributed. When run on large TPC-H data sets, this leads to a 
 huge discrepancy between the estimated and actual number of distinct values 
 for primary key columns, which are typically monotonically increasing 
 sequences.
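
For readers unfamiliar with the estimator named above, here is a minimal single-bitmap Flajolet-Martin sketch. It is an illustration only, not the code in NumDistinctValueEstimator.java: the class name FmSketch, the caller-supplied hash parameter, and the use of a single 64-bit bitmap are all assumptions made for this example. Each value is hashed, the number of trailing zero bits of the hash selects one bit of the bitmap, and the estimate is derived from the lowest bit that is still unset. The quality of the estimate therefore depends entirely on how well the hash function spreads the inputs.

```java
import java.util.function.LongUnaryOperator;

/** Hypothetical single-bitmap Flajolet-Martin sketch; not Hive's implementation. */
final class FmSketch {
    // Correction factor phi from Flajolet and Martin's analysis.
    private static final double PHI = 0.77351;

    private final LongUnaryOperator hash; // caller-supplied hash function (assumption)
    private long bitmap;                  // bit i is set once some hash has exactly i trailing zeros

    FmSketch(LongUnaryOperator hash) {
        this.hash = hash;
    }

    void add(long value) {
        long h = hash.applyAsLong(value);
        // numberOfTrailingZeros(0) is 64, so clamp to keep the shift in range.
        int rho = Math.min(Long.numberOfTrailingZeros(h), 63);
        bitmap |= 1L << rho;
    }

    double estimate() {
        // R is the index of the lowest bit that is still zero.
        int r = Long.numberOfTrailingZeros(~bitmap);
        return Math.pow(2.0, r) / PHI;
    }
}
```

Adding the same value twice sets the same bit, so duplicates never change the estimate. A single bitmap is very noisy; practical estimators combine many bitmaps to reduce the error.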

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-06-05 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13676365#comment-13676365 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

[~ashutoshc]: I've updated the .q files in the patches. Thanks!

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0, 0.11.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch, HIVE-4435.2.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-06-03 Thread Ashutosh Chauhan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673595#comment-13673595 ]

Ashutosh Chauhan commented on HIVE-4435:


Sorry for the delay. +1 Will commit if tests pass.

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-06-03 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673599#comment-13673599 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

Thanks Ashutosh!

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-06-03 Thread Ashutosh Chauhan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673986#comment-13673986 ]

Ashutosh Chauhan commented on HIVE-4435:


Following tests failed: 
* compute_stats_double.q
* compute_stats_long.q
* compute_stats_string.q

I assume that, since we now have better estimates, we just need to update the 
.q.out files for these. Can you verify and, if so, update the patch accordingly?

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-05-28 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668883#comment-13668883 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

Ping :)

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-05-03 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648987#comment-13648987 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

Can a committer take a look at this?

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-04-29 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644840#comment-13644840 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

The fix is to use hash functions that are pairwise independent. More on 
pairwise independence and families of hash functions: 
http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf
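
As a rough illustration of what such a family looks like, the textbook construction is h(x) = (a*x + b) mod p, with p prime and a, b drawn uniformly at random from [0, p). The sketch below is an assumption-laden example and not the code in the attached patch: the class name PairwiseHash, the choice of the Mersenne prime 2^31 - 1, and the reduction of arbitrary long keys into [0, p) are all made up for this illustration. Pairwise independence holds for keys that are distinct modulo p.

```java
import java.util.Random;

/** Hypothetical helper: one member h(x) = (a*x + b) mod p of a pairwise-independent family. */
final class PairwiseHash {
    // Mersenne prime 2^31 - 1 (an assumption of this sketch).
    static final long P = 2147483647L;

    final long a; // multiplier drawn uniformly from [0, P)
    final long b; // offset drawn uniformly from [0, P)

    PairwiseHash(Random rng) {
        this.a = rng.nextInt((int) P);
        this.b = rng.nextInt((int) P);
    }

    long hash(long key) {
        long x = Math.floorMod(key, P); // map any long, including negatives, into [0, P)
        return (a * x + b) % P;         // a*x + b < 2^62 + 2^31, so the long cannot overflow
    }
}
```

An estimator that keeps several bitmaps would draw one (a, b) pair per bitmap, for example `new PairwiseHash(new Random(seed))`, so that each bitmap sees an independently chosen member of the family.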

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-04-29 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644844#comment-13644844 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

review board: https://reviews.apache.org/r/10841/

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: HIVE-4435.1.patch




[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent

2013-04-29 Thread Shreepadma Venugopalan (JIRA)

[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13644850#comment-13644850 ]

Shreepadma Venugopalan commented on HIVE-4435:
--

Attached a plot of relative error vs. number of distinct values after the fix. 
Dataset: TPC-H of varying sizes up to 10TB
hive.stats.ndv.error = 5% (standard error for the estimator)
Column types: String, Long, Double
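
To give a feel for what a 5% standard error costs in sketch size, the example below uses the classical result for Flajolet-Martin with stochastic averaging (PCSA), where the standard error is approximately 0.78 / sqrt(m) for m bitmaps. This is a back-of-the-envelope sketch only: the class and method names are hypothetical, and how Hive actually maps hive.stats.ndv.error to its internal number of bit vectors is not stated in this thread and may differ.

```java
/** Hypothetical helper relating a target standard error to a PCSA bitmap count. */
final class NdvErrorBudget {
    // Constant from Flajolet and Martin's PCSA analysis: stderr ~ 0.78 / sqrt(m).
    static final double PCSA_CONSTANT = 0.78;

    /** Smallest m with 0.78 / sqrt(m) <= standardError, under the PCSA approximation. */
    static int bitmapsFor(double standardError) {
        if (!(standardError > 0.0)) {
            throw new IllegalArgumentException("standardError must be positive");
        }
        double root = PCSA_CONSTANT / standardError;
        return (int) Math.ceil(root * root);
    }
}
```

Under that approximation, `NdvErrorBudget.bitmapsFor(0.05)` returns 244, since (0.78 / 0.05)^2 = 243.36.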

 Column stats: Distinct value estimator should use hash functions that are 
 pairwise independent
 --

 Key: HIVE-4435
 URL: https://issues.apache.org/jira/browse/HIVE-4435
 Project: Hive
  Issue Type: Bug
  Components: Statistics
Affects Versions: 0.10.0
Reporter: Shreepadma Venugopalan
Assignee: Shreepadma Venugopalan
 Attachments: chart_1(1).png, HIVE-4435.1.patch

