[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13677850#comment-13677850 ] Hudson commented on HIVE-4435: -- Integrated in Hive-trunk-h0.21 #2132 (See [https://builds.apache.org/job/Hive-trunk-h0.21/2132/]) HIVE-4435 : Column stats: Distinct value estimator should use hash functions that are pairwise independent (Shreepadma Venugopalan via Ashutosh Chauhan) (Revision 1490323) Result = FAILURE hashutosh : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVNview=revrev=1490323 Files : * /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/NumDistinctValueEstimator.java * /hive/trunk/ql/src/test/results/clientpositive/compute_stats_double.q.out * /hive/trunk/ql/src/test/results/clientpositive/compute_stats_long.q.out * /hive/trunk/ql/src/test/results/clientpositive/compute_stats_string.q.out Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0, 0.11.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Fix For: 0.12.0 Attachments: chart_1(1).png, HIVE-4435.1.patch, HIVE-4435.2.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13676365#comment-13676365 ] Shreepadma Venugopalan commented on HIVE-4435: -- [~ashutoshc]: I've updated the .q files in the patches. Thanks! Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0, 0.11.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch, HIVE-4435.2.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673595#comment-13673595 ] Ashutosh Chauhan commented on HIVE-4435: Sorry for the delay. +1 Will commit if tests pass. Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673599#comment-13673599 ] Shreepadma Venugopalan commented on HIVE-4435: -- Thanks Ashutosh! Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13673986#comment-13673986 ] Ashutosh Chauhan commented on HIVE-4435: Following tests failed: * compute_stats_double.q * compute_stats_long.q * compute_stats_string.q I am assuming since we have better estimates now, we just need to update .q.out files for these. Can you verify and if so can you update the patch with it? Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13668883#comment-13668883 ] Shreepadma Venugopalan commented on HIVE-4435: -- Ping :) Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13648987#comment-13648987 ] Shreepadma Venugopalan commented on HIVE-4435: -- Can a committer take a look at this? Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644840#comment-13644840 ] Shreepadma Venugopalan commented on HIVE-4435: -- The fix is to use hash functions that are pairwise independent. More on pairwise independence and family of hash functions - http://people.csail.mit.edu/ronitt/COURSE/S12/handouts/lec5.pdf Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644844#comment-13644844 ] Shreepadma Venugopalan commented on HIVE-4435: -- review board: https://reviews.apache.org/r/10841/ Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (HIVE-4435) Column stats: Distinct value estimator should use hash functions that are pairwise independent
[ https://issues.apache.org/jira/browse/HIVE-4435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13644850#comment-13644850 ] Shreepadma Venugopalan commented on HIVE-4435: -- Attached plot of relative error vs. number of distinct values after the fix. Dataset: TPC-H of varying sizes up to 10TB hive.stats.ndv.error = 5% (standard error for the estimator) Column types: String, Long, Double Column stats: Distinct value estimator should use hash functions that are pairwise independent -- Key: HIVE-4435 URL: https://issues.apache.org/jira/browse/HIVE-4435 Project: Hive Issue Type: Bug Components: Statistics Affects Versions: 0.10.0 Reporter: Shreepadma Venugopalan Assignee: Shreepadma Venugopalan Attachments: chart_1(1).png, HIVE-4435.1.patch The current implementation of Flajolet-Martin estimator to estimate the number of distinct values doesn't use hash functions that are pairwise independent. This is problematic because the input values don't distribute uniformly. When run on large TPC-H data sets, this leads to a huge discrepancy for primary key columns. Primary key columns are typically a monotonically increasing sequence. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira