Aman Sinha created DRILL-7228:
---------------------------------
Summary: Histogram end points show high deviation for a sample
data set
Key: DRILL-7228
URL: https://issues.apache.org/jira/browse/DRILL-7228
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.16.0
Reporter: Aman Sinha
Assignee: Aman Sinha
Fix For: 1.17.0
There are couple of scenarios where the histogram bucket end points show high
deviation for the attached sample data set.
+Scenario 1: +
There are total 100 rows in the sample. Here are the first 10 values of the
c_float column ordered by the column.
{noformat}
select c_float from `table_stats/alltypes_with_nulls` order by c_float;
+--------------+
| c_float |
+--------------+
| -4.6873795E9 |
| 8.1855632E7 |
| 2.65311632E8 |
| 4.50677952E8 |
| 4.6864464E8 |
| 5.7848493E8 |
| 6.6793114E8 |
| 7.1175571E8 |
| 9.0065581E8 |
| 9.2245773E8 |
...
...
<100 rows>
{noformat}
Here the minimum value is a small negative number. Here's the output of the
histogram after running ANALYZE command:
{noformat}
"buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9,
4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9,
8.691596143999998E9, 9.983783792E9 ]
{noformat}
Note that the starting end point of bucket 0 is actually the 2nd value in the
ordered list and the small negative number is not represented in the histogram
at all.
+Scenario 2:+
Histogram for the c_bigint column is as below:
{noformat}
{
"column" : "`c_bigint`",
"majortype" : {
"type" : "BIGINT",
"mode" : "OPTIONAL"
},
"schema" : 1.0,
"rowcount" : 100.0,
"nonnullrowcount" : 87.0,
"ndv" : 46,
"avgwidth" : 8.0,
"histogram" : {
"category" : "numeric-equi-depth",
"numRowsPerBucket" : 8,
"buckets" : [ -8.6390506354062131E18, -7.679478802017577E18,
-5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18,
2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11,
8.848383132345303E17, 4.6441480083157811E18 ]
}
}
{noformat}
This indicates that there are duplicate rows with the value close to 2.83 which
is not true when we analyze the source data.
This is the output of the ntile function:
{noformat}
SELECT bucket_num,
min(c_bigint) as min_amount,
max(c_bigint) as max_amount,
count(*) as total_count
FROM (
SELECT c_bigint,
NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
FROM `table_stats/alltypes_with_nulls`
)
GROUP BY bucket_num
ORDER BY bucket_num;
+------------+----------------------+----------------------+-------------+
| bucket_num | min_amount | max_amount | total_count |
+------------+----------------------+----------------------+-------------+
| 1 | -8804872880253829120 | -6983033704176156672 | 10 |
| 2 | -6772904422084182016 | -5326061597989273600 | 10 |
| 3 | -5111449881868763136 | -2561061038367703040 | 10 |
| 4 | -2424523650070740992 | -449093763428515840 | 10 |
| 5 | 0 | 0 | 10 |
| 6 | 0 | 0 | 10 |
| 7 | 0 | 0 | 10 |
| 8 | 0 | 884838034226544640 | 10 |
| 9 | 884838034226544640 | 4644147690488201216 | 10 |
| 10 | null | null | 10 |
+------------+----------------------+----------------------+-------------+
{noformat}
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)