Aman Sinha created DRILL-7228:
---------------------------------

             Summary: Histogram end points show high deviation for a sample 
data set
                 Key: DRILL-7228
                 URL: https://issues.apache.org/jira/browse/DRILL-7228
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.16.0
            Reporter: Aman Sinha
            Assignee: Aman Sinha
             Fix For: 1.17.0


There are couple of scenarios where the histogram bucket end points show high 
deviation for the attached sample data set. 

+Scenario 1: +
There are total 100 rows in the sample. Here are the first 10 values of the 
c_float column ordered by the column.  

{noformat}
select c_float from `table_stats/alltypes_with_nulls` order by c_float;
+--------------+
|   c_float    |
+--------------+
| -4.6873795E9 |
| 8.1855632E7  |
| 2.65311632E8 |
| 4.50677952E8 |
| 4.6864464E8  |
| 5.7848493E8  |
| 6.6793114E8  |
| 7.1175571E8  |
| 9.0065581E8  |
| 9.2245773E8  |
...
...
<100 rows>
{noformat}

Here the minimum value is a small negative number.  Here's the output of the 
histogram after running ANALYZE command: 

{noformat}
 "buckets" : [ 8.1855488E7, 9.13736816E8, 1.7208630111999998E9, 3.2401755232E9, 
4.6546719328E9, 5.130497904E9, 5.9901393504E9, 6.779930992E9, 7.998626672E9, 
8.691596143999998E9, 9.983783792E9 ]
{noformat}

Note that the starting end point of bucket 0 is actually the 2nd value in the 
ordered list and the small negative number is not represented in the histogram 
at all. 

+Scenario 2:+
Histogram for the c_bigint column is as below: 
{noformat}
 {
      "column" : "`c_bigint`",
      "majortype" : {
        "type" : "BIGINT",
        "mode" : "OPTIONAL"
      },
      "schema" : 1.0,
      "rowcount" : 100.0,
      "nonnullrowcount" : 87.0,
      "ndv" : 46,
      "avgwidth" : 8.0,
      "histogram" : {
        "category" : "numeric-equi-depth",
        "numRowsPerBucket" : 8,
        "buckets" : [ -8.6390506354062131E18, -7.679478802017577E18, 
-5.8389791200382024E18, -2.9165328693138038E18, -1.77746633649836621E18, 
2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 2.83467841536E11, 
8.848383132345303E17, 4.6441480083157811E18 ]
      }
    }
{noformat}

This indicates that there are duplicate rows with the value close to 2.83 which 
is not true when we analyze the source data.
This is the output of the ntile function:

{noformat}
SELECT bucket_num,
 min(c_bigint) as min_amount,
max(c_bigint) as max_amount,
 count(*) as total_count
   FROM (
     SELECT c_bigint,
         NTILE(10) OVER (ORDER BY c_bigint) as bucket_num
         FROM `table_stats/alltypes_with_nulls`
  )
GROUP BY bucket_num
 ORDER BY bucket_num;
+------------+----------------------+----------------------+-------------+
| bucket_num |      min_amount      |      max_amount      | total_count |
+------------+----------------------+----------------------+-------------+
| 1          | -8804872880253829120 | -6983033704176156672 | 10          |
| 2          | -6772904422084182016 | -5326061597989273600 | 10          |
| 3          | -5111449881868763136 | -2561061038367703040 | 10          |
| 4          | -2424523650070740992 | -449093763428515840  | 10          |
| 5          | 0                    | 0                    | 10          |
| 6          | 0                    | 0                    | 10          |
| 7          | 0                    | 0                    | 10          |
| 8          | 0                    | 884838034226544640   | 10          |
| 9          | 884838034226544640   | 4644147690488201216  | 10          |
| 10         | null                 | null                 | 10          |
+------------+----------------------+----------------------+-------------+
{noformat}





--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to