[jira] [Comment Edited] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)

Aman Sinha (JIRA) Sat, 23 Mar 2019 11:17:25 -0700


    [ 
https://issues.apache.org/jira/browse/DRILL-7117?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799742#comment-16799742
 ]


Aman Sinha edited comment on DRILL-7117 at 3/23/19 6:16 PM:
------------------------------------------------------------

I have a work-in-progress implementation and initial results for the histogram 
using the proposed design (see design doc in parent JIRA) are encouraging.  I 
analyzed a few columns from the TPC-DS table Web_Sales and in particular for a 
skewed column ws_ext_tax  I compared the output of the NTILE() function with 
what is generated by the histogram implementation.  They are very close. 

Here's the NTILE() output (this is the baseline): 
{noformat}

select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as 
count
    from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x  from 
dfs.tmp.web_sales1)
 group by x order by x;

|ntile  | min   |   max    | count  |
| 1      | 0.0     | 0.0      | 71939  |
| 2      | 0.0     | 3.05     | 71939  |
| 3      | 3.05    | 9.28     | 71939  |
| 4      | 9.28    | 19.24    | 71939  |
| 5      | 19.24   | 34.51    | 71938  |
| 6      | 34.51   | 57.96    | 71938  |
| 7      | 57.96   | 95.01    | 71938  |
| 8      | 95.01   | 158.71   | 71938  |
| 9      | 158.71  | 292.44   | 71938  |
| 10     | 292.44  | 2433.41  | 71938  |

{noformat}

Here's the Histogram output after running ANALYZE and using 10 buckets: 

{noformat}
{
      "column" : "`ws_ext_tax`",
      "majortype" : {
        "type" : "FLOAT4",
        "mode" : "OPTIONAL"
      },
      "schema" : 1.0,
      "rowcount" : 719384.0,
      "nonnullrowcount" : 719200.0,
      "ndv" : 68400,
      "avgwidth" : 4.0,
      "histogram" : {
        "category" : "numeric-histogram",
        "numRowsPerBucket" : 71920,
        "buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 
19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 
158.88263313484484, 289.2295684860717, 24
33.409910105169 ]
      }
    }
{noformat}

Note that the bucket boundaries closely align with the baseline and there's 
non-uniformity in the data distribution. 


was (Author: amansinha100):
I have a work-in-progress implementation and initial results for the histogram 
using the proposed design (see design doc in parent JIRA) are encouraging.  I 
analyzed a few columns from the TPC-DS table Web_Sales and in particular for a 
skewed column ws_ext_tax  I compared the output of the NTILE() function with 
what is generated by the histogram implementation.  They are very close. 

Here's the NTILE() output (this is the baseline): 
{noformat}

||Heading 1||Heading 2||
|Col A1|Col A2|

select x as ntile, min(ws_ext_tax) as min, max(ws_ext_tax) as max, count(*) as 
count
    from (select ws_ext_tax, ntile(10) over (order by ws_ext_tax) as x  from 
dfs.tmp.web_sales1)
 group by x order by x;

||ntile  || min   ||   max    || count  ||
| 1      | 0.0     | 0.0      | 71939  |
| 2      | 0.0     | 3.05     | 71939  |
| 3      | 3.05    | 9.28     | 71939  |
| 4      | 9.28    | 19.24    | 71939  |
| 5      | 19.24   | 34.51    | 71938  |
| 6      | 34.51   | 57.96    | 71938  |
| 7      | 57.96   | 95.01    | 71938  |
| 8      | 95.01   | 158.71   | 71938  |
| 9      | 158.71  | 292.44   | 71938  |
| 10     | 292.44  | 2433.41  | 71938  |

{noformat}

Here's the Histogram output after running ANALYZE and using 10 buckets: 

{noformat}
{
      "column" : "`ws_ext_tax`",
      "majortype" : {
        "type" : "FLOAT4",
        "mode" : "OPTIONAL"
      },
      "schema" : 1.0,
      "rowcount" : 719384.0,
      "nonnullrowcount" : 719200.0,
      "ndv" : 68400,
      "avgwidth" : 4.0,
      "histogram" : {
        "category" : "numeric-histogram",
        "numRowsPerBucket" : 71920,
        "buckets" : [ 0.0, 0.0, 3.0012382450412556, 9.294927927208075, 
19.451534502867155, 34.26953338960548, 57.30967536502495, 95.25307998787451, 
158.88263313484484, 289.2295684860717, 24
33.409910105169 ]
      }
    }
{noformat}

Note that the bucket boundaries closely align with the baseline and there's 
non-uniformity in the data distribution. 

> Support creation of histograms for numeric data types (except Decimal)
> ----------------------------------------------------------------------
>
>                 Key: DRILL-7117
>                 URL: https://issues.apache.org/jira/browse/DRILL-7117
>             Project: Apache Drill
>          Issue Type: Sub-task
>          Components: Query Planning &amp; Optimization
>            Reporter: Aman Sinha
>            Assignee: Aman Sinha
>            Priority: Major
>             Fix For: 1.16.0
>
>
> This JIRA is specific to creating histograms for numeric data types: INT, 
> BIGINT, FLOAT4, FLOAT8  and their corresponding nullable/non-nullable 
> versions. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Comment Edited] (DRILL-7117) Support creation of histograms for numeric data types (except Decimal)

Reply via email to