[
https://issues.apache.org/jira/browse/TRAFODION-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684163#comment-15684163
]
ASF GitHub Bot commented on TRAFODION-2359:
-------------------------------------------
GitHub user DaveBirdsall opened a pull request:
https://github.com/apache/incubator-trafodion/pull/844
[TRAFODION-2359] Smooth out UEC for last interval when last value is scarce.
Details in the JIRA.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/DaveBirdsall/incubator-trafodion Temp1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/incubator-trafodion/pull/844.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #844
----
commit 876ffec3aa2d36ac657522d46d22c1d1ae26ec3c
Author: Dave Birdsall <[email protected]>
Date: 2016-11-21T17:21:33Z
[TRAFODION-2359] Smooth out UEC for last interval when last value is scarce
----
> Unusually large UEC in last histogram interval for divisioning column
> ---------------------------------------------------------------------
>
> Key: TRAFODION-2359
> URL: https://issues.apache.org/jira/browse/TRAFODION-2359
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: any
> Environment: All
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
>
> On a large table with a divisioning column, it was noticed that occasionally
> UPDATE STATISTICS using sampling produces an unusually large UEC in the last
> interval of the histogram. The circumstances under which this happened were
> that the last divisioning value had just one row and, by luck of the draw,
> that row happened to be included in the sample.
> This can be reproduced by manually creating such a sample (and using CQD
> USTAT_SAMPLE_TABLE_NAME to cause UPDATE STATS to use this sample). For
> example, run the following three scripts:
> Script ddl.sql:
> drop table if exists weird_table;
> CREATE TABLE weird_table
> (
> TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
> DROPPABLE NOT SERIALIZED
> , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
> )
> SALT USING 4 PARTITIONS
> ON (TIME_STAMP)
> DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
> NAMED AS ("_DIVISION_1_"))
> ATTRIBUTES ALIGNED FORMAT
> HBASE_OPTIONS
> (
> DATA_BLOCK_ENCODING = 'FAST_DIFF',
> COMPRESSION = 'GZ',
> MEMSTORE_FLUSH_SIZE = '1073741824'
> )
> ;
> drop table if exists weird_sample;
> CREATE TABLE weird_sample
> (
> TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
> DROPPABLE NOT SERIALIZED
> , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
> NOT SERIALIZED
> , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
> )
> SALT USING 4 PARTITIONS
> ON (TIME_STAMP)
> DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
> NAMED AS ("_DIVISION_1_"))
> ATTRIBUTES ALIGNED FORMAT
> HBASE_OPTIONS
> (
> DATA_BLOCK_ENCODING = 'FAST_DIFF',
> COMPRESSION = 'GZ',
> MEMSTORE_FLUSH_SIZE = '1073741824'
> )
> ;
> Script populate.sql:
> -- populate the table in such a way that the divisioning column
> -- has a few very frequent values but just one infrequent value
> upsert using load into weird_table
> select timestamp '1970-01-01 00:00:00' +
> cast(c0+10*c1+100*c2+1000*c3+10000*c4 as interval minute(8)),
> 3*c0 + 7*c4,
> c4,
> 11*c2 + 2*c3 + 3*c1
> from (values(1)) t
> transpose 0,1,2,3,4,5,6,7,8,9 as c0
> transpose 0,1,2,3,4,5,6,7,8,9 as c1
> transpose 0,1,2,3,4,5,6,7,8,9 as c2
> transpose 0,1,2,3,4,5,6,7,8,9 as c3
> transpose 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 as c4;
> insert into weird_table select max(TIME_STAMP) + interval '1' day, 999, 999,
> 999 from weird_table;
> upsert using load into weird_sample select * from weird_table sample random
> 10 percent;
> upsert into weird_sample select * from weird_table where time_stamp = (select
> max(time_stamp) from weird_table);
> Script repro.sql:
> cqd USTAT_SAMPLE_TABLE_NAME 'trafodion.sch.weird_sample' ;
> -- below, use the number of rows in weird_sample
> update statistics for table weird_table on "_DIVISION_1_" sample 20080 rows;
> showstats for table weird_table on "_DIVISION_1_" detail;
> In this particular example, the "showstats" output showed a UEC of 9 for the
> last interval. More dramatic values can be obtained by increasing the table
> size relative to the sample size; in the examples I have seen, the UEC of the
> last interval seems to be about equal to the ratio of table size to sample
> size.
>
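The reported ratio follows from simple extrapolation arithmetic. The sketch below is a hypothetical illustration, not Trafodion's actual UEC estimator: it applies a naive linear scale-up to the last interval, which contains one distinct value sampled exactly once, and reproduces the "UEC ≈ table size / sample size" behavior described above. The row counts are derived from the populate.sql and repro.sql scripts in the description.

```python
# Assumption: this is NOT the actual UPDATE STATISTICS estimator, just a
# naive linear scale-up showing why the last interval's UEC inflates.

# Rows produced by populate.sql: the transpose clauses cross
# 10 * 10 * 10 * 10 * 20 combinations, plus one extra "scarce" row
# inserted a day past the maximum timestamp.
table_rows = 10 * 10 * 10 * 10 * 20 + 1   # 200,001 rows
sample_rows = 20_080                       # sample size used in repro.sql

def scaled_uec(sample_uec: int, table_rows: int, sample_rows: int) -> float:
    """Naive linear scale-up: assume unseen distinct values occur in
    proportion to the sampling rate. This badly inflates the UEC of a
    value that occurs only once in both the table and the sample."""
    return sample_uec * table_rows / sample_rows

# The last histogram interval holds one distinct value, sampled once.
est = scaled_uec(1, table_rows, sample_rows)
print(round(est))   # about 10: roughly table size / sample size
```

This is why the inflation scales with the sampling ratio: a singleton value contributes a sample UEC of 1, and any estimator that scales linearly by table_rows / sample_rows turns that 1 into the sampling ratio.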
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)