[ 
https://issues.apache.org/jira/browse/TRAFODION-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15684157#comment-15684157
 ] 

David Wayne Birdsall commented on TRAFODION-2359:
-------------------------------------------------

I implemented the solution suggested above and it seems to work well.

> Unusually large UEC in last histogram interval for divisioning column
> ---------------------------------------------------------------------
>
>                 Key: TRAFODION-2359
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2359
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: any
>         Environment: All
>            Reporter: David Wayne Birdsall
>            Assignee: David Wayne Birdsall
>
> On a large table with a divisioning column, it was noticed that occasionally 
> UPDATE STATISTICS using sampling produces an unusually large UEC in the last 
> interval of the histogram. The circumstances under which this happened was 
> that the last divisioning value had just one row and, by luck of the draw, 
> that row happened to be included in the sample.
> This can be reproduced by manually creating such a sample (and using CQD 
> USTAT_SAMPLE_TABLE_NAME to cause UPDATE STATS to use this sample). For 
> example, run the following three scripts:
> Script ddl.sql:
> drop table if exists weird_table;
> CREATE TABLE weird_table
>   (
>     TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
>       DROPPABLE NOT SERIALIZED
>   , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
>   )
>   SALT USING 4 PARTITIONS
>        ON (TIME_STAMP)
>   DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
>      NAMED AS ("_DIVISION_1_"))
>  ATTRIBUTES ALIGNED FORMAT
>   HBASE_OPTIONS
>   (
>     DATA_BLOCK_ENCODING = 'FAST_DIFF',
>     COMPRESSION = 'GZ',
>     MEMSTORE_FLUSH_SIZE = '1073741824'
>   )
> ;
> drop table if exists weird_sample;
> CREATE TABLE weird_sample
>   (
>     TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
>       DROPPABLE NOT SERIALIZED
>   , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
>       NOT SERIALIZED
>   , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
>   )
>   SALT USING 4 PARTITIONS
>        ON (TIME_STAMP)
>   DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
>      NAMED AS ("_DIVISION_1_"))
>  ATTRIBUTES ALIGNED FORMAT
>   HBASE_OPTIONS
>   (
>     DATA_BLOCK_ENCODING = 'FAST_DIFF',
>     COMPRESSION = 'GZ',
>     MEMSTORE_FLUSH_SIZE = '1073741824'
>   )
> ;
> Script populate.sql:
> -- populate the table in such a way that the divisioning column
> -- has a few very frequent values but just one infrequent value
> upsert using load into weird_table
> select timestamp '1970-01-01 00:00:00' + 
> cast(c0+10*c1+100*c2+1000*c3+10000*c4 as interval minute(8)),
>        3*c0 + 7*c4,
>        c4,
>        11*c2 + 2*c3 + 3*c1
> from (values(1)) t
> transpose 0,1,2,3,4,5,6,7,8,9 as c0
> transpose 0,1,2,3,4,5,6,7,8,9 as c1
> transpose 0,1,2,3,4,5,6,7,8,9 as c2
> transpose 0,1,2,3,4,5,6,7,8,9 as c3
> transpose 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 as c4;
> insert into weird_table select max(TIME_STAMP) + interval '1' day, 999, 999, 
> 999 from weird_table;
> upsert using load into weird_sample select * from weird_table sample random 
> 10 percent;
> upsert into weird_sample select * from weird_table where time_stamp = (select 
> max(time_stamp) from weird_table);
> Script repro.sql:
> cqd USTAT_SAMPLE_TABLE_NAME 'trafodion.sch.weird_sample' ; 
> -- below, use the number of rows in weird_sample
> update statistics for table weird_table on "_DIVISION_1_" sample 20080 rows; 
> showstats for table weird_table on "_DIVISION_1_" detail;
> In this particular example, the "showstats" showed a UEC of 9 for the last 
> interval. More dramatic values can be obtained by increasing the table size 
> relative to sample size; from the examples I have seen the UEC of the last 
> interval seems to be about equal to the ratio of table size to sample size.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to