[jira] [Created] (TRAFODION-2359) Unusually large UEC in last histogram interval for divisioning column

David Wayne Birdsall (JIRA) Thu, 17 Nov 2016 10:45:53 -0800

David Wayne Birdsall created TRAFODION-2359:
-----------------------------------------------


             Summary: Unusually large UEC in last histogram interval for 
divisioning column
                 Key: TRAFODION-2359
                 URL: https://issues.apache.org/jira/browse/TRAFODION-2359
             Project: Apache Trafodion
          Issue Type: Bug
          Components: sql-cmp
    Affects Versions: any
         Environment: All
            Reporter: David Wayne Birdsall
            Assignee: David Wayne Birdsall


On a large table with a divisioning column, it was noticed that occasionally 
UPDATE STATISTICS using sampling produces an unusually large UEC in the last 
interval of the histogram. The circumstances under which this happened was that 
the last divisioning value had just one row and, by luck of the draw, that row 
happened to be included in the sample.

This can be reproduced by manually creating such a sample (and using CQD 
USTAT_SAMPLE_TABLE_NAME to cause UPDATE STATS to use this sample). For example, 
run the following three scripts:

Script ddl.sql:


drop table if exists weird_table;

CREATE TABLE weird_table
  (
    TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
      DROPPABLE NOT SERIALIZED
  , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
  )
  SALT USING 4 PARTITIONS
       ON (TIME_STAMP)
  DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
     NAMED AS ("_DIVISION_1_"))
 ATTRIBUTES ALIGNED FORMAT
  HBASE_OPTIONS
  (
    DATA_BLOCK_ENCODING = 'FAST_DIFF',
    COMPRESSION = 'GZ',
    MEMSTORE_FLUSH_SIZE = '1073741824'
  )
;

drop table if exists weird_sample;

CREATE TABLE weird_sample
  (
    TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
      DROPPABLE NOT SERIALIZED
  , GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
      NOT SERIALIZED
  , PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
  )
  SALT USING 4 PARTITIONS
       ON (TIME_STAMP)
  DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
     NAMED AS ("_DIVISION_1_"))
 ATTRIBUTES ALIGNED FORMAT
  HBASE_OPTIONS
  (
    DATA_BLOCK_ENCODING = 'FAST_DIFF',
    COMPRESSION = 'GZ',
    MEMSTORE_FLUSH_SIZE = '1073741824'
  )
;

Script populate.sql:

-- populate the table in such a way that the divisioning column
-- has a few very frequent values but just one infrequent value

upsert using load into weird_table
select timestamp '1970-01-01 00:00:00' + cast(c0+10*c1+100*c2+1000*c3+10000*c4 
as interval minute(8)),
       3*c0 + 7*c4,
       c4,
       11*c2 + 2*c3 + 3*c1
from (values(1)) t
transpose 0,1,2,3,4,5,6,7,8,9 as c0
transpose 0,1,2,3,4,5,6,7,8,9 as c1
transpose 0,1,2,3,4,5,6,7,8,9 as c2
transpose 0,1,2,3,4,5,6,7,8,9 as c3
transpose 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 as c4;

insert into weird_table select max(TIME_STAMP) + interval '1' day, 999, 999, 
999 from weird_table;

upsert using load into weird_sample select * from weird_table sample random 10 
percent;

upsert into weird_sample select * from weird_table where time_stamp = (select 
max(time_stamp) from weird_table);

Script repro.sql:

cqd USTAT_SAMPLE_TABLE_NAME 'trafodion.sch.weird_sample' ; 

-- below, use the number of rows in weird_sample
update statistics for table weird_table on "_DIVISION_1_" sample 20080 rows; 

showstats for table weird_table on "_DIVISION_1_" detail;

In this particular example, the "showstats" showed a UEC of 9 for the last 
interval. More dramatic values can be obtained by increasing the table size 
relative to sample size; from the examples I have seen the UEC of the last 
interval seems to be about equal to the ratio of table size to sample size.



 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Created] (TRAFODION-2359) Unusually large UEC in last histogram interval for divisioning column

Reply via email to