David Wayne Birdsall created TRAFODION-2359:
-----------------------------------------------
Summary: Unusually large UEC in last histogram interval for
divisioning column
Key: TRAFODION-2359
URL: https://issues.apache.org/jira/browse/TRAFODION-2359
Project: Apache Trafodion
Issue Type: Bug
Components: sql-cmp
Affects Versions: any
Environment: All
Reporter: David Wayne Birdsall
Assignee: David Wayne Birdsall
On a large table with a divisioning column, it was noticed that occasionally
UPDATE STATISTICS using sampling produces an unusually large UEC in the last
interval of the histogram. The circumstances under which this happened was that
the last divisioning value had just one row and, by luck of the draw, that row
happened to be included in the sample.
This can be reproduced by manually creating such a sample (and using CQD
USTAT_SAMPLE_TABLE_NAME to cause UPDATE STATS to use this sample). For example,
run the following three scripts:
Script ddl.sql:
drop table if exists weird_table;
CREATE TABLE weird_table
(
TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
DROPPABLE NOT SERIALIZED
, GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
)
SALT USING 4 PARTITIONS
ON (TIME_STAMP)
DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
NAMED AS ("_DIVISION_1_"))
ATTRIBUTES ALIGNED FORMAT
HBASE_OPTIONS
(
DATA_BLOCK_ENCODING = 'FAST_DIFF',
COMPRESSION = 'GZ',
MEMSTORE_FLUSH_SIZE = '1073741824'
)
;
drop table if exists weird_sample;
CREATE TABLE weird_sample
(
TIME_STAMP TIMESTAMP(6) NO DEFAULT NOT NULL NOT
DROPPABLE NOT SERIALIZED
, GROUPID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, COMPANYID LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, CHECKSUM LARGEINT NO DEFAULT NOT NULL NOT DROPPABLE
NOT SERIALIZED
, PRIMARY KEY (COMPANYID ASC, GROUPID ASC, TIME_STAMP DESC, CHECKSUM ASC)
)
SALT USING 4 PARTITIONS
ON (TIME_STAMP)
DIVISION BY (DATE_TRUNC('DAY',TIME_STAMP) DESC
NAMED AS ("_DIVISION_1_"))
ATTRIBUTES ALIGNED FORMAT
HBASE_OPTIONS
(
DATA_BLOCK_ENCODING = 'FAST_DIFF',
COMPRESSION = 'GZ',
MEMSTORE_FLUSH_SIZE = '1073741824'
)
;
Script populate.sql:
-- populate the table in such a way that the divisioning column
-- has a few very frequent values but just one infrequent value
upsert using load into weird_table
select timestamp '1970-01-01 00:00:00' + cast(c0+10*c1+100*c2+1000*c3+10000*c4
as interval minute(8)),
3*c0 + 7*c4,
c4,
11*c2 + 2*c3 + 3*c1
from (values(1)) t
transpose 0,1,2,3,4,5,6,7,8,9 as c0
transpose 0,1,2,3,4,5,6,7,8,9 as c1
transpose 0,1,2,3,4,5,6,7,8,9 as c2
transpose 0,1,2,3,4,5,6,7,8,9 as c3
transpose 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19 as c4;
insert into weird_table select max(TIME_STAMP) + interval '1' day, 999, 999,
999 from weird_table;
upsert using load into weird_sample select * from weird_table sample random 10
percent;
upsert into weird_sample select * from weird_table where time_stamp = (select
max(time_stamp) from weird_table);
Script repro.sql:
cqd USTAT_SAMPLE_TABLE_NAME 'trafodion.sch.weird_sample' ;
-- below, use the number of rows in weird_sample
update statistics for table weird_table on "_DIVISION_1_" sample 20080 rows;
showstats for table weird_table on "_DIVISION_1_" detail;
In this particular example, the "showstats" showed a UEC of 9 for the last
interval. More dramatic values can be obtained by increasing the table size
relative to sample size; from the examples I have seen the UEC of the last
interval seems to be about equal to the ratio of table size to sample size.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)