[
https://issues.apache.org/jira/browse/TRAFODION-2913?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16329117#comment-16329117
]
David Wayne Birdsall edited comment on TRAFODION-2913 at 1/17/18 6:11 PM:
--------------------------------------------------------------------------
Here is how I did the recalibration.
# I created a table and, using gdb in IndexDesc::pruneMdam, determined the
row size, block size and rows per block computed for that table.
# I created a sqlci script that would populate a similar table with 500,000
rows, then used that table to populate other tables of varying sizes: 1 block
(=4096 rows in this case), 10 blocks and 100 blocks. The latter tables have a
two-column key. The UEC of the first column was controlled by using the MOD
function when populating; the second column had unique values.
# I then ran a Python script that runs a query against a table, forcing simple
scan plans and various MDAM plans, and capturing the elapsed execution time.
The script runs each query three times, capturing execution times for each.
(The first time tends to be larger because the table gets read into HBase
cache. The second and third times tend to benefit from the table already being
in cache. So the read time is amortized across the three executions.)
# The tables I created were single-partition, so I looked only at serial
simple scan and serial MDAM 2-deep plan times. I noted the average execution
times for each, and noted also what the Optimizer was picking (using the old
MDAM costing code, which is the default choice today).
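The three-run timing approach in step 3 can be sketched as follows (a minimal Python sketch; `run_query` is a hypothetical stand-in for however MdamTests2.py actually executes a statement):

```python
import time

def time_query(run_query, statement, executions=3):
    """Execute a statement several times and return the average elapsed time.

    The first execution typically pays the cost of reading the table into
    the HBase cache; later executions benefit from the warm cache, so
    averaging amortizes the one-time read cost across all runs.
    """
    elapsed = []
    for _ in range(executions):
        start = time.time()
        run_query(statement)
        elapsed.append(time.time() - start)
    return sum(elapsed) / len(elapsed)
```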
The inequality we are trying to tune in IndexDesc::pruneMdam is UEC <= Blocks
* MDAM_SELECTION_DEFAULT, that is, UEC / Blocks <= MDAM_SELECTION_DEFAULT. So,
we want this inequality to hold true when MDAM plans win (or even if MDAM plans
are close to winning), and allow this inequality to be false when MDAM plans
are bad.
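That decision can be written as a one-line predicate (a sketch of the inequality only, not the actual C++ in IndexDesc::pruneMdam):

```python
def mdam_looks_promising(uec, blocks, mdam_selection_default=8.0):
    # Consider MDAM when UEC <= Blocks * MDAM_SELECTION_DEFAULT,
    # i.e. when UEC / Blocks <= MDAM_SELECTION_DEFAULT.
    return uec <= blocks * mdam_selection_default
```

For example, on a 10-block table a leading-column UEC of 80 passes (80 <= 10 * 8.0) while a UEC of 200 does not.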
I ran this test with UECs of 2, 20, 50, 80, 200, 500 and 800. The inequality
above worked best when MDAM_SELECTION_DEFAULT was given a value of 8.0.
The table shape I used is below:
CREATE TABLE TRAFODION.SCH.T2
(
A INT NO DEFAULT NOT NULL NOT DROPPABLE NOT
SERIALIZED
, B INT NO DEFAULT NOT NULL NOT DROPPABLE NOT
SERIALIZED
, C INT DEFAULT NULL NOT SERIALIZED
, PRIMARY KEY (A ASC, B ASC)
)
ATTRIBUTES ALIGNED FORMAT
;
In IndexDesc::pruneMdam, I saw the following for block size, row length and
rows per block:
(gdb) p blockSize
$8 = {dpv_ = 65536, static ovflwCount_ = 0, static udflwCount_ = 0}
(gdb) p recordLength
$9 = {dpv_ = 16, static ovflwCount_ = 0, static udflwCount_ = 0}
(gdb) p recordsPerBlock
$10 = {dpv_ = 4096, static ovflwCount_ = 0, static udflwCount_ = 0}
(gdb)
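These three values are mutually consistent: rows per block is simply the block size divided by the record length. A quick arithmetic check:

```python
block_size = 65536                               # dpv_ of blockSize above
record_length = 16                               # dpv_ of recordLength above
records_per_block = block_size // record_length
assert records_per_block == 4096                 # matches recordsPerBlock above
```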
Below is the sqlci script I used to populate the tables. (I manually edited the
script for different UECs.)
drop table if exists t2sourceuec200;
drop table if exists t2blk1uec200;
drop table if exists t2blk10uec200;
drop table if exists t2blk100uec200;
create table t2sourceuec200(a int not null, b int not null, c int, primary key
(b,a)); -- key reversed for efficient upserts below
upsert using load into t2sourceuec200
select
mod(x1+10*x2+100*x3+1000*x4+10000*x5+100000*x6, 200),
x1+10*x2+100*x3+1000*x4+10000*x5+100000*x6,
x1+10*x2+100*x3+1000*x4+10000*x5+100000*x6
from (values(0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) T(x1)
transpose 0,1,2,3,4,5,6,7,8,9 as x2
transpose 0,1,2,3,4,5,6,7,8,9 as x3
transpose 0,1,2,3,4,5,6,7,8,9 as x4
transpose 0,1,2,3,4,5,6,7,8,9 as x5
transpose 0,1,2,3,4 as x6;
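The transpose expression enumerates every combination of the digits x1..x6, so the sum x1+10*x2+...+100000*x6 takes each value from 0 through 499,999 exactly once, and MOD 200 caps the UEC of the first key column at 200. A Python sketch checking an equivalent generator:

```python
# Same (a, b, c) values the transpose query produces, one tuple per row
rows = [(n % 200, n, n) for n in range(500000)]

assert len(rows) == 500000                     # 10*10*10*10*10*5 combinations
assert len({a for a, _, _ in rows}) == 200     # UEC of column A
assert len({b for _, b, _ in rows}) == 500000  # column B is unique
```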
create table t2blk1uec200 (a int not null, b int not null, c int, primary key
(a,b));
create table t2blk10uec200 like t2blk1uec200;
create table t2blk100uec200 like t2blk1uec200;
upsert using load into t2blk1uec200
select a,b,c from t2sourceuec200 where b < 4096;
upsert using load into t2blk10uec200
select a,b,c from t2sourceuec200 where b < 40960;
upsert using load into t2blk100uec200
select a,b,c from t2sourceuec200 where b < 409600;
update statistics for table t2blk1uec200 on every column;
update statistics for table t2blk10uec200 on every column;
update statistics for table t2blk100uec200 on every column;
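Because b is dense and unique, each `b < N` predicate selects exactly N rows, and at 4,096 rows per block (from the gdb output above) that yields the intended 1-, 10- and 100-block tables:

```python
rows_per_block = 4096  # recordsPerBlock value observed in gdb

for cutoff, expected_blocks in [(4096, 1), (40960, 10), (409600, 100)]:
    selected_rows = cutoff  # b is unique and dense, so "b < cutoff" selects cutoff rows
    assert selected_rows // rows_per_block == expected_blocks
```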
The Python script I used to force plans and measure execution times is
MdamTests2.py, a copy of which is attached.
The Python script generates INSERT statements with plan and performance data,
which can be uploaded into tables created with the following DDL:
-- Creates tables for data produced by MdamTests2.py script
?section tables
-- The Results table has one row for each statement processed
-- by MdamTests2.py
drop table if exists results2 cascade;
create table results2
(
test_run numeric(17,6) not null,
plan_name varchar(28) not null,
statement_text varchar(4096) not null,
prepare_error varchar(256) not null,
execute_error varchar(256) not null,
fragment_type char(10) not null,
executor_preds varchar(1024) not null,
mdam_disjunct varchar(1024) not null,
begin_key varchar(1024) not null,
end_key varchar(1024) not null,
primary key (test_run)
)
salt using 4 partitions;
-- The Resultstats table has one row for each execution of a
-- statement processed by MdamTests2.py
-- Can join it to Results on test_run
drop table if exists resultstats2 cascade;
create table resultstats2
(
test_run numeric(17,6) not null,
execution_instance smallint not null,
start_time timestamp(6) not null,
end_time timestamp(6) not null,
records_accessed largeint not null,
records_used largeint not null,
hdfs_ios largeint not null,
hdfs_iobytes largeint not null,
hdfs_usecs largeint not null,
primary key (test_run, execution_instance)
)
salt using 4 partitions;
?section views
drop view resultsview2 cascade;
CREATE VIEW resultsview2 AS
SELECT r.test_run,
CAST(r.STATEMENT_TEXT AS CHAR(72)) AS ST,
case when r.plan_name = 'Simple scan' then
case when r.fragment_type = 'master' then 'Serial simple scan'
else 'Parallel simple scan' end
else r.plan_name end as plan_name,
r.mdam_disjunct,
min(s.end_time - s.start_time) as mintime,
max(s.end_time - s.start_time) as maxtime,
avg(s.end_time - s.start_time) as avgtime,
min(s.records_accessed) as minaccessed,
max(s.records_accessed) as maxaccessed,
avg(s.records_accessed) as avgaccessed,
min(s.records_used) as minused,
max(s.records_used) as maxused,
avg(s.records_used) as avgused,
min(s.hdfs_ios) as minios,
max(s.hdfs_ios) as maxios,
avg(s.hdfs_ios) as avgios,
min(s.hdfs_iobytes) as miniobytes,
max(s.hdfs_iobytes) as maxiobytes,
avg(s.hdfs_iobytes) as avgiobytes,
min(s.hdfs_usecs) as minusecs,
max(s.hdfs_usecs) as maxusecs,
avg(s.hdfs_usecs) as avgusecs
FROM RESULTS2 r JOIN resultstats2 s on r.test_run = s.test_run
group by
r.test_run,r.statement_text,r.plan_name,r.fragment_type,r.mdam_disjunct;
create view compareview2 as
select m.st,m.avgtime as mdamavg, m.maxtime as mdammax, m.mintime as mdammin,
s.avgtime as simpleavg, s.maxtime as simplemax, s.mintime as simplemin
from resultsview2 m join resultsview2 s
on m.st = s.st
where m.plan_name = 'Optimizer choice' and s.plan_name = 'Simple Scan';
The query I used to get average execution times for the different queries and
plans is below:
select rv.st,
case when rv.plan_name = 'Optimizer choice'
then rv.plan_name || ' (' || oc.parallel || ' ' || oc.root_plan_name || ')'
when rv.plan_name = 'New cost choice'
then rv.plan_name || ' (' || nc.parallel || ' ' || nc.root_plan_name || ')'
else rv.plan_name end,
rv.avgtime
from resultsview2 rv left join optchoice oc on rv.st = oc.stmt
left join newchoice nc on rv.st = nc.stmt
order by rv.st,rv.avgtime;
> Tweak some MDAM-related heuristics
> ----------------------------------
>
> Key: TRAFODION-2913
> URL: https://issues.apache.org/jira/browse/TRAFODION-2913
> Project: Apache Trafodion
> Issue Type: Bug
> Components: sql-cmp
> Affects Versions: 2.3
> Reporter: David Wayne Birdsall
> Assignee: David Wayne Birdsall
> Priority: Major
> Attachments: MdamTests2.py.txt
>
>
> While debugging a plan choice issue on a customer query, two issues were
> noted with MDAM heuristics.
> # When CQD FSO_TO_USE is set to '0', FileScanOptimizer::optimize attempts to
> perform logic similar to that in ScanOptimizer::getMdamStatus, checking the
> mdamFlag that is stored in the index descriptor. But the logic is not the
> same (the inevitable result of having two copies of something!); in the
> latter case the mdamFlag is ignored if CQD RANGESPEC_TRANSFORMATION is 'ON'
> while in the FileScanOptimizer::optimize logic no such additional check is
> made. Now, 'ON' is presently the default for RANGESPEC_TRANSFORMATION. So,
> we have the anomaly that using CQD FSO_TO_USE '0' to force consideration of
> MDAM might still not get MDAM because of a flag that we would ignore
> otherwise.
> # The mdamFlag in the IndexDesc object is set by IndexDesc::pruneMdam
> (optimizer/IndexDesc.cpp). There is heuristic logic there to guess whether
> MDAM will be useful for a given access path. The major purpose of this logic
> is index elimination: if we have several indexes, and some look like good
> choices for MDAM and others not, we eliminate the ones that are not. Only
> secondarily is this mdam flag later looked at by the scan optimizer, as
> described above in 1. The major purpose of this logic still seems reasonable,
> though the computation logic itself can be criticized for not considering the
> possibility of a parallel predicate on a leading "_SALT_" column, for
> example. But the computation involves a CQD, MDAM_SELECTION_DEFAULT, which is
> set to a low value by default. The customer query involved showed that the
> value used is too low; this flag ended up eliminating a favorable MDAM plan.
> The default was likely last determined in the predecessor product; given that
> the HBase engine has different execution dynamics this value needs to be
> recalibrated.