Online validation can lead to data inconsistency during intensive DML
---------------------------------------------------------------------
Key: CORE-4973
URL: http://tracker.firebirdsql.org/browse/CORE-4973
Project: Firebird Core
Issue Type: Bug
Reporter: Pavel Zotov
Attachments:
bulk-records-exchange-with-online-validation-data-consistency-test.zip
Consider following scenario (something similar to bank transactions when bulk
of funds are moving from sources to targets and vice versa):
1. Let we have sequence (name = 'g') and two tables, QA & QB which both can act
like source and target for "moving" their rows.
2. DDL of these tables is the same: (x int, y int, w int), and also they have
compound indices on key (w, x, y).
3. Tables QA & QB are filled with 50'000 records, but value of field 'w'
changes in limit [0, 399].
4. Value for writing in the field 'w' is calculated during initial data filling
process by formula: mod( gen_id(g,1), 400).
After 50'000 rows will be inserted into each of tables QA|QB, following
queries:
SELECT W, COUNT(*) as CNT FROM QA
UNION ALL
SELECT W, COUNT(*) as CNT FROM QB
-- will return two rows with the same value of CNT = 125.
5. When some ISQL will start job by moving dozen of records from QA to QB (or
vice versa), total number of rows with the same
value of 'W' in tables QA & QB should ALWAYS be equal 125 + 125 = 250. This
value (250) should remain constant regardless
of number ISQL sessions and their concurrent conflicts.
And of course, this value should not be changed by any actions that can
occur against database - e.g. online validation.
Some time ago I've found strange effect: when online validation is launched
during several (even single!) ISQL doing work on 'moving' records
data in the tables QA and QB can become inconsistent. The rule about total sum
of counts = 125 + 125 = 250 will be broken.
Following is about test implementation.
1. DDL-script for creating test database and initial data filling (see also in
attached .zip, file "exch-ddl.sql"):
===
create or alter view v_check_for_mismatch as select 1 id from rdb$database;
recreate exception exc_strange_gds 'Got strange gdscode = @1.';
recreate exception exc_mism_detected 'Mismatch detected at least for
id=@1.';
set term ^;
execute block as
begin
execute statement 'drop sequence g';
when any do begin end
end
^ set term ;^
commit;
create sequence g;
recreate table stoptest(id int, qa_cnt int, qb_cnt int);
recreate table qa(x int, y int, w int);
recreate table qb(x int, y int, w int);
recreate table exc_log(fbgds int);
insert into qa(x, y, w)
select rand()*5, rand()*5, mod( gen_id(g,1), 400)
from rdb$types,rdb$types
rows 50000;
insert into qb(x, y, w)
select rand()*5, rand()*5, mod( gen_id(g,1), 400)
from rdb$types,rdb$types
rows 50000;
commit;
create index qa_xy on qa(w, x, y);
create index qb_xy on qb(w, x, y);
commit;
recreate global temporary table gtt_job(w int primary key using index
gtt_job_pk, c int);
commit;
create or alter view v_check_for_mismatch as
select w, max(qa_cnt) qa_cnt, max(qb_cnt) qb_cnt
from (
select 'qa' src, w, count(*) qa_cnt, 0 as qb_cnt
from qa
group by w
union all
select 'qb' src, w, 0, count(*) qb_cnt
from qb
group by w
)
group by w
having max(qa_cnt) + max(qb_cnt) is distinct from 250;
===
2. This script is what each ISQL should perform (see also file "exch-dml.sql"
in attach):
===
set list on;
set bail on;
commit;
set transaction snapshot no wait;
select
current_connection
,current_transaction
,'Start job at ' || current_timestamp as msg
from rdb$database;
set term ^;
execute block
as
declare m int = 50;
declare i int;
declare a_w int;
declare v_c int;
declare v_w int;
declare v_x int;
declare v_y int;
declare c_qa cursor for (select w, x, y from qa where w = :a_w);
declare c_qb cursor for (select w, x, y from qb where w = :a_w);
declare v_qa_inserted int = 0;
declare v_qb_inserted int = 0;
declare v_qa_deleted int = 0;
declare v_qb_deleted int = 0;
begin
rdb$set_context('USER_SESSION', 'QA_INSERTED', null);
rdb$set_context('USER_SESSION', 'QB_INSERTED', null);
rdb$set_context('USER_SESSION', 'QA_DELETED', null);
rdb$set_context('USER_SESSION', 'QB_DELETED' , null);
if ( exists( select * from stoptest ) )
then
exception exc_mism_detected using ( (select id from stoptest rows
1) );
delete from gtt_job;
i = m;
while (i > 0) do
begin
merge into gtt_job t
using (select cast(rand()*400 as int) w, cast(rand()*20 as int) c
from rdb$database union all select 1,1 from rdb$database where 1=0) s
on s.w = t.w
when not matched then insert values(s.w, s.c)
when matched then update set t.c = t.c + s.c
;
i = i -1;
end
for select w, c from gtt_job into a_w, v_c
do begin
open c_qa;
while (v_c > 0) do
begin
begin
fetch c_qa into v_w, v_x, v_y;
if (row_count = 0) then leave;
delete from qa where current of c_qa;
insert into qb(w, x, y) values(:v_w, :v_x, :v_y);
v_qa_deleted = v_qa_deleted + 1;
v_qb_inserted = v_qb_inserted + 1;
when any do
begin
-- lock_conflict, concurrent_transaction, deadlock,
update_conflict
if ( gdscode NOT in (335544345, 335544878, 335544336,335544451
) )
then
begin
in autonomous transaction do insert into exc_log(fbgds)
values(gdscode);
exception exc_strange_gds using (gdscode);
end
end
end
v_c = v_c - 1;
end
close c_qa;
end
delete from gtt_job;
i = m;
while (i > 0) do
begin
merge into gtt_job t
using (select cast(rand()*400 as int) w, cast(rand()*20 as int) c
from rdb$database union all select 1,1 from rdb$database where 1=0) s
on s.w = t.w
when not matched then insert values(s.w, s.c)
when matched then update set t.c = t.c + s.c
;
i = i -1;
end
for select w, c from gtt_job into a_w, v_c
do begin
open c_qb;
while (v_c > 0) do
begin
begin
fetch c_qb into v_w, v_x, v_y;
if (row_count = 0) then leave;
delete from qb where current of c_qb;
insert into qa(w, x, y) values(:v_w, :v_x, :v_y);
v_qb_deleted = v_qb_deleted + 1;
v_qa_inserted = v_qa_inserted + 1;
when any do
begin
-- lock_conflict, concurrent_transaction, deadlock,
update_conflict
if ( gdscode NOT in (335544345, 335544878, 335544336,335544451
) )
then
begin
in autonomous transaction do insert into exc_log(fbgds)
values(gdscode);
exception exc_strange_gds using (gdscode);
end
end
end
v_c = v_c -1;
end
close c_qb;
end
rdb$set_context('USER_SESSION', 'QA_INSERTED', v_qa_inserted);
rdb$set_context('USER_SESSION', 'QB_INSERTED', v_qb_inserted);
rdb$set_context('USER_SESSION', 'QA_DELETED', v_qa_deleted);
rdb$set_context('USER_SESSION', 'QB_DELETED' , v_qb_deleted);
end
^
set term ;^
select
current_connection
,current_transaction
,'Finish job at ' || current_timestamp as msg
,rdb$get_context('USER_SESSION', 'QA_INSERTED') as qa_inserted
,rdb$get_context('USER_SESSION', 'QB_INSERTED') as qb_inserted
,rdb$get_context('USER_SESSION', 'QA_DELETED') as qa_deleted
,rdb$get_context('USER_SESSION', 'QB_DELETED') as qb_deleted
,iif( rdb$get_context('USER_SESSION', 'QA_INSERTED') is distinct from
rdb$get_context('USER_SESSION', 'QB_DELETED')
or
rdb$get_context('USER_SESSION', 'QB_INSERTED') is distinct from
rdb$get_context('USER_SESSION', 'QA_DELETED')
,'### ACHTUNG ###'
,'Ok.'
) as check_result
from rdb$database;
commit;
set bail off;
===
3. This is main batch file ("exch-run.bat" in attached .zip):
===
@echo off
setlocal enabledelayedexpansion enableextensions
set winq=%1
if .%winq%.==.. set winq=30
md logs 2>nul
set host=localhost
set port=3333
set dbnm=%~dp0tmp.fdb
set usr=SYSDBA
set pwd=masterkey
set fbc=C:\MIX\firebird\fb30
set stopfile=logs\1mismatches.txt
set delay=5
del %dbnm% 2>nul
del %stopfile% 2>nul
del logs\*.log 2>nul
del logs\*.err 2>nul
echo create database '%host%/%port%:%dbnm%' user '%usr%' password '%pwd%';
| %fbc%\isql -q
%fbc%\gfix -w async %host%/%port%:%dbnm%
echo on
echo in %~dp0exch-ddl.sql; | %fbc%\isql %host%/%port%:%dbnm% -q -user %usr%
-password %pwd%
@echo off
@rem ###############
set run_online_val=1
@rem ###############
echo.
echo Test database is ready. Validation setting: %run_online_val%
echo.
echo Press any key to launch %winq% ISQL sessions. . .
echo.
pause > nul
for /l %%i in (0, 1, !winq!) do (
if .%%i.==.0. (
if .%run_online_val%.==.1. (
start /min cmd /c exch-val.bat %host% %port% %dbnm% %usr% %pwd% %fbc%
%stopfile% %delay%
)
) else (
set /a k=1000+%%i
set k=!k:~1,3!
set log=logs\%~n0-!k!.log
set err=logs\%~n0-!k!.err
del !log! 2>nul
del !err! 2>nul
set cmd_sql=exch-dml.bat %host% %port% %dbnm% %usr% %pwd% %stopfile% ^|
%fbc%\isql -q 1^>!log! 2^>!err!
start /min cmd /c !cmd_sql!
)
)
for /l %%i in (1,1,1) do (
set log=logs\%~n0-chk.log
set err=logs\%~n0-chk.err
del !log! 2>nul
del !err! 2>nul
set cmd_sql=exch-chk.bat %host% %port% %dbnm% %usr% %pwd% %stopfile%
%delay% !log! ^| isql -q -pag 9999 1^>!log! 2^>!err!
echo !cmd_sql!
start /min cmd /c !cmd_sql!
)
echo Main batch has done its work, now wait for mismatches. Bye-bye from
%~f0.
exit
===
Change setting in this batch to yours, namely: host, port, dbnm, usr, pwd and
fbc (path to FB binaries on client machine).
Starting part of this batch is:
...
for /l %%i in (0, 1, !winq!) do (
...
)
...
This part will launch in separate windows two auxiliary batches (they also are
in attached .zip):
a) batch for online validation will be run ('exch-val.bat') - but only if
setting %run_online_val% = 1, see below about it.
b) batch for each ISQL session will be launched in order to make bulk records
movements ('exch-dml.bat')
Final part of main batch ("exch-run.bat") is:
...
set cmd_sql=exch-chk.bat %host% %port% %dbnm% %usr% %pwd% %stopfile%
%delay% !log! ^| isql -q -pag 9999 1^>!log! 2^>!err!
echo !cmd_sql!
start /min cmd /c !cmd_sql!
...
This part will launch batch which make check of data consistency in the both
tables (QA & QB).
It does such check bu querying special VIEW ("v_check_for_mismatch") and
inserting its result into 'stop'-table:
...
out %stopfile%;
insert into stoptest(id, qa_cnt, qb_cnt)
select w, qa_cnt, qb_cnt
from v_check_for_mismatch;
set list off;
select x.*
from (
select s.*, 250 as expected_sum, s.qa_cnt + s.qb_cnt actual_sum
from stoptest s
^) x
order by abs( x.expected_sum - x.actual_sum ^) desc, id;
set list on;
commit;
out;
...
When at least one mismatch will be found, table 'stoptest' will become
non-empty.
Every ISQL session checks this table on every iteration. When at least one
record appear in 'stoptest', ISQL session will rase exception (with name
"exc_mism_detected") and will terminate itself.
Now pay attention on this command:
@rem ###############
set run_online_val=0
@rem ###############
When this variable is set to 0, you can launch as much ISQL sessions as you
want, and all of them will work infinite time.
And when this variable is changed to 1 (i.e. ONLINE VALIDATION will be launched
in loop with some delay), even single ISQL
session will work no more than several minutes.
Test will finish with exception that is raised because mismatches will be found
by "exch-chk.bat" and its script.
So, online validation AFFECTS in some mysterious way on data consistency.
PS. This strange affect can also be observed in the trace log.
For SUCCESSFUL execution of each DML iteration trace will have lines like these:
===
0 records fetched
228 ms, 1 write(s), 15429 fetch(es), 3797 mark(s)
Table Natural Index Update Insert
Delete
*********************************************************************************
RDB$DATABASE 200
RDB$INDICES 1
QA 541 599
541
QB 599 541
599
GTT_JOB 146 10 3 97
49
===
(i.e. QA_inserts = QB_deletes = 599; QA_deletes = QB_inserts = 541 - see above).
But when check for data consistency FAILS, one may found in the trace following:
===
Table Natural Index Update Insert
Delete Backout Purge Expunge
***************************************************************************************************************
RDB$DATABASE 200
RDB$INDICES 1
QA 3145
186 1
QB 964
GTT_JOB 317 13 9 91
44
===
or:
===
Table Natural Index Update Insert
Delete Backout Purge Expunge
***************************************************************************************************************
RDB$DATABASE 200
RDB$INDICES 1
QA 541 599
457 101
QB 599 316
599
GTT_JOB 146 10 3 97
49
===
or:
===
Table Natural Index Update Insert
Delete
*********************************************************************************
RDB$DATABASE 200
RDB$INDICES 3
QA 534 463
337
QB 463 74
463
GTT_JOB 146 9 3 97
49
===
(in all these samples QA_inserts <> QB_deletes or QA_deletes <> QB_inserts).
PPS.
Done at WI-V3.0.0.32084, but I've seen this effect yet in july/august-2015
during launch online-validation against OLTP-EMUL test was in work.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
http://tracker.firebirdsql.org/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
------------------------------------------------------------------------------
Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel