Online validation can lead to data inconsistency during intensive DML ---------------------------------------------------------------------
Key: CORE-4973 URL: http://tracker.firebirdsql.org/browse/CORE-4973 Project: Firebird Core Issue Type: Bug Reporter: Pavel Zotov Attachments: bulk-records-exchange-with-online-validation-data-consistency-test.zip Consider following scenario (something similar to bank transactions when bulk of funds are moving from sources to targets and vice versa): 1. Let we have sequence (name = 'g') and two tables, QA & QB which both can act like source and target for "moving" their rows. 2. DDL of these tables is the same: (x int, y int, w int), and also they have compound indices on key (w, x, y). 3. Tables QA & QB are filled with 50'000 records, but value of field 'w' changes in limit [0, 399]. 4. Value for writing in the field 'w' is calculated during initial data filling process by formula: mod( gen_id(g,1), 400). After 50'000 rows will be inserted into each of tables QA|QB, following queries: SELECT W, COUNT(*) as CNT FROM QA UNION ALL SELECT W, COUNT(*) as CNT FROM QB -- will return two rows with the same value of CNT = 125. 5. When some ISQL will start job by moving dozen of records from QA to QB (or vice versa), total number of rows with the same value of 'W' in tables QA & QB should ALWAYS be equal 125 + 125 = 250. This value (250) should remain constant regardless of number ISQL sessions and their concurrent conflicts. And of course, this value should not be changed by any actions that can occur against database - e.g. online validation. Some time ago I've found strange effect: when online validation is launched during several (even single!) ISQL doing work on 'moving' records data in the tables QA and QB can become inconsistent. The rule about total sum of counts = 125 + 125 = 250 will be broken. Following is about test implementation. 1. DDL-script for creating test database and initial data filling (see also in attached .zip, file "exch-ddl.sql"): === create or alter view v_check_for_mismatch as select 1 id from rdb$database; recreate exception exc_strange_gds 'Got strange gdscode = @1.'; recreate exception exc_mism_detected 'Mismatch detected at least for id=@1.'; set term ^; execute block as begin execute statement 'drop sequence g'; when any do begin end end ^ set term ;^ commit; create sequence g; recreate table stoptest(id int, qa_cnt int, qb_cnt int); recreate table qa(x int, y int, w int); recreate table qb(x int, y int, w int); recreate table exc_log(fbgds int); insert into qa(x, y, w) select rand()*5, rand()*5, mod( gen_id(g,1), 400) from rdb$types,rdb$types rows 50000; insert into qb(x, y, w) select rand()*5, rand()*5, mod( gen_id(g,1), 400) from rdb$types,rdb$types rows 50000; commit; create index qa_xy on qa(w, x, y); create index qb_xy on qb(w, x, y); commit; recreate global temporary table gtt_job(w int primary key using index gtt_job_pk, c int); commit; create or alter view v_check_for_mismatch as select w, max(qa_cnt) qa_cnt, max(qb_cnt) qb_cnt from ( select 'qa' src, w, count(*) qa_cnt, 0 as qb_cnt from qa group by w union all select 'qb' src, w, 0, count(*) qb_cnt from qb group by w ) group by w having max(qa_cnt) + max(qb_cnt) is distinct from 250; === 2. This script is what each ISQL should perform (see also file "exch-dml.sql" in attach): === set list on; set bail on; commit; set transaction snapshot no wait; select current_connection ,current_transaction ,'Start job at ' || current_timestamp as msg from rdb$database; set term ^; execute block as declare m int = 50; declare i int; declare a_w int; declare v_c int; declare v_w int; declare v_x int; declare v_y int; declare c_qa cursor for (select w, x, y from qa where w = :a_w); declare c_qb cursor for (select w, x, y from qb where w = :a_w); declare v_qa_inserted int = 0; declare v_qb_inserted int = 0; declare v_qa_deleted int = 0; declare v_qb_deleted int = 0; begin rdb$set_context('USER_SESSION', 'QA_INSERTED', null); rdb$set_context('USER_SESSION', 'QB_INSERTED', null); rdb$set_context('USER_SESSION', 'QA_DELETED', null); rdb$set_context('USER_SESSION', 'QB_DELETED' , null); if ( exists( select * from stoptest ) ) then exception exc_mism_detected using ( (select id from stoptest rows 1) ); delete from gtt_job; i = m; while (i > 0) do begin merge into gtt_job t using (select cast(rand()*400 as int) w, cast(rand()*20 as int) c from rdb$database union all select 1,1 from rdb$database where 1=0) s on s.w = t.w when not matched then insert values(s.w, s.c) when matched then update set t.c = t.c + s.c ; i = i -1; end for select w, c from gtt_job into a_w, v_c do begin open c_qa; while (v_c > 0) do begin begin fetch c_qa into v_w, v_x, v_y; if (row_count = 0) then leave; delete from qa where current of c_qa; insert into qb(w, x, y) values(:v_w, :v_x, :v_y); v_qa_deleted = v_qa_deleted + 1; v_qb_inserted = v_qb_inserted + 1; when any do begin -- lock_conflict, concurrent_transaction, deadlock, update_conflict if ( gdscode NOT in (335544345, 335544878, 335544336,335544451 ) ) then begin in autonomous transaction do insert into exc_log(fbgds) values(gdscode); exception exc_strange_gds using (gdscode); end end end v_c = v_c - 1; end close c_qa; end delete from gtt_job; i = m; while (i > 0) do begin merge into gtt_job t using (select cast(rand()*400 as int) w, cast(rand()*20 as int) c from rdb$database union all select 1,1 from rdb$database where 1=0) s on s.w = t.w when not matched then insert values(s.w, s.c) when matched then update set t.c = t.c + s.c ; i = i -1; end for select w, c from gtt_job into a_w, v_c do begin open c_qb; while (v_c > 0) do begin begin fetch c_qb into v_w, v_x, v_y; if (row_count = 0) then leave; delete from qb where current of c_qb; insert into qa(w, x, y) values(:v_w, :v_x, :v_y); v_qb_deleted = v_qb_deleted + 1; v_qa_inserted = v_qa_inserted + 1; when any do begin -- lock_conflict, concurrent_transaction, deadlock, update_conflict if ( gdscode NOT in (335544345, 335544878, 335544336,335544451 ) ) then begin in autonomous transaction do insert into exc_log(fbgds) values(gdscode); exception exc_strange_gds using (gdscode); end end end v_c = v_c -1; end close c_qb; end rdb$set_context('USER_SESSION', 'QA_INSERTED', v_qa_inserted); rdb$set_context('USER_SESSION', 'QB_INSERTED', v_qb_inserted); rdb$set_context('USER_SESSION', 'QA_DELETED', v_qa_deleted); rdb$set_context('USER_SESSION', 'QB_DELETED' , v_qb_deleted); end ^ set term ;^ select current_connection ,current_transaction ,'Finish job at ' || current_timestamp as msg ,rdb$get_context('USER_SESSION', 'QA_INSERTED') as qa_inserted ,rdb$get_context('USER_SESSION', 'QB_INSERTED') as qb_inserted ,rdb$get_context('USER_SESSION', 'QA_DELETED') as qa_deleted ,rdb$get_context('USER_SESSION', 'QB_DELETED') as qb_deleted ,iif( rdb$get_context('USER_SESSION', 'QA_INSERTED') is distinct from rdb$get_context('USER_SESSION', 'QB_DELETED') or rdb$get_context('USER_SESSION', 'QB_INSERTED') is distinct from rdb$get_context('USER_SESSION', 'QA_DELETED') ,'### ACHTUNG ###' ,'Ok.' ) as check_result from rdb$database; commit; set bail off; === 3. This is main batch file ("exch-run.bat" in attached .zip): === @echo off setlocal enabledelayedexpansion enableextensions set winq=%1 if .%winq%.==.. set winq=30 md logs 2>nul set host=localhost set port=3333 set dbnm=%~dp0tmp.fdb set usr=SYSDBA set pwd=masterkey set fbc=C:\MIX\firebird\fb30 set stopfile=logs\1mismatches.txt set delay=5 del %dbnm% 2>nul del %stopfile% 2>nul del logs\*.log 2>nul del logs\*.err 2>nul echo create database '%host%/%port%:%dbnm%' user '%usr%' password '%pwd%'; | %fbc%\isql -q %fbc%\gfix -w async %host%/%port%:%dbnm% echo on echo in %~dp0exch-ddl.sql; | %fbc%\isql %host%/%port%:%dbnm% -q -user %usr% -password %pwd% @echo off @rem ############### set run_online_val=1 @rem ############### echo. echo Test database is ready. Validation setting: %run_online_val% echo. echo Press any key to launch %winq% ISQL sessions. . . echo. pause > nul for /l %%i in (0, 1, !winq!) do ( if .%%i.==.0. ( if .%run_online_val%.==.1. ( start /min cmd /c exch-val.bat %host% %port% %dbnm% %usr% %pwd% %fbc% %stopfile% %delay% ) ) else ( set /a k=1000+%%i set k=!k:~1,3! set log=logs\%~n0-!k!.log set err=logs\%~n0-!k!.err del !log! 2>nul del !err! 2>nul set cmd_sql=exch-dml.bat %host% %port% %dbnm% %usr% %pwd% %stopfile% ^| %fbc%\isql -q 1^>!log! 2^>!err! start /min cmd /c !cmd_sql! ) ) for /l %%i in (1,1,1) do ( set log=logs\%~n0-chk.log set err=logs\%~n0-chk.err del !log! 2>nul del !err! 2>nul set cmd_sql=exch-chk.bat %host% %port% %dbnm% %usr% %pwd% %stopfile% %delay% !log! ^| isql -q -pag 9999 1^>!log! 2^>!err! echo !cmd_sql! start /min cmd /c !cmd_sql! ) echo Main batch has done its work, now wait for mismatches. Bye-bye from %~f0. exit === Change setting in this batch to yours, namely: host, port, dbnm, usr, pwd and fbc (path to FB binaries on client machine). Starting part of this batch is: ... for /l %%i in (0, 1, !winq!) do ( ... ) ... This part will launch in separate windows two auxiliary batches (they also are in attached .zip): a) batch for online validation will be run ('exch-val.bat') - but only if setting %run_online_val% = 1, see below about it. b) batch for each ISQL session will be launched in order to make bulk records movements ('exch-dml.bat') Final part of main batch ("exch-run.bat") is: ... set cmd_sql=exch-chk.bat %host% %port% %dbnm% %usr% %pwd% %stopfile% %delay% !log! ^| isql -q -pag 9999 1^>!log! 2^>!err! echo !cmd_sql! start /min cmd /c !cmd_sql! ... This part will launch batch which make check of data consistency in the both tables (QA & QB). It does such check bu querying special VIEW ("v_check_for_mismatch") and inserting its result into 'stop'-table: ... out %stopfile%; insert into stoptest(id, qa_cnt, qb_cnt) select w, qa_cnt, qb_cnt from v_check_for_mismatch; set list off; select x.* from ( select s.*, 250 as expected_sum, s.qa_cnt + s.qb_cnt actual_sum from stoptest s ^) x order by abs( x.expected_sum - x.actual_sum ^) desc, id; set list on; commit; out; ... When at least one mismatch will be found, table 'stoptest' will become non-empty. Every ISQL session checks this table on every iteration. When at least one record appear in 'stoptest', ISQL session will rase exception (with name "exc_mism_detected") and will terminate itself. Now pay attention on this command: @rem ############### set run_online_val=0 @rem ############### When this variable is set to 0, you can launch as much ISQL sessions as you want, and all of them will work infinite time. And when this variable is changed to 1 (i.e. ONLINE VALIDATION will be launched in loop with some delay), even single ISQL session will work no more than several minutes. Test will finish with exception that is raised because mismatches will be found by "exch-chk.bat" and its script. So, online validation AFFECTS in some mysterious way on data consistency. PS. This strange affect can also be observed in the trace log. For SUCCESSFUL execution of each DML iteration trace will have lines like these: === 0 records fetched 228 ms, 1 write(s), 15429 fetch(es), 3797 mark(s) Table Natural Index Update Insert Delete ********************************************************************************* RDB$DATABASE 200 RDB$INDICES 1 QA 541 599 541 QB 599 541 599 GTT_JOB 146 10 3 97 49 === (i.e. QA_inserts = QB_deletes = 599; QA_deletes = QB_inserts = 541 - see above). But when check for data consistency FAILS, one may found in the trace following: === Table Natural Index Update Insert Delete Backout Purge Expunge *************************************************************************************************************** RDB$DATABASE 200 RDB$INDICES 1 QA 3145 186 1 QB 964 GTT_JOB 317 13 9 91 44 === or: === Table Natural Index Update Insert Delete Backout Purge Expunge *************************************************************************************************************** RDB$DATABASE 200 RDB$INDICES 1 QA 541 599 457 101 QB 599 316 599 GTT_JOB 146 10 3 97 49 === or: === Table Natural Index Update Insert Delete ********************************************************************************* RDB$DATABASE 200 RDB$INDICES 3 QA 534 463 337 QB 463 74 463 GTT_JOB 146 9 3 97 49 === (in all these samples QA_inserts <> QB_deletes or QA_deletes <> QB_inserts). PPS. Done at WI-V3.0.0.32084, but I've seen this effect yet in july/august-2015 during launch online-validation against OLTP-EMUL test was in work. -- This message is automatically generated by JIRA. - If you think it was sent incorrectly contact one of the administrators: http://tracker.firebirdsql.org/secure/Administrators.jspa - For more information on JIRA, see: http://www.atlassian.com/software/jira ------------------------------------------------------------------------------ Firebird-Devel mailing list, web interface at https://lists.sourceforge.net/lists/listinfo/firebird-devel