Vlad, All,

On 06/25/2014 05:58 PM, Vlad Khorsun wrote:
>> AFAIK, concurrent GC in pre-committed transactions without transaction lock 
>> was unsafe, and created
>> permanently broken databases.
>> I believe the corruption became especially apparent after the commit intended
>> to fix CORE-3515. I don't know who fixed it, but Red Soft builds have included
>> a patch for this problem for about a year. I was surprised to learn last week
>> that vanilla Firebird still has the problem.
>      Do you speak about CORE-3921 ? If yes, it was fixed more than half a 
> year ago.
>
> ...

Yes, you are correct. I didn't know about the existence of CORE-3921. By "vanilla"
I meant the latest released version, Firebird 2.5.2. I compared the patch with the
2.5.2 source code, and that was probably my mistake.

>>>> We tried to be extra careful not to hurt performance while establishing 
>>>> these snapshots.
>>>    > We do it much more efficiently than done for isc_tpb_concurrency.
>>>
>>>      AFAIU, you maintain a list of active transactions at the top-level request.
>>> You create this list using a new Lock Manager call which iterates through
>>> LCK_tra locks and calls a callback which fills a list (actually a BePlusTree)
>>> of active transactions.
>>>      Later the reader consults this list (i.e. searches the BePlusTree) to
>>> check transaction state.
>> Yes, the data structure used is a bit more involved, but in general you are 
>> correct.
>>
>> The BePlusTree-based SparseBitmap implementation is quite quick
>      Let me disagree here ;)

Please do not accuse me of a bad implementation without backing it up with some
sort of numbers.
BePlusTree and SparseBitmap have both spent quite some time under the profiler;
they are extremely fast and scalable data structures, usually much faster than the
ad-hoc data structures used elsewhere.
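To illustrate the access pattern under discussion, here is a minimal sketch of a
sparse bitmap over an ordered tree, standing in for the BePlusTree-based
SparseBitmap. The class and function names here are mine, not the actual Firebird
interfaces:

#include <cstdint>
#include <map>

// Sparse bitmap: only populated 64-bit blocks are stored in the tree, so
// memory stays proportional to the number of set bits, not the ID range.
class SparseBitmapSketch {
    static const uint64_t BITS = 64;
    std::map<uint64_t, uint64_t> blocks;  // block index -> 64-bit chunk
public:
    void set(uint64_t id) { blocks[id / BITS] |= uint64_t(1) << (id % BITS); }
    bool test(uint64_t id) const {
        const auto it = blocks.find(id / BITS);
        return it != blocks.end() && ((it->second >> (id % BITS)) & 1);
    }
};

// Reader-side check: was this transaction active when the snapshot was taken?
bool activeAtSnapshot(const SparseBitmapSketch& snapshot, uint64_t traId)
{
    return snapshot.test(traId);  // O(log n) in populated blocks, no page I/O
}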

>
>> and is currently used in the most performance-sensitive parts of the code 
>> (data index retrieval by
>> index, for example).
>      Sorry, but I don't understand what is "data index retrieval by index"...

You know that it is used to store intermediate RecordBitmaps during index retrieval
operations. This is very performance-sensitive code, and this implementation
outperformed, and used less memory than, the older SBM-based implementation.

>>>      Is it really faster than copying part of the (already cached) TIP contents
>>> into the transaction and then testing a bit in a plain vector ? Also, it allows
>>> reuse of existing code.
>> No, this would actually have horrible performance characteristics. Take
>> BroadView, for example. They process many millions of transactions daily, and a
>> TIP size of 1500-4000 pages is typical. Imagine how costly it is to copy such a
>> thing?
>      Do you speak about the active part of the TIP (between OIT and Next) or
> about the whole TIP ? I can easily imagine a TIP of 4000 pages. It is not a
> problem as we have no need to copy all those pages, only starting from the OIT
> value. But if BroadView permanently has a stuck OIT with a difference of a few
> millions... it is not good for them. Hope it is not so.

Well, currently the entire TIP is read/scanned/copied in quite a few places. AFAIK,
there is no optimization to analyze only the "interesting" portion.
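For reference, the "plain vector" test amounts to something like the following
sketch, assuming the usual packing of four 2-bit transaction states per TIP byte;
the names are mine, not the engine's:

#include <cstdint>
#include <vector>

enum TraState { tra_active = 0, tra_limbo = 1, tra_dead = 2, tra_committed = 3 };

// Test one transaction's state in a locally copied TIP slice that starts
// at transaction `base` (e.g. the OIT).
TraState tipState(const std::vector<uint8_t>& tip, uint64_t base, uint64_t traId)
{
    const uint64_t rel = traId - base;
    return TraState((tip[rel / 4] >> ((rel % 4) * 2)) & 3);
}

The test itself is certainly cheap; the disputed cost is keeping such a copy
current and consistent while snapshots are being established.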

Even so, your assumption that applications generate only a small number of
transactions is not correct.

Both Red Soft and BroadView applications process on the order of 10 million
transactions daily on each database, sometimes more. There can also be some
long-running operations (e.g. exports or reports). At 10^7 transactions per day, a
gap of 10^6 accumulates in a couple of hours, so an OIT gap of 10^6 represents 1-2
hours of work and is relatively healthy. For example, right now I can see an OIT
gap of 500k on a perfectly healthy, moderately loaded database instance (client "V"
of BroadView).

On some instances a daily sweep is impossible for performance reasons, so it is
performed weekly. You can then get a gap of 40-60 million transactions between the
oldest transaction and Next.

And yes, the database has to be taken offline for a backup-restore once in a while,
or you hit the 2-billion-transaction mark.

>
> I guess that a memmove of 4-8-16KB is much faster than 100-1000 callback calls
> with a search in BePlusTree... Also, note, you pay the additional cost of the
> search in BePlusTree every time you test a transaction state, i.e. for every
> record version accessed by an RC transaction.

I think your assumptions about the comparative overheads of these operations are
incorrect.

To use the TPC for setting up a snapshot, one has to ensure it is current and
consistent while the snapshot is being set up. That is a very expensive proposition
compared to a few simple function calls.

In my code I went to great lengths to avoid even a single page read when setting up
the snapshot. Applications tend to submit a very large number of queries (many
thousands per second), and slow snapshots would kill them.
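For illustration, the snapshot setup described earlier in this thread (one Lock
Manager call enumerating LCK_tra locks and invoking a callback per lock) can be
sketched roughly as follows. The function names are hypothetical stand-ins, not
the actual API, and SparseBitmapSketch is the class from the earlier sketch:

#include <cstdint>

// Hypothetical Lock Manager call: invokes cb(arg, traId) once for every
// currently held LCK_tra lock, without touching any database pages.
void LCK_enumerate_tra(void (*cb)(void*, uint64_t), void* arg);

struct SnapshotSketch {
    uint64_t topTraId;            // highest transaction ID at snapshot start
    SparseBitmapSketch active;    // transactions active at snapshot start
};

static void collectActive(void* arg, uint64_t traId)
{
    static_cast<SnapshotSketch*>(arg)->active.set(traId);
}

SnapshotSketch takeSnapshot(uint64_t nextTraId)
{
    SnapshotSketch s;
    s.topTraId = nextTraId;    // everything comes from the lock table: no page reads
    LCK_enumerate_tra(collectActive, &s);
    return s;
}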

BTW, the AST overhead of obtaining STATEMENT_ID is the major bottleneck in this
situation until you apply the appropriate patch. I can only assume this change
didn't make it into FB3 because the project takes the cost of two context switches
and multiple syscalls per ID cache fetch to be zero. I don't know. Oh, and to add
insult to injury, despite the costly synchronization this counter wraps around a
32-bit integer often enough to be non-unique anyway. :-)

> To make a brief resume - I can't point to an error in your logic (it doesn't
> mean it has no errors),

I share your paranoia and reviewed the code once again. There was a small
inconsistency in the handling of long-running RO/RC transactions. On the one hand,
I was a bit too pessimistic about them and let them hold up GC for too long; on the
other hand, there was one extremely unlikely case where it was still possible for
an RO/RC transaction to observe inconsistent data. Both problems are fixed, with no
additional performance implications.

So there is now a clean implementation of statement-level read consistency; it is
in the Red Soft stable tree and used by default. It will be tested in the field,
and we will see how it performs in practice.

But it still inhibits GC for RO/RC transactions in some cases, which AFAIU is 
considered 
unacceptable by the Firebird project.

I am currently studying the problem of GC of intermediate versions. It seems
doable, with logic along the following lines:

1. To protect snapshots from GC one needs to acquire two kinds of locks:
- a transaction interest lock (one for each transaction active at snapshot start,
lck_key = ID of the active transaction)
- a snapshot lock (one for each snapshot, lck_key = ID of the top transaction at
snapshot start)

The idea is that:
- you cannot GC versions for which there is a transaction interest lock
- at most one record version is needed for each snapshot

2. When we find a version (in VIO_chase_record_version and VIO_garbage_collect), we
may use the following logic to GC it:
- if the version satisfies the conditions for purge, then use that algorithm for GC
- the version is a candidate for "intermediate GC" if it is not current, there is
no interest lock on its transaction, and its forward version is committed and
corresponds to the same snapshot as this version

3. "Intermediate GC" algorithm needs to study the entire version chain, and 
determine which versions 
may go and which need to stay.
Actual GC may use the same idea that is used in update_in_place:

Imagine you have versions "A", "I", "B".  Intermediate version "I" needs to go, 
and "A" needs to stay.
First, convert delta version "A" to full version.
Second, update back-pointers to it in "B".
Third, delete version "I".
Forth, GC blobs and indices as appropriate.

This is the general idea; there are many corner cases which need to be handled as
well. A rough sketch of steps 1-3 follows.
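All names in this sketch (RecordVersionSketch, LCK_interest_exists, and the helper
functions) are hypothetical stand-ins for the actual Firebird interfaces; the logic
is only my hedged reading of the plan above:

#include <cstdint>

// Locks from step 1 (hypothetical wrappers over the Lock Manager):
void acquireInterestLock(uint64_t activeTraId);  // lck_key = ID of active transaction
void acquireSnapshotLock(uint64_t topTraId);     // lck_key = ID of top transaction

struct RecordVersionSketch {
    uint64_t traId;              // transaction that created this version
    bool committed;
    bool isDelta;                // stored as a delta against the newer version
    RecordVersionSketch* back;   // next older version in the chain
};

// Hypothetical helpers:
bool LCK_interest_exists(uint64_t traId);  // is an interest lock held on traId?
bool sameSnapshot(const RecordVersionSketch* v, const RecordVersionSketch* fwd);
void materialize(RecordVersionSketch* delta, const RecordVersionSketch* base);
void releaseVersion(RecordVersionSketch* v);

// Candidacy test from step 2: not current, no interest lock on its transaction,
// and the forward (newer) version is committed and belongs to the same snapshot.
bool intermediateGcCandidate(const RecordVersionSketch* v,
                             const RecordVersionSketch* forward)
{
    return forward != nullptr                 // has a newer version => not current
        && !LCK_interest_exists(v->traId)
        && forward->committed
        && sameSnapshot(v, forward);
}

// Chain rewrite from step 3: remove intermediate version I from B -> I -> A.
void removeIntermediate(RecordVersionSketch* B, RecordVersionSketch* I,
                        RecordVersionSketch* A)
{
    if (A && A->isDelta)
        materialize(A, I);  // first: convert delta version A to a full version
    B->back = A;            // second: update B's back-pointer past I
    releaseVersion(I);      // third: delete version I
    // fourth: GC blobs and indices referenced only by I, as appropriate
}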

The difficult part is how to implement this logic so that it does not hurt the
performance of regular operations. This seems possible, as long as we can extend
the lock manager interface a little (LCK_lock_multiple_nowait,
LCK_release_multiple, etc.).
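The proposed extensions might look roughly like this; the signatures are purely
speculative, mirroring the style of the existing LCK_* calls, and thread_db and
Lock stand for the engine's usual types:

struct thread_db;
struct Lock;

// Try to acquire all locks without waiting; assumed all-or-nothing semantics:
// on any conflict, acquire none and return false.
bool LCK_lock_multiple_nowait(thread_db* tdbb, Lock* const* locks, unsigned count);

// Release a batch of locks in a single lock-table acquisition.
void LCK_release_multiple(thread_db* tdbb, Lock* const* locks, unsigned count);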

This direction will take a bit of time to study, implement, test and profile.

Thanks,
Nikolay

