Hi All,
I have encountered something that I would describe as weird behaviour, although
I must admit that I'm not an expert on Firebird and cannot say whether this
should be expected or not. So... first some basic info:
We have a system which relies on Firebird for persistent storage. The system is
fairly mature; however, new functions are added as needed.
The system uses Firebird 2.5.4 SuperServer, IBProvider v2 and MDAC 2.7. However,
when researching the issue I updated our reference system to Firebird 2.5.6,
since source code and PDBs were available for that version. The issue still
occurs with 2.5.6. We are using WAIT transactions (i.e. the default
transaction mode).
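For clarity, by WAIT transactions I mean the default lock-resolution mode. A
sketch of the available modes in Firebird SQL (illustration only, not our
actual code; the timeout value is made up):

```sql
-- Default mode: block until the conflicting transaction ends.
SET TRANSACTION WAIT;

-- Alternative: fail immediately with a lock conflict error instead of blocking.
SET TRANSACTION NO WAIT;

-- Firebird 2.0+: block for at most 10 seconds, then report the conflict.
SET TRANSACTION WAIT LOCK TIMEOUT 10;
```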
A thread is trying to perform a SQL SELECT because a client wants to display
detailed information for a certain row.
Basically: SELECT * FROM TABLE WHERE PKEY = PKEY
The thread acquires a lock in our server, then works its way through GDS32.dll
(fbclient.dll) and finally times out on the call to the Winsock2 select()
function (I guess it retries on failure, because the call never returns).
MON$STATEMENTS shows a stalled SQL statement corresponding to the SQL statement
above.
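In case it helps, this is roughly the monitoring query I use to see the stalled
statement (a sketch against the Firebird 2.5 monitoring tables; as I understand
it, MON$STATE in 2.5 is 0 = idle, 1 = active, 2 = stalled):

```sql
-- Find non-idle statements and which attachment they belong to.
SELECT s.MON$STATEMENT_ID,
       s.MON$STATE,
       s.MON$SQL_TEXT,
       a.MON$REMOTE_ADDRESS
FROM MON$STATEMENTS s
JOIN MON$ATTACHMENTS a ON a.MON$ATTACHMENT_ID = s.MON$ATTACHMENT_ID
WHERE s.MON$STATE <> 0;
```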
So it all boils down to this: is the behavior due to bad transaction management
on our part, or is it an issue with Firebird?
Background + additional info:
I have been looking into an issue reported by one customer where the system
seems to deadlock when working with a specific table. In the configuration in
question the system has four clients operating against a server which in turn
uses Firebird.
Now, the clients basically display the table as a grid. The user can "open" a
row in order to display more detailed information. When this happens the server
changes the state of the row to open (we have a column in the database for
this). The user then chooses to perform an operation on the row (changing the
same column) and closes the detailed view, which again changes the status of
the row. For each status change the clients reload the row in order to display
updated information.
Everything seems to work OK when working from only one client. When working
with two clients at a moderate pace things are also OK, but at a high (to
frantic) pace a deadlock will occur within 15 minutes, which might suggest some
kind of race condition. The thing is that the status changes mentioned in the
workflow above can be done from the keyboard using a command sequence similar
to [Ctrl+O], [Ctrl+O], [Enter], allowing the user to change the status three
times in less than a second. And when I write "high pace", that's what I mean:
in order to reproduce the issue at the office I need two users hammering away
at their keyboards for 15 minutes.
Please note that the table in this case contains less than 1 rows, sometimes
as few as 1000 rows. However, at the customer's site the problem is easier to
reproduce. There the issue seems to occur with one user working at a high pace
and one doing the workflow at, say, 15 seconds per row. The hardware is the
same in both cases (model, manufacturer, etc.) and the network should have the
same topology; the users are different, though, so maybe that is why it is
harder to reproduce at the office.
Since this seemed to be a deadlock which I presumed occurred within the scope
of our server, I created a memory dump. It showed a deadlock where one
server-side thread had acquired a lock in our server and then seemed to be
stuck on the Winsock call select(), but I could not get a decent stack trace
due to missing symbol files. I have since changed Firebird to version 2.5.6,
downloaded symbol files, and inserted some very basic trace logging around the
suspected culprit (the select call). I have also set the network timeout
values in firebird.conf, since a code review of fbclient showed that these
might be used as the timeout parameter for the Winsock2 select call.
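For reference, the parameters I mean are these (firebird.conf fragment; the
values are placeholders for illustration, and the idea that fbclient feeds
them into select() is my reading of the 2.5 source, not documented behaviour):

```conf
# Seconds to wait before abandoning an unsuccessful connection attempt.
ConnectionTimeout = 30

# Seconds between keep-alive packets on an idle connection (0 = disabled).
DummyPacketInterval = 60
```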
I have also added some code to make sure a blocking call to select doesn't
occur (i.e. always set the timeout if it is not set). However, the timeout
used when reproducing the issue comes from the firebird.conf file. After these
modifications I have deduced the following:
Info from the memory dump, my trace logs, etc. is available upon request.
Thank you in advance!
/John Karlsson