Hi Henri,

I have read the other posts to this thread, and it does sound like (at least) a bug in the Sybase driver. Now that I know that the Sybase client has an internal timeout feature, I am more suspicious that we are running into multiple handlers on SIGALRM (one of them in the Sybase code). Although, from quickly scanning the strace output, they could have implemented that with select(). (Now... I can't remember if select itself uses alarms, though I did not think it did.)
On Wed, 2004-11-24 at 22:36 -0800, Henri Asseily wrote:
> On Nov 24, 2004, at 10:14 PM, Lincoln A. Baxter wrote:
>
> > Hi Henri,
> >
> > I have some questions/avenues for you to pursue:
> >
> > 1) What happens when you change safe=>1 to safe=>0 in this code?
>
> You end up getting the same as the standard $SIG{ALRM} behavior, i.e.
> the alarm never triggers.

Hmmm, I think I want to take a closer look at that... I do suspect that we are running into an issue with Sybase signal handling in addition to other things. But I want to do a little testing of Sys::SigAction's safe flag in this case. Can you construct a script that does this that I might be able to try against DBD-Oracle? (And send me your latest HA module, if it is needed and it is not on CPAN.)

> > 2) What happens if you close the entire dbh at this point (reopen it
> > later)? -- its a thought?
>
> I don't know, but I certainly do not want that (which is why I didn't
> try it). The concept is to do the execute with a timeout. If the
> timeout triggers, retry a "select 1". If that fails, then we assume the
> db is dead and switch to another one. If it succeeds, then either the
> statement is wrong or the database is overloaded, and I still have to
> determine the correct course of action. But switching to another
> database server automatically is not correct.

Tim recommended cleaning up the entire dbh in another message, and I would too, even after the DBD-Sybase bug is fixed. Even with the safe flag we have to assume that signals are inherently unsafe. Closing the handle is how we deal with all database timeouts in code we have written.

I think that doing a "select 1" after a timeout should really be revisited. What are you going to do if that succeeds? Do the original execute again? What if that hangs again? Are you keeping a counter? If so, I think you are headed down the wrong path. I think you should immediately give up and clean up.
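To make the "give up and clean up" idea concrete, here is a minimal sketch, not something I have run against Sybase. It assumes Sys::SigAction is installed; run_with_timeout is a hypothetical helper name, and the sleep only stands in for an execute that hangs. Whether safe=>1 or safe=>0 is the right setting for a given driver is exactly the open question above.

```perl
use strict;
use warnings;
use Sys::SigAction qw( set_sig_handler );

# Hypothetical helper: run $work->() under an alarm of $secs seconds.
# Returns 1 if the work completed, 0 if it timed out.
# Errors other than the timeout are rethrown to the caller.
sub run_with_timeout {
    my ( $secs, $work ) = @_;
    my $timed_out = 0;
    eval {
        # The previous handler is restored when $h goes out of scope.
        # safe => 1 is the setting discussed in this thread.
        my $h = set_sig_handler(
            'ALRM',
            sub { $timed_out = 1; die "timeout\n"; },
            { mask => ['ALRM'], safe => 1 }
        );
        alarm($secs);
        $work->();
        alarm(0);
    };
    alarm(0);    # make sure no alarm is left pending
    die $@ if $@ && !$timed_out;
    return $timed_out ? 0 : 1;
}

# Stand-in for a hung database call: sleeps longer than the timeout.
my $ok = run_with_timeout( 1, sub { sleep 5 } );
print $ok ? "completed\n" : "timed out: disconnect and give up\n";
```

In the DBI case the work would be something like sub { $sth->execute }, and on a false return you would call $dbh->disconnect and reconnect later, with no "select 1" and no retry on the old handle.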
All the comments about possible corruption (using signals) that Tim made notwithstanding, if you have timed out a database operation, it is probably because the operation is flawed in its design, or because the DB is sick or way too busy. Anything else you do (other than closing it) has the potential to make it worse (even select 1). Closing the database connection is about the only safe thing you can do on the client side that _MIGHT_ make it better -- primarily because you would give the DB engine a chance to reclaim some resources and heal itself.

Lincoln