Hi Henri,

I have read the other posts to this thread, and it does sound like (at
least) a bug in the Sybase driver.  Now that I know that the Sybase
client has an internal timeout feature, I am more suspicious that we are
running into multiple handlers on SIGALRM (one of them in the Sybase
code).  Although, quickly scanning the strace output, they could have
implemented that with select().  (Now... I can't remember if select
itself uses alarms... though I did not think it did.)
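
For what it is worth, here is a quick way to check the select() question
locally.  It is only a sketch using core Perl: it installs a counting
SIGALRM handler, sleeps through the timeout argument of 4-argument
select(), and reports whether the handler ever ran.  The variable names
are mine and nothing here touches Sybase.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Count any SIGALRM that arrives while we wait.
my $alarms = 0;
$SIG{ALRM} = sub { $alarms++ };

# Sleep for 250ms using only the timeout argument of select().
my $start = time;
select(undef, undef, undef, 0.25);
my $elapsed = time - $start;

printf "elapsed %.2fs, SIGALRM delivered %d time(s)\n", $elapsed, $alarms;
```

If the count stays at zero, a timeout built on select() would not by
itself compete with our alarm handler.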

On Wed, 2004-11-24 at 22:36 -0800, Henri Asseily wrote:
> On Nov 24, 2004, at 10:14 PM, Lincoln A. Baxter wrote:
> 
> > Hi Henri,
> >
> > I have some questions/avenues for you to pursue:
> >
> > 1) What happens when you change safe=>1 to safe=>0 in this code?
> 
> You end up getting the same as the standard $SIG{ALRM} behavior, i.e. 
> the alarm never triggers.
> 

Hmmm, I think I want to take a closer look at that...  I do suspect that
we are running into an issue with Sybase signal handling in addition to
other things.  But, I want to do a little testing of Sys::SigAction's
safe flag in this case.  Can you construct a script that does this that I
might be able to try against our DBD-Oracle? (and send me your latest HA
module .. if it is needed and it is not on CPAN).
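
Something along these lines is what I have in mind.  It is a minimal
sketch, not your HA code: it uses the core POSIX::sigaction interface
(which Sys::SigAction wraps) so that it runs without extra modules, and
it uses a long select() as a stand-in for the blocking database call.
The names timed_call and $blocking are made up for illustration; swap in
a closure around your own execute to try it against a real driver.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Run $code with a $secs-second alarm installed through POSIX::sigaction.
# $safe = 1 asks for deferred ("safe") delivery, 0 for immediate delivery.
# Returns 1 if $code completed, 0 if the alarm fired first.
sub timed_call {
    my ($secs, $safe, $code) = @_;

    my $action = POSIX::SigAction->new(
        sub { die "timeout\n" },
        POSIX::SigSet->new(SIGALRM),
    );
    $action->safe($safe);

    my $old = POSIX::SigAction->new;
    POSIX::sigaction(SIGALRM, $action, $old)
        or die "sigaction failed: $!";

    my $ok  = eval { alarm($secs); $code->(); alarm(0); 1 };
    my $err = $@;
    alarm(0);                              # never leave a timer pending
    POSIX::sigaction(SIGALRM, $old);       # put the previous handler back

    return 1 if $ok;
    die $err unless $err eq "timeout\n";   # rethrow anything unexpected
    return 0;
}

# Stand-in for the blocking call; replace with e.g. sub { $sth->execute }.
my $blocking = sub { select(undef, undef, undef, 5) };

for my $safe (0, 1) {
    my $done = timed_call(1, $safe, $blocking);
    print "safe=>$safe: ", ($done ? "completed" : "timed out"), "\n";
}
```

With a pure-Perl stand-in both settings should time out; the interesting
case is whether they still do when the process is blocked inside the
driver's C library.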

> >
> > 2) What happens if you close the entire dbh at this point (reopen it
> > later)?  -- its a thought?
> 
> I don't know, but I certainly do not want that (which is why I didn't 
> try it). The concept is to do the execute with a timeout. If the 
> timeout triggers, retry a "select 1". If that fails, then we assume the 
> db is dead and switch to another one. If it succeeds, then either the 
> statement is wrong or the database is overloaded, and I still have to 
> determine the correct course of action. But switching to another 
> database server automatically is not correct.

Tim recommended cleaning up the entire dbh in another message, and I
would too, even after the DBD-Sybase bug is fixed. Even with the safe
flag we have to assume that signals are inherently unsafe.  That is how
we handle all database timeouts in code we have written.

I think that doing a "select 1" after a timeout should really be
revisited.  What are you going to do if that succeeds, do the original
execute again? What if that hangs again? Are you keeping a counter?  If
so, I think you are headed down the wrong path.  I think you should
immediately give up, and clean up.  All the comments about possible
corruption (using signals) that Tim made notwithstanding, if you have
timed out a database operation, it is probably because the operation is
flawed in its design, or because the DB is sick or way too busy.
Anything else you do (other than closing it) has the potential to make
it worse (even select 1).  Closing the database connection is about the
only safe thing you can do on the client side that _MIGHT_ make
it better -- primarily because you would give the DB engine a chance to
reclaim some resources, and heal itself.
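
To make the "give up and clean up" idea concrete, here is a rough sketch
of the shape I mean.  The function name execute_or_give_up is
hypothetical, and the only thing it assumes about $dbh is that it has a
disconnect() method, as a DBI database handle does.  The alarm is
installed with core POSIX::sigaction using immediate delivery, which is
an assumption on my part about what your driver needs.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Run $work->($dbh) under a $secs-second alarm.  If it times out or dies,
# disconnect $dbh and return 0: no "select 1" probe, no retry, no counter.
sub execute_or_give_up {
    my ($dbh, $secs, $work) = @_;

    # Immediate ("unsafe") delivery is the default for POSIX::SigAction.
    my $action = POSIX::SigAction->new(sub { die "timeout\n" });
    my $old    = POSIX::SigAction->new;
    POSIX::sigaction(SIGALRM, $action, $old)
        or die "sigaction failed: $!";

    my $ok = eval { alarm($secs); $work->($dbh); alarm(0); 1 };
    alarm(0);                           # never leave a timer pending
    POSIX::sigaction(SIGALRM, $old);    # put the previous handler back

    return 1 if $ok;

    eval { $dbh->disconnect };          # best effort; handle may be unusable
    return 0;
}
```

A caller would pass its real handle and a closure around the statement's
execute, then reconnect later if the function returns 0.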

Lincoln

