On Wed, Nov 24, 2004 at 04:52:08PM -0800, Henri Asseily wrote:
> I have slammed into a wall in my quest for reliable failover and high 
> availability in DBI. I don't know if this discussion should be in 
> dbi-users or dbi-dev, but here goes:

It's a user problem, albeit an advanced one, so dbi-users is best.

> High availability necessitates a good timeout handling system. If 
> execution of an sql statement or stored procedure takes too long, one 
> should have the opportunity to kill it and fail over to a less 
> overloaded server.
> 
> One problem is in the timeout handling in Perl (and Unix in general). 
> The standard $SIG{ALRM} technique utterly fails when trying to trap 
> $sth->execute(), and never gets triggered.
> That problem has now been resolved thanks to Lincoln Baxter's excellent 
> Sys::SigAction module (at least for Unix machines) which utilizes all 
> the techniques (POSIX sigaction, SIGALRM...) to ensure proper signal 
> handling.
> 
> But there's another more subtle problem that I only today finally 
> managed to get to the bottom of:
> 
> Assuming you use Sys::SigAction and you properly trap the execute() 
> call, you get nailed by DBI's aggressive sanity checking.

You're jumping to conclusions. The "aggressive sanity checking" is
helping you here.

> Suppose you have code like the following (copied from my upcoming 
> DBIx::HA 0.9x module):
> 
> eval {
>    my $h = set_sig_handler(
>             'ALRM',
>             sub { $timeout = 1; die 'TIMEOUT'; },
>             { mask=>['ALRM'],
>             safe=>1 }
>           );
>    alarm(10);
>    $res = $sth->SUPER::execute;
>    alarm(0);
> };
> alarm(0);
> 
> 
> If the alarm is triggered, then your statement handle ($sth) gets 
> automatically corrupted with no way to get rid of it. This in turn will 
> continuously add active kids to your database handle and corrupt everything.

http://search.cpan.org/~timb/DBI/DBI.pm#Signal_Handling_and_Canceling_Operations
explains that non-deferred signal handling is fundamentally unsafe
and that "the handles in use at the time the signal is handled may
not be safe to use afterwards".

These are not limitations of the DBI. They are fundamental to signal
handling in any language.


> Below is the result of triggering the above alarm:
> 
>  null:        (in cleanup) dbih_setup_fbav: invalid number of fields: 
> -1, NUM_OF_FIELDS attribute probably not set right at ....

dbih_setup_fbav is called only when you bind columns or fetch a row.
You shouldn't be doing either of those if the execute() failed.
Using trace() should help track down what's happening.

> null: DBI handle 0xabf1038 cleared whilst still active at ...

I'd guess that the driver is marking the handle Active before
it waits for results from the database. (Or perhaps the execute
is a re-execute() of an already Active handle.) Need a trace().

> null: DBI handle 0xabf1038 has uncleared implementors data at ...

Looks like a bug in DBD::Sybase. When syb_st_destroy is called during
global destruction it does nothing. It should *at least* call
DBIc_IMPSET_off(imp_sth); That tells the DBI that the driver has
properly cleaned up, then the DBI won't complain about it.

> The statement handle was created but was never populated with the 
> execution results, so it's in a weird half-alive state.
> For example, the DBIc_NUM_FIELDS is -1, which makes dbih_setup_fbav() 
> croak. Similarly, DBIc_ACTIVE is still true.

The fact that dbih_setup_fbav is called seems most likely to be a
bug in your code. The fact that Active is set seems most likely to
be a bug in DBD::Sybase.

> What's the best strategy to deal with these zombies?

Fix the bugs :)

Even when you've done that there's still a remaining issue: "handles
in use at the time the signal is handled may not be safe to use
afterwards".

That's actually not worded strongly enough. If the signal fired
while using a statement handle then the corresponding database
handle may not be safe to use. (Not because of any problem in
the DBI that I'm aware of, but because the database client code
and/or driver may have been left in a corrupted state.)

For reliability you'll need to destroy the database handle and
reconnect.

Tim.

Reply via email to