I have slammed into a wall in my quest for reliable failover and high availability in DBI. I don't know if this discussion should be in dbi-users or dbi-dev, but here goes:

High availability requires a good timeout handling system. If execution of an SQL statement or stored procedure takes too long, one should have the opportunity to kill it and fail over to a less overloaded server.

One problem is timeout handling in Perl (and Unix in general). The standard $SIG{ALRM} technique fails when used to trap $sth->execute(): the handler never gets triggered while the call is in progress.
That problem has now been resolved (at least on Unix machines) thanks to Lincoln Baxter's excellent Sys::SigAction module, which uses the necessary techniques (POSIX sigaction, SIGALRM...) to ensure proper signal handling.
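
For readers who have not seen the technique, here is a minimal sketch of a timeout built directly on POSIX sigaction, using only the core POSIX module. It is not Sys::SigAction's actual code; run_with_timeout is a hypothetical helper name, and the sketch assumes a Unix platform where the handler may die out of a blocking call.

```perl
use strict;
use warnings;
use POSIX qw(SIGALRM);

# Hypothetical helper (an assumption, not part of any module):
# run $code and abort it if it takes longer than $secs seconds.
# Returns ($timed_out, $result).
sub run_with_timeout {
    my ($secs, $code) = @_;
    my $timed_out = 0;

    my $old = POSIX::SigAction->new;    # receives the previous handler
    my $new = POSIX::SigAction->new(
        sub { $timed_out = 1; die "TIMEOUT\n" },
        POSIX::SigSet->new(SIGALRM),    # block ALRM while the handler runs
    );
    $new->safe(0);    # immediate delivery, not Perl's deferred signals
    POSIX::sigaction(SIGALRM, $new, $old)
        or die "sigaction failed: $!";

    my $result = eval {
        alarm($secs);
        my $r = $code->();
        alarm(0);                       # finished in time
        $r;
    };
    my $err = $@;
    alarm(0);                           # never leave an alarm pending
    POSIX::sigaction(SIGALRM, $old);    # restore the previous handler

    die $err if $err && !$timed_out;    # rethrow unrelated errors
    return ($timed_out, $result);
}
```

A call such as run_with_timeout(10, sub { $sth->execute }) would then wrap the statement execution; whether the handler can actually interrupt a given driver's blocking call depends on the driver and platform.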


But there's another, more subtle problem that I only today managed to get to the bottom of:

Assuming you use Sys::SigAction and you properly trap the execute() call, you get nailed by DBI's aggressive sanity checking.

Suppose you have code like the following (copied from my upcoming DBIx::HA 0.9x module):

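use Sys::SigAction qw(set_sig_handler);

my $timeout = 0;
my $res;
eval {
   # install a temporary ALRM handler; the old one is restored when $h goes out of scope
   my $h = set_sig_handler(
            'ALRM',
            sub { $timeout = 1; die 'TIMEOUT'; },
            { mask=>['ALRM'],
            safe=>1 }
          );
   alarm(10);                     # give execute() 10 seconds
   $res = $sth->SUPER::execute;   # runs inside the module's statement handle subclass
   alarm(0);                      # finished in time: cancel the alarm
};
alarm(0);   # make sure no alarm stays pending if the eval died for another reason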


If the alarm is triggered, then your statement handle ($sth) gets automatically corrupted with no way to get rid of it. This in turn will continuously add active kids to your database handle and corrupt everything.
Below is the result of triggering the above alarm:


null: (in cleanup) dbih_setup_fbav: invalid number of fields: -1, NUM_OF_FIELDS attribute probably not set right at ....

 null: DBI handle 0xabf1038 cleared whilst still active at ...

null: DBI handle 0xabf1038 has uncleared implementors data at ...
    dbih_clearcom (sth 0xabf1038, com 0xaeb79b8, imp DBD::Sybase::st):
       FLAGS 0x180057: COMSET IMPSET Active Warn ChopBlanks PrintWarn
       PARENT DBIx::HA::db=HASH(0xa21e008)
       KIDS 0 (0 Active)
       IMP_DATA undef
       LongReadLen 32768
       NUM_OF_FIELDS -1
       NUM_OF_PARAMS 0


The statement handle was created but never populated with the execution results, so it's in a weird half-alive state.
For example, DBIc_NUM_FIELDS is -1, which makes dbih_setup_fbav() croak. Similarly, DBIc_ACTIVE is still true.
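
To watch this state from Perl rather than from the C macros, something like the following sketch could be called after a timeout. It only reads the documented DBI attributes Active, Kids and ActiveKids; handle_state is a hypothetical helper name, and the code treats the handles as plain hash-like references, so it makes no claim about what DBI itself does with a half-executed handle.

```perl
use strict;
use warnings;

# Hypothetical diagnostic helper (an assumption, not part of DBI or DBIx::HA):
# summarize the state of a database handle and one of its statement handles.
sub handle_state {
    my ($dbh, $sth) = @_;
    return {
        sth_active  => $sth->{Active} ? 1 : 0,    # still flagged Active?
        kids        => $dbh->{Kids}       // 0,   # statement handles alive
        active_kids => $dbh->{ActiveKids} // 0,   # of those, how many Active
    };
}

# Render the summary as a single line suitable for a log.
sub format_handle_state {
    my ($state) = @_;
    return sprintf 'sth active=%d, kids=%d, active kids=%d',
        @{$state}{qw(sth_active kids active_kids)};
}
```

Logging format_handle_state(handle_state($dbh, $sth)) after each timed-out execute should show whether the active kid count keeps climbing.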


Should a handle have an additional field that tells us when it is not in a fully active state, so that in that case we have carte blanche to wipe it?
What's the best strategy to deal with these zombies?


I can provide a patch when I dig deeper.

H
