Hi Alan,

I wholeheartedly agree. Consider the case where it cannot find a specific DLR while everything else (DB, etc.) is fine: it shouldn't panic. Not panicking is the normal behaviour with the other DB drivers when people have the wrong msg-id-type, when a DLR entry is lost upon bb restart, or when the SMSC sends the DLR before the submit_sm_resp carrying the DLR id.

The statement:

if (result == NULL...)
{
   while(row = gwlist_extract_first(result)) -> lock(result)

doesn't make any sense at all. The sooner it is replaced, the better.
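
To make the point concrete, here is a minimal sketch of the guard I have in mind. It is not the attached patch and it does not use the real gwlib types: RowList, run_select(), extract_first() and find_dlr() are hypothetical stand-ins invented for illustration. The only thing mirrored from the thread is that the extraction call asserts on a NULL list, so the caller has to return early instead of falling through to it.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a gwlib List; NOT the real Kannel type. */
typedef struct {
    const char **items;
    size_t len;
    size_t pos;
} RowList;

/* Stand-in for a query helper: returns NULL when the SELECT fails. */
static RowList *run_select(int db_ok)
{
    static const char *rows[] = { "dlr-row-1" };
    RowList *l;

    if (!db_ok)
        return NULL;
    l = malloc(sizeof(*l));
    if (l == NULL)
        return NULL;
    l->items = rows;
    l->len = 1;
    l->pos = 0;
    return l;
}

/* Like the call in the backtrace, this asserts on a NULL list. */
static const char *extract_first(RowList *l)
{
    assert(l != NULL);
    if (l->pos >= l->len)
        return NULL;
    return l->items[l->pos++];
}

/* Returns the first row, or NULL if the query failed or had no rows. */
static const char *find_dlr(int db_ok)
{
    RowList *result = run_select(db_ok);
    const char *row;

    if (result == NULL) {
        /* Log and report "not found"; never touch the NULL list. */
        fprintf(stderr, "select failed, treating DLR as not found\n");
        return NULL;
    }
    row = extract_first(result);
    free(result);
    return row;
}
```

With this shape, find_dlr(0) (the failed select) simply returns NULL and the caller logs a "not found" warning, while find_dlr(1) returns the row as before.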

+1

BR,
Nikos
----- Original Message -----
From: "Alan McNatty" <a...@catalyst.net.nz>
To: "Nikos Balkanas" <nbalka...@gmail.com>
Cc: <devel@kannel.org>
Sent: Wednesday, August 10, 2011 12:08 PM
Subject: Re: panic inducing use of gwlist_extract_first in dlr_pgsql.c


Hi Nikos,

So do you agree that we should avoid panicking as a result of a
temporary situation (as outlined with the DB connection dropping)? That is,
is the patch good?

Cheers,
Alan

On Wed, 2011-08-10 at 11:03 +0300, Nikos Balkanas wrote:
That is a well-known behaviour. Bb crashes and stops responding to the
heartbeats that smsbox sends. As a result, smsbox logs "bearerbox
gone, shutting down" and shuts down. The parent bb process should
handle heartbeats, not the child.



HTH,
Nikos

On Wed, Aug 10, 2011 at 9:56 AM, Alan McNatty <a...@catalyst.net.nz>
wrote:
        Also note: a side effect of the current behaviour (panic when the DB
        is temporarily unavailable) when --parachute is used at start-up is
        that smsbox will be shut down as a result of the panic, but bearerbox
        will come back online when the DB is available again (as --parachute
        keeps it alive). The result is a running bearerbox without any
        smsbox(es) attached.

        On Wed, 2011-08-10 at 16:51 +1200, Alan McNatty wrote:
        > Hi All,
        >
        > I'm finding what I think is incorrect use of gwlist_extract_first in
        > the postgres DLR implementation (it may also exist in others - I've
        > not checked yet). The DLR methods issue 'error's when they fail to
        > return results, etc., but subsequent calls to gwlist_extract_first on
        > NULL lists cause 'panic's.
        >
        > What I'm testing is the situation where the DLR DB is available on
        > start-up (we panic if it is not) but, during normal operation, the
        > database is shut down or temporarily unavailable (network issue, etc.).
        > The failed select is an error but results in a panic.
        >
        > 2011-08-10 16:37:43 [18552] [3] ERROR: PGSQL: SELECT count(*) FROM "dlr";
        > 2011-08-10 16:37:43 [18552] [3] ERROR: PGSQL: FATAL:  terminating connection due to administrator command
        > server closed the connection unexpectedly
        >       This probably means the server terminated abnormally
        >       before or while processing the request.
        >
        > 2011-08-10 16:37:43 [18552] [3] ERROR: PGSQL: Select failed!
        > 2011-08-10 16:37:43 [18552] [3] ERROR: PGSQL: Could not get count of DLR table
        > 2011-08-10 16:37:43 [18552] [3] PANIC: gwlib/list.c:309: gwlist_extract_first: Assertion `list != NULL' failed.
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox(gw_panic+0x14b) [0x48b55b]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox(gwlist_extract_first+0x94) [0x489874]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox [0x41e3d3]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox(bb_print_status+0x11d) [0x40edfd]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox [0x415075]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /usr/sbin/bearerbox [0x4823cf]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /lib/libpthread.so.0 [0x2b0e670a9fc7]
        > 2011-08-10 16:37:43 [18552] [3] PANIC: /lib/libc.so.6(clone+0x6d) [0x2b0e67a8664d]
        >
        > The attached patch addresses this (for the postgres implementation
        > only - I can check the others if required). Once applied, the result
        > on the status page is ..
        >
        > DLR: -1 queued, using pgsql storage
        >
        > And when a DLR is received ...
        >
        > 2011-08-10 16:44:53 [18889] [11] ERROR: PGSQL: FATAL:  terminating connection due to administrator command
        > server closed the connection unexpectedly
        >       This probably means the server terminated abnormally
        >       before or while processing the request.
        >
        > 2011-08-10 16:44:53 [18889] [11] ERROR: PGSQL: Select failed!
        > 2011-08-10 16:44:53 [18889] [11] DEBUG: no rows found
        > 2011-08-10 16:44:53 [18889] [11] WARNING: DLR[pgsql]: DLR from SMSC<FOO> for DST<02xxxxxxxxx> not found.
        > 2011-08-10 16:44:53 [18889] [11] ERROR: SMPP[FOO]: got DLR but could not find message or was not interested in it id<534001841355> dst<02xxxxxxxxx>, type<1>
        >
        > Cheers,
        > Alan
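
For readers without the attachment, the status-page result quoted above ("DLR: -1 queued") can be illustrated with a rough sketch. This is not the patch itself and it does not call the real Kannel or libpq APIs: select_count(), dlr_messages() and format_status() are hypothetical names, and the value "42" is made-up sample data. It only shows the idea of returning -1 as an "unknown" count when the SELECT fails, rather than extracting from a NULL result.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for the count query: returns the count as
 * text, or NULL when the SELECT failed (e.g. connection dropped). */
static const char *select_count(int db_ok)
{
    return db_ok ? "42" : NULL;   /* "42" is made-up sample data */
}

/* Returns the number of queued DLRs, or -1 when it cannot be read. */
static long dlr_messages(int db_ok)
{
    const char *text = select_count(db_ok);

    if (text == NULL) {
        fprintf(stderr, "could not get count of DLR table\n");
        return -1;                /* report "unknown", do not panic */
    }
    return strtol(text, NULL, 10);
}

/* Formats a status line in the style shown in the thread. */
static void format_status(char *buf, size_t size, int db_ok)
{
    assert(buf != NULL);
    snprintf(buf, size, "DLR: %ld queued, using pgsql storage",
             dlr_messages(db_ok));
}
```

A caller that renders the status page would call format_status() with a local buffer; when the database is unavailable the line carries -1 and the process keeps running.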








