Further to this, we've had a few more incidents this past week which
shed a little more light on the problem:
Sometimes, a transmit/receive bind to a carrier (this has happened
with 2 different carriers now) goes strange: MTs will start queuing
on Kannel, seemingly unable to be delivered, but MOs are processed
fine. However, my initial investigation with ethereal doesn't show
any submit_sm attempts for the queued MTs. Also, I don't see any
problems with enquire_links or any (failed) attempts to rebind (e.g.
the transmit bind having problems while the receive bind is ok) -
Kannel indicates that the link is online, and there are no errors
specific to that link in bearerbox.log
If I try to use the HTTP stop-smsc command to stop the individual
bind, Kannel appears to hang - the admin webpage doesn't respond, and
Kannel seems to stop processing MOs and MTs. At this point, the only
thing I can do is restart Kannel via its init script. Restarting
Kannel fixes the original problem, and queued MTs are delivered.
However, shutting down Kannel in this 'hung' state seems to trigger a
crash - here are some excerpts from bearerbox.log from the last few
incidents:
...
2006-08-26 01:17:18 [14077] [3] INFO: HTTP: Shutting down smsc-id
`hutchthree'
2006-08-26 01:19:57 [14077] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-26 01:19:57 [14077] [29] INFO: Connection closed by the box
<127.0.0.1>
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [4] WARNING: smsbox_list empty!
2006-08-26 01:19:57 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:58 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:19:59 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:00 [14077] [15] WARNING: smsbox_list empty!
2006-08-26 01:20:01 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] WARNING: smsbox_list empty!
2006-08-26 01:20:02 [14077] [26] PANIC: gwlib/gw-rwlock.c:137:
gw_rwlock_rdlock: Assertion `lock != NULL' failed.
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
(gw_panic+0xa3) [0x80b134b]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
(gw_rwlock_rdlock+0x63) [0x80a8a6b]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
(route_incoming_to_boxc+0x5c) [0x8055208]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
(bb_smscconn_receive+0xe5) [0x80577b1]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807a44c]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807b10b]
2006-08-26 01:20:02 [14077] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x80a906e]
2006-08-26 01:20:02 [14077] [26] PANIC: /lib/tls/libpthread.so.0
[0xa1b1d5]
2006-08-26 01:20:02 [14077] [26] PANIC: /lib/tls/libc.so.6(__clone
+0x5a) [0x7782da]
2006-08-26 01:20:18 [11540] [0] INFO: Added logfile `/home/kannel/log/
bearer.log' with level `1'.
2006-08-26 01:20:18 [11540] [0] INFO: Started access logfile `/home/
kannel/log/kannel_sms_traffic.log'.
2006-08-26 01:20:18 [11540] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-26 01:20:18 [11540] [0] INFO: HTTP: Opening server at port
13000.
2006-08-26 01:20:18 [11540] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
...
2006-08-26 01:21:37 [11540] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-26 01:21:39 [11540] [4] WARNING: smsbox_list empty!
2006-08-26 01:21:39 [11540] [0] PANIC: Cannot die by its own will
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(gw_panic
+0xa3) [0x80b134b]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050b4e]
2006-08-26 01:21:39 [11540] [0] PANIC: /lib/tls/libpthread.so.0
[0xa218a8]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x807bbd8]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(smsc2_shutdown
+0x41) [0x8056e0d]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(bb_shutdown
+0x9e) [0x8050e56]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox(main+0x892)
[0x805175a]
2006-08-26 01:21:39 [11540] [0] PANIC: /lib/tls/libc.so.6
(__libc_start_main+0xe3) [0x6c7e33]
2006-08-26 01:21:39 [11540] [0] PANIC: ./sbin/bearerbox [0x8050a3d]
2006-08-26 01:21:51 [12104] [0] INFO: Added logfile `/home/kannel/log/
bearer.log' with level `1'.
2006-08-26 01:21:51 [12104] [0] INFO: Started access logfile `/home/
kannel/log/kannel_sms_traffic.log'.
2006-08-26 01:21:51 [12104] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-26 01:21:51 [12104] [0] INFO: HTTP: Opening server at port
13000.
2006-08-26 01:21:51 [12104] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
...
2006-08-28 14:02:58 [19145] [3] INFO: HTTP: Shutting down smsc-id
`vodafone000'
2006-08-28 14:03:34 [19145] [29] INFO: Connection closed by the box
<127.0.0.1>
2006-08-28 14:03:34 [19145] [0] WARNING: Killing signal or HTTP admin
command received, shutting down...
2006-08-28 14:03:35 [19145] [13] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [4] WARNING: smsbox_list empty!
2006-08-28 14:03:35 [19145] [26] PANIC: gwlib/gw-rwlock.c:137:
gw_rwlock_rdlock: Assertion `lock != NULL' failed.
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
(gw_panic+0xa3) [0x80b134b]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
(gw_rwlock_rdlock+0x63) [0x80a8a6b]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
(route_incoming_to_boxc+0x5c) [0x8055208]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
(bb_smscconn_receive+0xe5) [0x80577b1]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807a44c]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x807b10b]
2006-08-28 14:03:35 [19145] [26] PANIC: /opt/kannel/sbin/bearerbox
[0x80a906e]
2006-08-28 14:03:35 [19145] [26] PANIC: /lib/tls/libpthread.so.0
[0xa1b1d5]
2006-08-28 14:03:35 [19145] [26] PANIC: /lib/tls/libc.so.6(__clone
+0x5a) [0x7782da]
2006-08-28 14:03:40 [6758] [0] INFO: Added logfile `/home/kannel/log/
bearer.log' with level `1'.
2006-08-28 14:03:40 [6758] [0] INFO: Started access logfile `/home/
kannel/log/kannel_sms_traffic.log'.
2006-08-28 14:03:40 [6758] [0] INFO: SSL not supported, no SSL
initialization done.
2006-08-28 14:03:40 [6758] [0] INFO: HTTP: Opening server at port 13000.
2006-08-28 14:03:40 [6758] [0] INFO: BOXC: 'smsbox-max-pending' not
set, using default (100).
...
This is new behaviour - we haven't seen anything like this since we
started using Kannel (~2 years ago).
The only thing that has changed recently (and even then, it changed a
couple of months ago, and this problem only started a couple of weeks
ago at most) is that we're now using Kannel 'cvs-20060727' instead of
1.4.0 with a few patches (such as improved SMPP throttling,
customised smsbox 'success' HTTP response, etc.) and are now running
on Fedora Core 3 instead of Red Hat Linux 8.0. Also, it's *possible*
that the carriers may have made some changes on their end - one of
them made some mention about upgrading to support SMPP 3.4?
1.) Does anyone have any further ideas what the problem might be?
2.) Can anyone suggest the best way to get more information about
what's happened to the SMPP bind, and what Kannel is actually doing
just before, and during the time when it 'hangs'? I'm not overly
familiar with Kannel internals, or how to go about debugging it, so
any assistance would be greatly appreciated :)
Thanks,
On 24/08/2006, at 5:28 PM, Giulio Harding wrote:
Stuart's just left for the day, but I've picked out line 1811 in
function io_thread in gw/smsc/smsc_smpp.c from the source code
we're using:
...
/* unbind
* Read so long as unbind_resp received or timeout
passed. Otherwise
we have
* double delivered messages.
*/
if (smpp->quitting) {
send_unbind(smpp, conn);
last_response = time(NULL);
while(conn_wait(conn, 1.00) != -1 &&
difftime(time(NULL), last_response) <
SMPP_DEFAULT_SHUTDOW
N_TIMEOUT &&
smpp->conn->status != SMSCCONN_DISCONNECTED) {
if (read_pdu(smpp, conn, &len, &pdu) == 1) {
dump_pdu("Got PDU:", smpp->conn->id, pdu);
handle_pdu(smpp, conn, pdu, &pending_submits);
smpp_pdu_destroy(pdu);
}
}
debug("bb.sms.smpp", 0, "SMPP[%s]: %s: break and
shutting down",
octstr_get_cstr(smpp->conn->id), __func__);
<<<<================== Line1811
break;
}
send_enquire_link(smpp, conn, &last_enquire_sent);
...
Hope this helps...
(just on a side-note, and this may be related: on the day this
crash occured, later at night, we had an incident with the same
carrier where one of our transmit binds to them died - they were
rejecting our bind enquiries and then bind requests claiming we
were already bound - when I tried to restart the individual link
using the stop-smsc and start-smsc HTTP commands, Kannel hung, no
longer processing MOs/MTs, and the admin page not responding. I
tried to stop kannel, in order to restart it, but stopping it
caused ~30 bearerbox processes to appear, each of which had to be
kill -9'd before I could restart kannel. Not sure if this would
related or not, but I haven't had a chance to investigate
further... just thought I'd mention it in case...)
Thanks,
Andreas Fink wrote:
I was quickly looking through the CVS code of today and wonder if
this was maybe created by the fact that login is being rejected
but enquire link tries to send its message anyway and then crashes.
Could you tell me exactly on your CVS 20060727 version where is
line 1811 in that version?
On 24.08.2006, at 06:03, Mi Reflejo wrote:
Hi,
Which version do you use? If you are not using CVS HEAD please
try it.
Regards,
Martin Conte.
On 8/23/06, Stuart Beck <[EMAIL PROTECTED]
<mailto:[EMAIL PROTECTED]>> wrote:
Hi All,
We recently had a situation where one of the carriers we are
connecting to (2 separate SMSC's) had some unspecified issue and
reconnected, this however caused the following panic
as one of the SMSC's had been turned off (upon communicating
with the carrier it's been switched off for about a week, they
are no longer supporting that service)
I would like to know if anyone can give me any indication as to
why the failure occurred and what can be done about it.
The Kannel/host details are as follows
Kannel bearerbox version `cvs-20060727'.
Build `Aug 2 2006 12:31:01', compiler `3.2 20020903 (Red
Hat Linux 8.0 3.2-7)'.
System Linux, release 2.4.20-19.8smp, version #1 SMP Tue
Jul 15 15:01:43 EDT 2003, machine i686.
Libxml version 2.6.9. Using OpenSSL 0.9.6b [engine] 9
Jul 2001. Using native malloc.
> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32374] [38] ERROR: SMPP[hutchthree]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32448] [40] ERROR: SMPP[hutchorange]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32447] [39] ERROR: SMPP[hutchorange]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:16 [32373] [37] ERROR: SMPP[hutchthree]:
Couldn't connect to SMS center (retrying in 10 seconds).
> 2006-08-23 14:39:26 [32448] [40] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]:
SMSC rejected login to receive, code 0x0000000f (Invalid System
ID).
> 2006-08-23 14:39:26 [32448] [40] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 14:39:26 [32447] [39] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 14:39:26 [32447] [39] ERROR: SMPP[hutchorange]:
SMSC rejected login to transmit, code 0x0000000f (Invalid System
ID).
> 2006-08-23 16:24:12 [4905] [3] INFO: HTTP: Re-starting smsc-id
`hutchorange'
> 2006-08-23 16:24:12 [4905] [3] INFO: Set throughput to 13.000
for smsc id <hutchorange>
> 2006-08-23 16:24:12 [4905] [3] INFO: DLR rerouting for smsc id
<hutchorange> disabled.
> 2006-08-23 16:24:13 [16991] [42] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]:
SMSC rejected login to receive, code 0x0000000f (Invalid System
ID).
> 2006-08-23 16:24:13 [16991] [42] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 16:24:13 [16990] [41] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:13 [16990] [41] ERROR: SMPP[hutchorange]:
SMSC rejected login to transmit, code 0x0000000f (Invalid System
ID).
> 2006-08-23 16:24:38 [4905] [3] INFO: HTTP: Re-starting smsc-id
`hutchorange'
> 2006-08-23 16:24:38 [4905] [3] INFO: Set throughput to 13.000
for smsc id <hutchorange>
> 2006-08-23 16:24:38 [4905] [3] INFO: DLR rerouting for smsc id
<hutchorange> disabled.
> 2006-08-23 16:24:38 [17116] [44] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]:
SMSC rejected login to receive, code 0x0000000f (Invalid System
ID).
> 2006-08-23 16:24:38 [17116] [44] ERROR: SMPP[hutchorange]: I/O
error or other error. Re-connecting.
> 2006-08-23 16:24:38 [17115] [43] WARNING: SMPP: PDU NULL
terminated string has no NULL.
> 2006-08-23 16:24:38 [17115] [43] ERROR: SMPP[hutchorange]:
SMSC rejected login to transmit, code 0x0000000f (Invalid System
ID).
> 2006-08-23 16:24:43 [16990] [41] PANIC: gwlib/octstr.c:2461:
seems_valid_real: Assertion `ostr->len + 1 <= ostr->size'
failed. (Called from gw/smsc/smsc_smpp.c:1811:io_thread.)
> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/
bearerbox(gw_panic+0xfd) [0x80b794d]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/
bearerbox [0x80bcddc]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/
bearerbox(octstr_get_cstr_real+0x20) [0x80b8bf0]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/
bearerbox [0x8080b5e]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /opt/kannel/sbin/
bearerbox [0x80af177]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /lib/i686/
libpthread.so.0 [0x400b1881]
> 2006-08-23 16:24:43 [16990] [41] PANIC: /lib/i686/libc.so.6
(__clone+0x57) [0x420e40c7]
--
Stuart Beck
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com
--
Giulio Harding
Systems Administrator
m.Net Corporation
Level 13, 99 Gawler Place
Adelaide SA 5000, Australia
Tel: +61 8 8210 2041
Fax: +61 8 8211 9620
Mobile: 0432 876 733
MSN: [EMAIL PROTECTED]
http://www.mnetcorporation.com