[Ntop-dev] RE: [Ntop] A lot of "WARNING Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...

Burton Strauss Tue, 09 Aug 2005 09:55:48 -0700

Title: A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...

(Moved from ntop)

The 'magic number' is a traditional magic number. It's just a data field in the Host Traffic structure with a specific value (Luca likes 1968 for some reason).

With pointer based structures, this is a good idea - it's one more sanity check.

So it sounds like you've got pointer problems. That's not good...

None of the following explains what's wrong - it's all about symptoms...

I think you are right - the NPA (Network Packet Analyzer) or NPS (Network Packet Sniffing) thread(s) has/have died and that's why purge gradually ages every host out of the structures - it's not seeing any more traffic.

I'd bet NPA because of this:

Packets

Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)

Queued.....0 (0.0 %)

(Where are the packets going - the path that increments received but not process is this from queuePacket():

void queuePacket(u_char *_deviceId,
const struct pcap_pkthdr *h,
const u_char *p) {
int len, deviceId, actDeviceId;

...

myGlobals.receivedPackets++;

...

if(myGlobals.ntopRunState > FLAG_NTOPSTATE_RUN) return;

...

}

So, I wonder if we were in stopcap? If so, it should have been indicated at the bottom of every web page. (I'm going to commit a patch to add the run state to the info.html / textinfo.html page.)

You can use gdb to connect to the running ntop. For example:

09:45:16 tigger [Linux] user=ntop pwd=~ $ su -
Password:
[EMAIL PROTECTED] root]# ps axf | grep ntop
2643 ?        S      0:00 \_ sshd: ntop [priv]
2645 ?        S      0:00      \_ sshd: [EMAIL PROTECTED]/0
2218 ?        S     93:14 /usr/bin/ntop -i eth1,eth2 @/etc/ntop.conf -d --use-syslog local3
2252 ?        S      0:00 login -- ntop
1077 ?        S     22:27 /devel/bin/ntop -i eth1,eth2 @/etc/ntopdevel.conf -d --use-syslog local4 --instance devel
[EMAIL PROTECTED] root]# gdb /devel/bin/ntop 1077
GNU gdb Red Hat Linux (6.0post-0.20040223.19rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "i386-redhat-linux-gnu"...Using host libthread_db library "/lib/tls/libthread_db.so.1".

Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
(gdb)

Check the run mode:

(gdb) info stack
#0 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00185986 in __nanosleep_nocancel () from /lib/tls/libpthread.so.0
#2 0x009818ad in _ntopSleepMSWhileSameState (file=Variable "file" is not available.
) at util.c:3977
#3 0x009819af in _ntopSleepWhileSameState (file=0xfffffdfc <Address 0xfffffdfc out of bounds>, line=-516,
    uSeconds=10) at util.c:4006
#4 0x0804ae0a in main (argc=8, argv=0xbff2e7a4) at main.c:588
(gdb) frame 2
#2 0x009818ad in _ntopSleepMSWhileSameState (file=Variable "file" is not available.
) at util.c:3977
3977            if((nanosleep(&sleepAmount, &remAmount) != 0) && (errno == EINTR)) {(gdb) print myGlobals.ntopRunState
$4 = 4

Then you can use these commands:

(gdb) info threads
10 Thread 43310000 (LWP 1080) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
9 Thread 53799856 (LWP 1081) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
8 Thread 64289712 (LWP 1082) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
7 Thread 74779568 (LWP 1083) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
6 Thread 85269424 (LWP 1086) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
5 Thread 95759280 (LWP 1087) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
4 Thread 106249136 (LWP 1088) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
3 Thread 125029296 (LWP 1089) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
2 Thread 145066928 (LWP 1090) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
1 Thread -1208669824 (LWP 1077) 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2

Those thread #s (although gdb displays them as signed) map to the THREADMGMT messages you see in ntop's log:

09:48:38 tigger [Linux] user=ntop pwd=~ $ grep THREADMGMT /var/log/ntopdevel.1
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t3086297472]: Now running as a daemon
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t43310000]: NPA: Started thread for network packet analyzer
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t53799856]: SFP: Started thread for fingerprinting
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t64289712]: SIH: Started thread for idle hosts detection
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t74779568]: DNSAR(1): Started thread for DNS address resolution
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t43310000]: NPA: network packet analyzer (packet processor) thread running [p1077]
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t53799856]: SFP: Fingerprint scan thread starting [p1077]
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t64289712]: SIH: Idle host scan thread starting [p1077]
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t74779568]: DNSAR(1): Address resolution thread running [p1077]
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t85269424]: INITWEB: Started thread for web server
Aug 8 13:02:41 tigger devel[1077]:   THREADMGMT[t85269424]: WEB: Server connection thread starting [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t85269424]: WEB: Server connection thread running [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT: RRD: Started thread (t95759280) for data collection
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t3086297472]: ntop RUNSTATE: INITNONROOT(3)
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t3086297472]: ntop RUNSTATE: RUN(4)
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t106249136]: NPS(1): Started thread for network packet sniffing
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t53799856]: SFP: Fingerprint scan thread running [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t95759280]: RRD: Data collection thread starting [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t64289712]: SIH: Idle host scan thread running [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t125029296]: NPS(2): Started thread for network packet sniffing
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t106249136]: NPS(1,eth1): pcapDispatch thread starting [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t106249136]: NPS(1,eth1): pcapDispatch thread running [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t125029296]: NPS(2,eth2): pcapDispatch thread starting [p1077]
Aug 8 13:02:42 tigger devel[1077]:   THREADMGMT[t125029296]: NPS(2,eth2): pcapDispatch thread running [p1077]
Aug 8 13:02:52 tigger devel[1077]:   THREADMGMT[t145066928]: RRD: Started thread for throughput data collection
Aug 8 13:02:52 tigger devel[1077]:   THREADMGMT[t95759280]: RRD: Data collection thread running [p1077]
Aug 8 13:02:52 tigger devel[1077]:   THREADMGMT[t145066928]: RRD: Throughput data collection: Thread starting [p1077]
Aug 8 13:02:52 tigger devel[1077]:   THREADMGMT[t145066928]: RRD: Throughput data collection: Thread running [p1077]

NPA is 43310000 and NPS are 106249136/125029296 (I run on two NICs). If you don't see them in your gdb output, they've died.

If they are still there, look at what their status is:

(gdb) thread 10
[Switching to thread 10 (Thread 43310000 (LWP 1080))]#0 0x002787a2 in _dl_sysinfo_int80 ()
from /lib/ld-linux.so.2
(gdb) info stack
#0 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00182922 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#2 0x0097cd4c in waitCondvar (condvarId=Variable "condvarId" is not available.
) at util.c:2028
#3 0x0096b6b0 in dequeuePacket (notUsed=0x0) at pbuf.c:2407
#4 0x0017f98c in start_thread () from /lib/tls/libpthread.so.0
#5 0x005587da in clone () from /lib/tls/libc.so.6

(That's normal - it's the wait for a packet from dequeuePacket)

(gdb) thread 4
[Switching to thread 4 (Thread 106249136 (LWP 1088))]#0 0x002787a2 in _dl_sysinfo_int80 ()
from /lib/ld-linux.so.2
(gdb) info stack
#0 0x002787a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00185538 in recvfrom () from /lib/tls/libpthread.so.0
#2 0x002c6788 in pcap_open_live () from /usr/lib/libpcap.so.0.8.3
#3 0x002c812b in pcap_loop () from /usr/lib/libpcap.so.0.8.3
#4 0x0095ba03 in pcapDispatch (_i=0x0) at ntop.c:97
#5 0x0017f98c in start_thread () from /lib/tls/libpthread.so.0
#6 0x005587da in clone () from /lib/tls/libc.so.6

(Similarly normal - pcapDispatch into the libpcap routines)

Back to what's wrong...

* There are 10K hosts - so if it's something really, really generic, we would see it much more often and much sooner.

How much memory on the host?

Are you running out?

Please scan back in the log for allocation failure messages.

* When you started to get the bad magic number messages, there were 11 instances.

What changed around 1:44? It looks like there are 18 hosts that were bad and yet most of them were idle at 1:49.

Does 11 ring any bells? You know, back to Mr. Rogers - some of these things are not like the others? Which of these things do not belong??

* Another question is WHICH bad magic number message this is? Can you run with a higher trace level so that we get the file/line info?? (Again, I'm going to be commiting a patch to improve this data).

It looks like the one from hash.c around 802 in lookupHost(), so it could be called from just about anywhere...

Next Steps: Grab the cvs (wait for the 621 and 622 commit messages) and rerun the test - when it goes bad, please do the gdb stuff above, and also grab the updated log messages. This should give us much more info.

-----Burton

From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 9:25 AM
To: [email protected]
Subject: [Ntop] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...

Hi,

Ive been having a problem with ntop, both version 3.1 and a version that I installed from the CVS a few days ago. After running for a few hours or a few days, I see tons of messages that look like:

**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]

Any idea what can be causing this? I checked the FAQ and back traffic for a few months and didnt find anything.

With version3.1 I seem to recall that ntop completely crashed at that point. In the CVS version the web server keeps on running but no traffic is being reported the sniffing thread seems to have stopped working and the network load statistics pages show no more traffic.

Here are a few lines from the log starting a few minutes before the incident (which occurred at 01:49:24 for this run):

Aug 9 01:44:31 gbp ntop[30628]: [MSGID8757584] SFP: Ending fingerprint scan cycle 406 - checked 123, resolved 0
Aug 9 01:44:57 gbp ntop[30628]: [MSGID8931710] RRD: Cycle 203 ended, 60 RRDs updated
Aug 9 01:45:23 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1820 [out of 10468] hosts selected
Aug 9 01:45:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1820/10465 hosts deleted, elapsed time is 0.049588 seconds (0.000027 per host)
Aug 9 01:47:01 gbp ntop[30628]: [MSGID8757584] SFP: Ending fingerprint scan cycle 407 - checked 126, resolved 0
Aug 9 01:47:24 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1298 [out of 10094] hosts selected
Aug 9 01:47:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1298/10092 hosts deleted, elapsed time is 0.044375 seconds (0.000034 per host)
Aug 9 01:49:24 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1385 [out of 10360] hosts selected
Aug 9 01:49:24 gbp ntop[30628]: [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug 9 01:49:24 gbp last message repeated 11 times
Aug 9 01:49:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1385/10358 hosts deleted, elapsed time is 0.044406 seconds (0.000032 per host)
Aug 9 01:49:24 gbp ntop[30628]: [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug 9 01:49:31 gbp last message repeated 108227 times

As you can see the number of hosts is going down as inactive hosts are being deleted and no new traffic is being seen. Here are the next few IDLE_PURGE lines:

Aug 9 01:57:25 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1792/3972 hosts deleted, elapsed time is 0.045701 seconds (0.000026 per host)
Aug 9 01:59:26 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 2178/2180 hosts deleted, elapsed time is 0.053174 seconds (0.000024 per host)
Aug 9 02:01:26 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1/2 hosts deleted, elapsed time is 0.000308 seconds (0.000308 per host)
Aug 9 02:03:26 gbp ntop[30628]: [MSGID8439789] IDLE_PURGE: Device eth1: no hosts [out of 1] deleted

Here is a report of my configuration as generated by the show config page. Im sniffing a span port on a Cisco Catalyst 3500 which is located between our firewall and our router to the Internet. Ntop is running on an IBM X335 (xeon) with 1.5gigs of RAM with RedHad 9 nothing else runs on that machine.

Basic Information

ntop Version.....3.1.50
Configured on.....Aug 1 2005 8:20:49
Built on.....Aug 1 2005 08:21:00
OS.....i686-pc-linux-gnu
Running from...../usr/local/ntop3cvs/bin
Libraries in...../usr/local/ntop3cvs/bin
Library path.....(nil)
ntop Process Id.....30628
http Process Id.....30628

Command line

Started as........./usr/local/ntop3cvs/bin/ntop -i eth1 @/etc/ntop3cvs.conf
Resolved to........./usr/local/ntop3cvs/bin/ntop -i eth1 --user ntop --daemon --db-file-path /var/run/ntop3cvs -i eth1 -m xxx.xxx.0.0/16,yyy.yyy.yyy.0/22,yyy.yyy.yyy.0/23,yyy.yyy.yyy.0/24,zzz.zzz.zzz.0/24 --trace-level 5 --use-syslog=local2 --http-server 3000 --https-server 0 --no-mac --track-local-hosts --disable-schedyield --max-table-rows 50 --refresh-time 300 --access-log-file /var/run/ntop3cvs/access.log --skip-version-check

<snip />

Regards,

Marc Mazuhelli, analyste en sécurité informatique
Service des technologies de l'information, Université de Sherbrooke
Adresses de courriel:
[EMAIL PROTECTED] pour signaler des incidents
de sécurité informatique (virus, intrusions, etc.)
[EMAIL PROTECTED] pour m'écrire directement
téléphone: 819-821-8000, poste 1019
télécopieur: 819-821-8045

_______________________________________________
Ntop-dev mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev

[Ntop-dev] RE: [Ntop] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...

Reply via email to

[Ntop-dev] RE: [Ntop] A lot of "WARNING Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...