Thanks for the rapid and detailed response!
I haven’t restarted my ntop instance yet in case it can be used for further tests, so it’s still available for debugging with gdb.
Regarding the packet counts:
After a few more hours, the counts are exactly the same:
I think you are right - the NPA (Network Packet Analyzer) or NPS (Network Packet Sniffing) thread(s) has/have died and that's why purge gradually ages every host out of the structures - it's not seeing any more traffic.
I'd bet NPA because of this:
Packets
Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)
Queued.....0 (0.0 %)
Packets
Received.....310,400,744
Processed Immediately.....199,990,101 (64.4 %)
Queued.....0 (0.0 %)
(I got these from a different browser to make sure they didn’t come from the browser cache.) It looks to me like the NPS thread (at least) is dead.
I didn’t see any mention of this on any web page this morning (and I still don’t now). Where would it appear? In the footer with the “Report created on ...” text?
So, I wonder if we were in stopcap? If so, it should have been indicated at the bottom of every web page. (I'm going to commit a patch to add the run state to the info.html / textinfo.html page.)
I did that – see below.
You can use gdb to connect to the running ntop. For example:
<snip>
Check the run mode:
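For the record, the attach-and-inspect sequence boils down to roughly the following (PID 30628 and the binary path are the ones from the config report further down; gdb’s symbol-loading output is omitted):

gdb /usr/local/ntop3cvs/bin/ntop 30628
(gdb) print myGlobals.ntopRunState
(gdb) info threads
(gdb) thread 4
(gdb) info stack
(gdb) detach
(gdb) quit

detach leaves the process running, so the instance stays available for further poking afterwards.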
I don’t know how to interpret the “$1 = 4” after the command: print myGlobals.ntopRunState.
I pasted (in dark red, after the “<==” markers below) the corresponding line from the log for each thread. Interesting: it seems the NPA and NPS threads are not dead after all, but the NPS thread (#4) seems to be in a weird state.
It seems the “Bad magic number” message comes from the call to traceEvent at line 71 of util.c:
<snip>
Loaded symbols for /usr/local/ntop3cvs/lib/ntop/plugins/snmpPlugin.so
Reading symbols from /lib/libnss_dns.so.2...done.
Loaded symbols for /lib/libnss_dns.so.2
0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
#2 0x0804ad2b in main (argc=4, argv=0xbfffe4e4) at main.c:588
#3 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6
(gdb) frame 1
#1 0x400eac32 in _ntopSleepWhileSameState (file=0x8052114 "main.c", line=588, uSeconds=4294967292) at util.c:3997
3997 rc = _ntopSleepMSWhileSameState(file, line, 1000L*uSeconds) / 1000L;
(gdb) print myGlobals.ntopRunState
$1 = 4
(gdb) info threads
9 Thread 1117223728 (LWP 30631) 0xffffe002 in ?? () <== THREADMGMT[t1117223728]: NPA: network packet analyzer (packet processor) thread running [p30628]
8 Thread 1125616432 (LWP 30632) 0xffffe002 in ?? () <== THREADMGMT[t1125616432]: SFP: Fingerprint scan thread starting [p30628]
7 Thread 1134009136 (LWP 30633) 0xffffe002 in ?? () <== THREADMGMT[t1134009136]: SIH: Idle host scan thread running [p30628]
6 Thread 1142401840 (LWP 30634) 0xffffe002 in ?? () <== THREADMGMT[t1142401840]: DNSAR(1): Address resolution thread running [p30628]
5 Thread 1150794544 (LWP 30636) 0xffffe002 in ?? () <== THREADMGMT[t1150794544]: WEB: Server connection thread starting [p30628]
4 Thread 1160776496 (LWP 30637) 0xffffe002 in ?? () <== THREADMGMT[t1160776496]: NPS(1): Started thread for network packet sniffing
3 Thread 1195379504 (LWP 31115) 0xffffe002 in ?? () <== THREADMGMT[t1195379504]: RRD: Data collection thread starting [p30628]
2 Thread 1205865264 (LWP 31116) 0xffffe002 in ?? () <== THREADMGMT[t1205865264]: RRD: Throughput data collection: Thread starting [p30628]
1 Thread 1102172480 (LWP 30628) 0xffffe002 in ?? () <== THREADMGMT[t1102172480]: Now running as a daemon
(gdb) thread 9
[Switching to thread 9 (Thread 1117223728 (LWP 30631))]#0 0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x400d2692 in dequeuePacket (notUsed=0x0) at pbuf.c:2407
#2 0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb) thread 4
[Switching to thread 4 (Thread 1160776496 (LWP 30637))]#0 0xffffe002 in ?? ()
(gdb) info stack
#0 0xffffe002 in ?? ()
#1 0x420daf2f in syslog () from /lib/tls/libc.so.6
#2 0x400e87d9 in traceEvent (eventTraceLevel=2, file=0x400fa00e "util.c", line=71,
format=0x400fa820 "Error: bad magic number [expected=%d/real=%d][deviceId=%d]") at util.c:2665
#3 0x400e4a21 in _getFirstHost (actualDeviceId=0, beginIdx=1) at util.c:71
#4 0x400ba79f in updateDeviceHostNameInfo (addr=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", actualDeviceId=0, type=19)
at address.c:54
#5 0x400ba832 in updateHostNameInfo (addr=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, symbolic=0x452fe148 "132.210.146.195", type=19) at address.c:81
#6 0x400bb858 in ipaddr2str (hostIpAddress=
{hostFamily = 2, addr = {_hostIp4Address = {s_addr = 2228392643}, _hostIp6Address = {in6_u = {u6_addr8 = "?\222?\204", '\0' <repeats 11 times>, u6_addr16 = {37571, 34002, 0, 0, 0, 0, 0, 0}, u6_addr32 = {2228392643, 0, 0, 0}}}}}, updateHost=1) at address.c:871
#7 0x400c296c in lookupHost (hostIpAddress=0x452fe4f8, ether_addr=0x452fe8c8 "", vlanId=-1, checkForMultihoming=1 '\001', forceUsingIPaddress=0 '\0',
actualDeviceId=0) at hash.c:1101
#8 0x400ce65f in processIpPkt (bp=0x452fe926 "E", h=0x453009a8, length=522, ether_src=0x452fe8ce "", ether_dst=0x452fe8c8 "", actualDeviceId=0, vlanId=-1)
at pbuf.c:1088
#9 0x400d59e0 in processPacket (_deviceId=0x0, h=0x453009a8, p=0x452fe918 "") at pbuf.c:3528
#10 0x400d2356 in queuePacket (_deviceId=0x0, h=0x453009a8, p=0x80c7c72 "") at pbuf.c:2317
#11 0x419a15aa in pcap_read_packet (handle=0x80c7ad0, callback=0x400d21a8 <queuePacket>, userdata=0x42f8f110 "") at pcap-linux.c:638
#12 0x419a2b9a in pcap_loop (p=0x80c7ad0, cnt=-1, callback=0x400d21a8 <queuePacket>, user=0x0) at pcap.c:81
#13 0x400c9406 in pcapDispatch (_i=0x0) at ntop.c:97
#14 0x41819484 in start_thread () from /lib/tls/libpthread.so.0
(gdb)
Back to what's wrong...
* There are 10K hosts - so if it's something really, really generic, we would see it much more often and much sooner.
How much memory on the host?
Are you running out?
Please scan back in the log for allocation failure messages.
The host has 4 gigs of memory (I wrote 1.5 in the original message but that was from memory - I just checked and it has 4 gigs of real memory). I haven’t seen it run out recently and no messages to that effect appear in the logs.
I don’t see anything that could have happened at 1:44 that seems obviously wrong. As you know, 1:44 is in the middle of the night (OK, maybe the start of the night depending on who you ask!) but I was asleep at home and dreaming of other things than ntop... :-)
* When you started to get the bad magic number messages, there were 11 instances.
What changed around 1:44? It looks like there are 18 hosts that were bad and yet most of them were idle at 1:49.
Does 11 ring any bells? You know, back to Mr. Rogers - some of these things are not like the others? Which of these things do not belong??
We are a university and we have students who have Internet access in the residence halls so there is always some traffic even in the middle of the night, but there is a lot more (3 to 4 times more) during the day.
I think what you asked for is included in the gdb trace above.
* Another question is WHICH bad magic number message this is? Can you run with a higher trace level so that we get the file/line info?? (Again, I'm going to be committing a patch to improve this data).
It looks like the one from hash.c around line 802 in lookupHost(), so it could be called from just about anywhere...
I will do that next and report here the next time it happens.
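For context on what the message means: each host entry carries a magic field that is set when the entry is allocated (the “expected” 1968 in the log), so a mismatch suggests the code is walking an entry that has been freed, zeroed, or never initialised. A minimal sketch of that kind of check follows — the type and function names are illustrative stand-ins, not the actual ntop source; the real code logs through traceEvent with the format string seen in the backtrace above:

/* Illustrative sketch only -- not the actual ntop source. */
#include <stdio.h>

#define HOST_MAGIC 1968                    /* the "expected" value in the log */

typedef struct {                           /* hypothetical stand-in for ntop's host entry */
  int magic;
  /* ... addresses, counters, ... */
} host_entry;

/* Returns 1 if the entry looks sane, 0 if its magic has been cleared
   (i.e. the entry was freed or zeroed while still referenced). */
static int host_is_sane(const host_entry *el, int deviceId) {
  if(el == NULL || el->magic != HOST_MAGIC) {
    fprintf(stderr, "Error: bad magic number [expected=%d/real=%d][deviceId=%d]\n",
            HOST_MAGIC, el ? el->magic : 0, deviceId);
    return 0;
  }
  return 1;
}

So real=0 in our log presumably means some thread still holds a pointer to an entry whose memory has already been zeroed or freed.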
Next Steps: Grab the cvs (wait for the 621 and 622 commit messages) and rerun the test - when it goes bad, please do the gdb stuff above, and also grab the updated log messages. This should give us much more info.
Again, thank you for your support!
Regards,
Marc.
--
Marc Mazuhelli
Computer security analyst
Service des technologies de l'information
Universite de Sherbrooke
Sherbrooke, Quebec, Canada
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Marc Mazuhelli
Sent: Tuesday, August 09, 2005 9:25 AM
To: [email protected]
Subject: [Ntop] A lot of "**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]" messages...
Hi,
I’ve been having a problem with ntop, both with version 3.1 and with a version I installed from CVS a few days ago. After running for a few hours or a few days, I see tons of messages that look like:
**WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Any idea what could be causing this? I checked the FAQ and the list traffic going back a few months and didn’t find anything.
With version 3.1 I seem to recall that ntop crashed completely at that point. In the CVS version the web server keeps running but no traffic is reported – the sniffing thread seems to have stopped working and the “network load statistics” pages show no more traffic.
Here are a few lines from the log starting a few minutes before the incident (which occurred at 01:49:24 for this run):
Aug 9 01:44:31 gbp ntop[30628]: [MSGID8757584] SFP: Ending fingerprint scan cycle 406 - checked 123, resolved 0
Aug 9 01:44:57 gbp ntop[30628]: [MSGID8931710] RRD: Cycle 203 ended, 60 RRDs updated
Aug 9 01:45:23 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1820 [out of 10468] hosts selected
Aug 9 01:45:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1820/10465 hosts deleted, elapsed time is 0.049588 seconds (0.000027 per host)
Aug 9 01:47:01 gbp ntop[30628]: [MSGID8757584] SFP: Ending fingerprint scan cycle 407 - checked 126, resolved 0
Aug 9 01:47:24 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1298 [out of 10094] hosts selected
Aug 9 01:47:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1298/10092 hosts deleted, elapsed time is 0.044375 seconds (0.000034 per host)
Aug 9 01:49:24 gbp ntop[30628]: [MSGID0825709] IDLE_PURGE: Device 0 [eth1] FINISHED selection, 1385 [out of 10360] hosts selected
Aug 9 01:49:24 gbp ntop[30628]: [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug 9 01:49:24 gbp last message repeated 11 times
Aug 9 01:49:24 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1385/10358 hosts deleted, elapsed time is 0.044406 seconds (0.000032 per host)
Aug 9 01:49:24 gbp ntop[30628]: [MSGID9123622] **WARNING** Error: bad magic number [expected=1968/real=0][deviceId=0]
Aug 9 01:49:31 gbp last message repeated 108227 times
<snip />
As you can see, the number of hosts goes down as inactive hosts are deleted and no new traffic is seen. Here are the next few IDLE_PURGE lines:
Aug 9 01:57:25 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1792/3972 hosts deleted, elapsed time is 0.045701 seconds (0.000026 per host)
Aug 9 01:59:26 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 2178/2180 hosts deleted, elapsed time is 0.053174 seconds (0.000024 per host)
Aug 9 02:01:26 gbp ntop[30628]: [MSGID8477291] IDLE_PURGE: Device 0 [eth1]: 1/2 hosts deleted, elapsed time is 0.000308 seconds (0.000308 per host)
Aug 9 02:03:26 gbp ntop[30628]: [MSGID8439789] IDLE_PURGE: Device eth1: no hosts [out of 1] deleted
Here is a report of my configuration as generated by the “show config” page. I’m sniffing a SPAN port on a Cisco Catalyst 3500 located between our firewall and our Internet router. ntop is running on an IBM x335 (Xeon) with 1.5 GB of RAM under Red Hat 9 – nothing else runs on that machine.
Basic Information
ntop Version.....3.1.50
Configured on.....Aug 1 2005 8:20:49
Built on.....Aug 1 2005 08:21:00
OS.....i686-pc-linux-gnu
Running from...../usr/local/ntop3cvs/bin
Libraries in...../usr/local/ntop3cvs/bin
Library path.....(nil)
ntop Process Id.....30628
http Process Id.....30628
Command line
Started as........./usr/local/ntop3cvs/bin/ntop -i eth1 @/etc/ntop3cvs.conf
Resolved to........./usr/local/ntop3cvs/bin/ntop -i eth1 --user ntop --daemon --db-file-path /var/run/ntop3cvs -i eth1 -m xxx.xxx.0.0/16,yyy.yyy.yyy.0/22,yyy.yyy.yyy.0/23,yyy.yyy.yyy.0/24,zzz.zzz.zzz.0/24 --trace-level 5 --use-syslog=local2 --http-server 3000 --https-server 0 --no-mac --track-local-hosts --disable-schedyield --max-table-rows 50 --refresh-time 300 --access-log-file /var/run/ntop3cvs/access.log --skip-version-check
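Incidentally, the @/etc/ntop3cvs.conf referenced above is just a plain parameter file (presumably one option per line); judging from the resolved command line it would contain something like:

--user ntop
--daemon
--db-file-path /var/run/ntop3cvs
-i eth1
-m xxx.xxx.0.0/16,yyy.yyy.yyy.0/22,yyy.yyy.yyy.0/23,yyy.yyy.yyy.0/24,zzz.zzz.zzz.0/24
--trace-level 5
--use-syslog=local2
--http-server 3000
--https-server 0
--no-mac
--track-local-hosts
--disable-schedyield
--max-table-rows 50
--refresh-time 300
--access-log-file /var/run/ntop3cvs/access.log
--skip-version-check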
<snip />
Regards,
Marc Mazuhelli, computer security analyst
Service des technologies de l'information, Université de Sherbrooke
Email addresses:
[EMAIL PROTECTED] to report computer security incidents (viruses, intrusions, etc.)
[EMAIL PROTECTED] to write to me directly
telephone: 819-821-8000, ext. 1019
fax: 819-821-8045
_______________________________________________
Ntop-dev mailing list
[email protected]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev
