Yeah, that narrows the window, but it's still there. The only real answer is to put mutex protection around the reference, both here and in the purge. I'm just concerned about timing issues: during the purge, the mutex will be held for quite a while, so we'd probably have to use the same type of periodic loop as in freeHostSessions().
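A minimal sketch of that idea, not the actual ntop change. It assumes a plain pthread mutex and a standalone table; the names ipPortsMutex, getCounter(), updatePorts(), purgePorts() and PURGE_CHUNK are hypothetical stand-ins for myGlobals.device[n].ipPorts and the real purge. The create-if-missing and the increment share one lock hold, and the purge drops the lock between chunks so packet processing is never blocked for the whole table walk:

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define MAX_IP_PORT 65536   /* one slot per TCP/UDP port */
#define PURGE_CHUNK 1024    /* hypothetical: slots purged per lock hold */

typedef struct {
  unsigned short port;
  unsigned long sent, rcvd;
} PortCounter;

/* Hypothetical stand-ins for myGlobals.device[n].ipPorts and its lock. */
static PortCounter *ipPorts[MAX_IP_PORT];
static pthread_mutex_t ipPortsMutex = PTHREAD_MUTEX_INITIALIZER;

/* Caller must hold ipPortsMutex. Returns NULL if malloc() fails. */
static PortCounter *getCounter(int port) {
  if(ipPorts[port] == NULL) {
    PortCounter *pc = malloc(sizeof(PortCounter));
    if(pc == NULL) {
      fprintf(stderr, "ipPorts[%d] malloc() failed: %s\n", port, strerror(errno));
      return NULL;
    }
    pc->port = (unsigned short)port;
    pc->sent = 0;
    pc->rcvd = 0;
    ipPorts[port] = pc;
  }
  return ipPorts[port];
}

/* Create-if-missing and increment happen under a single lock hold, so a
   purge cannot free the counter in between. Returns 0 on success, -1 on
   a bad port or a failed allocation (nothing is counted in that case). */
int updatePorts(int sport, int dport, unsigned long length) {
  PortCounter *s, *d;
  int rc = -1;

  if(sport < 0 || sport >= MAX_IP_PORT || dport < 0 || dport >= MAX_IP_PORT)
    return -1;

  pthread_mutex_lock(&ipPortsMutex);
  s = getCounter(sport);
  d = getCounter(dport);
  if(s != NULL && d != NULL) {
    s->sent += length;
    d->rcvd += length;
    rc = 0;
  }
  pthread_mutex_unlock(&ipPortsMutex);
  return rc;
}

/* Purge in chunks, releasing the lock between chunks.
   Returns the number of counters freed. */
int purgePorts(void) {
  int start, i, freed = 0;

  assert(PURGE_CHUNK > 0);
  for(start = 0; start < MAX_IP_PORT; start += PURGE_CHUNK) {
    pthread_mutex_lock(&ipPortsMutex);
    for(i = start; i < start + PURGE_CHUNK && i < MAX_IP_PORT; i++) {
      if(ipPorts[i] != NULL) {
        free(ipPorts[i]);
        ipPorts[i] = NULL;
        freed++;
      }
    }
    pthread_mutex_unlock(&ipPortsMutex);
  }
  return freed;
}
```

The chunk size is the knob for the timing concern: a smaller PURGE_CHUNK means shorter lock holds but more lock/unlock traffic. This sketch frees every counter; a real purge would presumably keep the active ones.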
-----Burton

-----Original Message-----
From: Dominique Lalot [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 27, 2003 11:56 AM
To: Burton M. Strauss III
Subject: RE: [Ntop] PR_MHVS2W ntop stops after a while

My switch is buggy -> reboot it and see all the traffic!

Just an idea, as I also patched that code:

  You test if sport is uninitialized; if yes, init value
  You test if dport is uninitialized; if yes, init value
  fill value for sport
  fill value for dport

Since there are some race conditions, I suggest shortening the delay, as the malloc for dport could take long. I noticed too that it never happens for dport.
A traceEvent with strerror(errno) would be better after malloc problems. Less work for support, as people will understand that it is a memory problem.

  You test if sport is uninitialized; if yes, init value
  fill value for sport
  You test if dport is uninitialized; if yes, init value
  fill value for dport

Regards
Dom

At 06:52 27/05/2003 -0500, you wrote:

The 'beauty' of the patch is that, since it's testing and issuing a warning message, it's not using the NULL value to add to, so it's not segfaulting. You shouldn't see ANY of your TEMP5/TEMP6 messages.

Look over the code...

580   if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
581     traceEvent(CONST_TRACE_INFO, "TEMP: myGlobals.device[%d].ipPorts[%d] NULL", actualDeviceId, sport);
582     myGlobals.device[actualDeviceId].ipPorts[sport] = (PortCounter*)malloc(sizeof(PortCounter));
583     if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
584       traceEvent(CONST_TRACE_INFO, "TEMP: myGlobals.device[].ipPorts[] malloc() fail");
585     }

THERE SHOULD BE AN ELSE, SO..

586     myGlobals.device[actualDeviceId].ipPorts[sport] = (PortCounter*)malloc(sizeof(PortCounter));

586 is nonsense. One malloc is enough?

587     myGlobals.device[actualDeviceId].ipPorts[sport]->port = sport;
588     myGlobals.device[actualDeviceId].ipPorts[sport]->sent = 0;
589     myGlobals.device[actualDeviceId].ipPorts[sport]->rcvd = 0;
590   }
<snip />
603   if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
604     traceEvent(CONST_TRACE_INFO, "TEMP: myGlobals.device[%d].ipPorts[%d] NULL", actualDeviceId, sport);
605   } else
606     myGlobals.device[actualDeviceId].ipPorts[sport]->sent += length;

If it's NULL @ 580, it allocates the PortCounter structure at 582, and we test (at 583) that the malloc() worked. No message, no segfault initializing the data at 586ff.

Yet it's null @ 603.

It has to be that the purge, static void purgeIpPorts(int theDevice) {} in main.c, is being fired off and overlapping the increment.

Adding a mutex() there is going to be ugly performance-wise, but I can't help wondering WHY you're getting bit so reliably. Unless - do you have something like WhatsUpGold probing your systems every 5 minutes? Just infrequently enough for ntop to think it's inactive and just often enough to hit this rare condition??

-----Burton

-----Original Message-----
From: Dominique Lalot [mailto:[EMAIL PROTECTED]
Sent: Tuesday, May 27, 2003 3:25 AM
To: Burton M. Strauss III
Subject: RE: [Ntop] PR_MHVS2W ntop stops after a while

At 17:32 24/05/2003 -0500, you wrote:

Argh... Looking at the patch, I used the same text for both messages, so you can't tell them apart except for the MSGID.

581 & 593 are perfectly normal. That's where it creates the counters if they don't already exist.

The ones of interest are 604 & 608, the tests at the point of the add. What I want to see is whether it's not null (so it's not created) and then yet, somehow, IS null at the time it's being added.

-----Burton

Burton, what is strange is that, with your patch applied, the stats for Data Sent look very small, and there are only a few errors!
ulimit is unlimited

free
             total       used       free     shared    buffers     cached
Mem:       1030196     789556     240640          0     155220     408864
-/+ buffers/cache:     225472     804724
Swap:      2040212       1008    2039204

top
 2854 nobody    22   0 73712  71M  2016 S    99.9  7.1   0:00   2 ntop
 2861 nobody    15   0 73712  71M  2016 S     0.0  7.1   0:00   0 ntop
 2862 nobody    24   0 73712  71M  2016 S     0.0  7.1   0:00   2 ntop
 2863 nobody    15   0 73712  71M  2016 S     0.0  7.1   1:26   2 ntop
 2864 nobody    15   0 73712  71M  2016 S     0.0  7.1   0:13   3 ntop
 2865 nobody    16   0 73712  71M  2016 S     0.4  7.1   0:21   3 ntop
 2866 nobody    15   0 73712  71M  2016 S     2.4  7.1  20:49   2 ntop

[EMAIL PROTECTED] root]# grep TEMP /var/log/messages
May 26 19:10:17 ad01u253 ntop[2866]: [MSGID00608-pbuf] TEMP6: myGlobals.device[0].ipPorts[1776] NULL
May 27 05:54:18 ad01u253 ntop[2866]: [MSGID00604-pbuf] TEMP5: myGlobals.device[0].ipPorts[4664] NULL
May 27 05:54:18 ad01u253 ntop[2866]: [MSGID00608-pbuf] TEMP6: myGlobals.device[0].ipPorts[4503] NULL
May 27 08:46:35 ad01u253 ntop[2866]: [MSGID00604-pbuf] TEMP5: myGlobals.device[0].ipPorts[16683] NULL
May 27 08:46:35 ad01u253 ntop[2866]: [MSGID00608-pbuf] TEMP6: myGlobals.device[0].ipPorts[6887] NULL

  if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
    traceEvent(CONST_TRACE_INFO, "TEMP5: myGlobals.device[%d].ipPorts[%d] NULL", actualDeviceId, sport);
  } else
    myGlobals.device[actualDeviceId].ipPorts[sport]->sent += length;

  if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
    traceEvent(CONST_TRACE_INFO, "TEMP6: myGlobals.device[%d].ipPorts[%d] NULL", actualDeviceId, dport);
  } else
    myGlobals.device[actualDeviceId].ipPorts[dport]->rcvd += length;

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, May 23, 2003 11:50 AM
To: Burton M. Strauss III
Subject: Re: [Ntop] PR_MHVS2W ntop stops after a while

I applied it, then stopped ntop, as there are too many prints:

23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[3329] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[37179] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[37209] NULL
23/May/2003 18:22:19 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[1235] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[1222] NULL
23/May/2003 18:22:19 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[4665] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[1708] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[1053] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[37135] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[3290] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[2354] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[1620] NULL
23/May/2003 18:22:19 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[56721] NULL
23/May/2003 18:22:19 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[2978] NULL
23/May/2003 18:22:20 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[1709] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[50010] NULL
23/May/2003 18:22:20 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[2355] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[8050] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[4655] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[4537] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[4656] NULL
23/May/2003 18:22:20 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[2356] NULL
23/May/2003 18:22:20 [MSGID00593-pbuf] TEMP: myGlobals.device[0].ipPorts[2358] NULL
23/May/2003 18:22:20 [MSGID00581-pbuf] TEMP: myGlobals.device[0].ipPorts[1425] NULL

I just added some modifications: a number after TEMP, in the order of your calls:

23/May/2003 18:38:34 [MSGID00581-pbuf] TEMP1: myGlobals.device[0].ipPorts[39192] NULL
23/May/2003 18:38:34 [MSGID00581-pbuf] TEMP1: myGlobals.device[0].ipPorts[4177] NULL
23/May/2003 18:38:34 [MSGID00593-pbuf] TEMP3: myGlobals.device[0].ipPorts[36514] NULL
23/May/2003 18:38:34 [MSGID00581-pbuf] TEMP1: myGlobals.device[0].ipPorts[34019] NULL
23/May/2003 18:38:34 [MSGID00593-pbuf] TEMP3: myGlobals.device[0].ipPorts[21027] NULL
23/May/2003 18:38:34 [MSGID00593-pbuf] TEMP3: myGlobals.device[0].ipPorts[7215] NULL
23/May/2003 18:38:34 [MSGID00581-pbuf] TEMP1: myGlobals.device[0].ipPorts[24077] NULL
23/May/2003 18:38:34 [MSGID00581-pbuf] TEMP1: myGlobals.device[0].ipPorts[3418] NULL

Then I commented out 1 and 3, as the worst is coming from 2 and 4: malloc returning NULL.

I hope it's not a Linux threading problem. I got strange behavior in mod_perl, which is threaded under Apache:

  main modifies a global
  calls a sub
    the sub does not have the same address for the global
  return
  main is still playing with the right address

I was obliged to work with a new parameter for the proc.. as I see myGlobals ..
But you're much more experienced than me on such subjects.

I'm now leaving ntop under gdb all weekend.

Bye
Dominique

On Thu, May 22, 2003 at 11:28:51AM -0500, Burton M. Strauss III wrote:
> That's the same problem that's been reported by others, except according to
> the code, it can't happen...
>
>   if(myGlobals.device[actualDeviceId].ipPorts[sport] == NULL) {
>     myGlobals.device[actualDeviceId].ipPorts[sport] =
>       (PortCounter*)malloc(sizeof(PortCounter));
>     myGlobals.device[actualDeviceId].ipPorts[sport]->port = sport;
>     myGlobals.device[actualDeviceId].ipPorts[sport]->sent = 0;
>     myGlobals.device[actualDeviceId].ipPorts[sport]->rcvd = 0;
>   }
>
>   if(myGlobals.device[actualDeviceId].ipPorts[dport] == NULL) {
>     myGlobals.device[actualDeviceId].ipPorts[dport] =
>       (PortCounter*)malloc(sizeof(PortCounter));
>     myGlobals.device[actualDeviceId].ipPorts[dport]->port = dport;
>     myGlobals.device[actualDeviceId].ipPorts[dport]->sent = 0;
>     myGlobals.device[actualDeviceId].ipPorts[dport]->rcvd = 0;
>   }
>
>
>   myGlobals.device[actualDeviceId].ipPorts[sport]->sent += length;
>   myGlobals.device[actualDeviceId].ipPorts[dport]->rcvd += length;
>
> If ipPorts doesn't exist yet, it's created just above. And we've had others
> test the malloc() returns.
>
> Since you've only got a single interface, thread problems can't be it.
>
> Tell you what, since you seem able to recreate it almost at will...
>
> Apply the attached patch to make it tell us when it's allocating the
> PortCounter and again to test just before the add.
>
> grep out the TEMP lines and send them. Note that the patch should skip the
> add if it's really null, so it won't bomb.
>
> If it DOES bomb, then it's getting corrupted (vs. null) and I'll want to see
> the values for myGlobals.device[0].ipPorts[x] where x are the dport and
> sport values. To print local variables, say from processPacket, you need to
> use
> (gdb) frame 1
> first.
>
> -----Burton
>
> -----Original Message-----
> From: Dominique Lalot [mailto:[EMAIL PROTECTED]
> Sent: Thursday, May 22, 2003 12:54 AM
> To: Burton M. Strauss III
> Subject: Re: [Ntop] PR_MHVS2W ntop stops after a while
>
>
> Burton,
> I was lucky enough to get the failure with the last snapshot; 2.2.1
> seems to have the same problem.
> Using the last snapshot from yesterday ( ntop-03-05-21 ) I was able to trace
> using gdb as written in the FAQ:
> Is it usable for you?
>
> Thanks
>
> Beware that the configuration is almost the same, except that -C is no longer
> working.
>
> Dominique
>
>
>
> last snapshot
> (gdb) set args -u root @manaix.ntop -L -K
> (gdb) run
> Starting program: /usr/bin/.libs/lt-ntop -u root @manaix.ntop -L -K
> [New Thread 16384 (LWP 30646)]
> Wait please: ntop is coming up...
> Processing file manaix.ntop for parameters...
> [New Thread 32769 (LWP 30653)]
> [New Thread 16386 (LWP 30654)]
> [New Thread 32771 (LWP 30655)]
> [New Thread 49156 (LWP 30656)]
> [New Thread 65541 (LWP 30657)]
> [New Thread 81926 (LWP 30658)]
>
>
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 81926 (LWP 30658)]
> 0x4009b082 in updateInterfacePorts (actualDeviceId=0, sport=320, dport=80,
> length=62) at pbuf.c:593
> 593 myGlobals.device[actualDeviceId].ipPorts[sport]->sent += length;
> list
> (gdb) list
> 588 myGlobals.device[actualDeviceId].ipPorts[dport]->port = dport;
> 589 myGlobals.device[actualDeviceId].ipPorts[dport]->sent = 0;
> 590 myGlobals.device[actualDeviceId].ipPorts[dport]->rcvd = 0;
> 591 }
> 592
> 593 myGlobals.device[actualDeviceId].ipPorts[sport]->sent += length;
> 594 myGlobals.device[actualDeviceId].ipPorts[dport]->rcvd += length;
> 595
> 596 #ifdef CFG_MULTITHREADED
> 597 releaseMutex(&myGlobals.gdbmMutex);
> (gdb) bt
> #0 0x4009b082 in updateInterfacePorts (actualDeviceId=0, sport=320,
> dport=80, length=62) at pbuf.c:593
> No locals.
> #1 0x4009d001 in processIpPkt (bp=0x8087928 "E", h=0xbefff9cc, length=62,
> ether_src=0xbefff962 "",
> ether_dst=0xbefff95c "", actualDeviceId=0, vlanId=-1) at pbuf.c:990
> sport = 2247
> dport = 80
> ip = {ip_hl = 5, ip_v = 4, ip_tos = 0 '\0', ip_len = 12288, ip_id = 12935,
> ip_off = 64, ip_ttl = 125 '}',
> ip_p = 6 '\006', ip_sum = 28095, ip_src = {s_addr = 2472469770}, ip_dst =
> {s_addr = 3232267798}}
> tp = {th_sport = 50952, th_dport = 20480, th_seq = 1324592302, th_ack = 0,
> th_x2 = 0 '\0',
> th_off = 7 '\a', th_flags = 2 '\002', th_win = 64, th_sum = 58592, th_urp =
> 0}
> up = {uh_sport = 13568, uh_dport = 37622, uh_ulen = 51712, uh_sum = 62118}
> icmpPkt = {icmp_type = 0 '\0', icmp_code = 0 '\0', icmp_cksum = 0, icmp_hun
> = {ih_pptr = 0 '\0',
> ih_gwaddr = {s_addr = 0}, ih_idseq = {icd_id = 0, icd_seq = 0}, ih_void = 0,
> ih_pmtu = {ipm_void = 0,
> ipm_nextmtu = 0}, ih_rtradv = {irt_num_addrs = 0 '\0', irt_wpa = 0 '\0',
> irt_lifetime = 0}}, icmp_dun = {
> id_ts = {its_otime = 0, its_rtime = 1077415822, its_ttime = 1570113147},
> id_ip = {idi_ip = {ip_hl = 0,
> ip_v = 0, ip_tos = 0 '\0', ip_len = 0, ip_id = 3982, ip_off = 16440, ip_ttl
> = 123 '{', ip_p = 6 '\006',
> ip_sum = 23958, ip_src = {s_addr = 182804115}, ip_dst = {s_addr =
> 749316288}}}, id_radv = {ira_addr = 0,
> ira_preference = 1077415822}, id_mask = 0, id_data = ""}}
> hlen = 20
> tcpDataLength = 0
> udpDataLength = 194
> off = 16384
> tcpUdpLen = 28
> srcHostIdx = 1
> dstHostIdx = 1
> srcHost = (struct hostTraffic *) 0x8085ec8
> dstHost = (struct hostTraffic *) 0x8085ec8
> forceUsingIPaddress = 0 '\0'
> tvstrct = {tv_sec = 0, tv_usec = 0}
> theData = (
> u_char *) 0x8087958
> "[EMAIL PROTECTED])\203\016|+\203\016'��z�9��\200R-\037\037�\034\0
> 37K��U_�$v�\203��X��Q�g\2116�u_\206]Et|\221�"
> ctr = {value = 62, modified = 68 'D'}
> #2 0x4009fc59 in processPacket (_deviceId=0x0, h=0xbefff9cc, p=0x808791a "")
> at pbuf.c:2496
> pppoe_hdr = (struct pppoe_hdr *) 0x3e
> protocol = 0
> ehdr = {ether_dhost = "\000\t��P\200", ether_shost = "\000\0010\035�`",
> ether_type = 8}
> trp = (struct tokenRing_header *) 0x0
> fddip = (struct fddi_header *) 0x400ba380
> hlen = 14
> caplen = 62
> headerDisplacement = 0
> length = 62
> orig_p = (const u_char *) 0x808791a ""
> p1 = (const u_char *) 0x0
> ether_src = (u_char *) 0xbefff962 ""
> ether_dst = (u_char *) 0xbefff95c ""
> eth_type = 2048
> trllc = (struct tokenRing_llc *) 0x3e
> fd = (struct _IO_FILE *) 0x0
> ipxBuffer =
> "[EMAIL PROTECTED]<��\001\000\000\0
> 00\000\000\000\000\000\230;[EMAIL PROTECTED]@�&[EMAIL PROTECTED]@@\025\000\000\000
> �s@
> [EMAIL PROTECTED]@@[EMAIL PROTECTED]@@[EMAIL PROTECTED]
> \000�([EMAIL PROTECTED]>
> \006 [EMAIL PROTECTED]@5\000\000\000\n4Q@"
> actualDeviceId = 0
> vlanId = -1
> #3 0x405150ca in pcap_read () from /usr/lib/libpcap.so.0.6.2
> No symbol table info available.
> #4 0x4051663b in pcap_dispatch () from /usr/lib/libpcap.so.0.6.2
> No symbol table info available.
> #5 0x40097c69 in pcapDispatch (_i=0x0) at ntop.c:109
> rc = 0
> pcap_fd = 4
> readMask = {__fds_bits = {16, 0 <repeats 31 times>}}
> #6 0x4037cc0e in pthread_start_thread_event () from /lib/libpthread.so.0
> No symbol table info available.
> (gdb) print deviceId
> No symbol "deviceId" in current context.
> (gdb)
>
>

_______________________________________________
Ntop mailing list
[EMAIL PROTECTED]
http://listgateway.unipi.it/mailman/listinfo/ntop
