Um... looking over the code I think this is a bad idea ... if any thread blocks, it will also be holding sCM and thus single thread all of ntop.
If you're going to do this, change the mutex_lock to a trylock and if it fails, clear the sCM, then do a lock. Upon return from the lock, check if you hold sCM and lock it if you don't. That only exposes the priority reversion thing, where somebody else is locked on the same mutex, but also holds sCM. That's the piece I can't seem to fix. -----Burton -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of [EMAIL PROTECTED] Sent: Friday, March 07, 2003 2:05 AM To: [EMAIL PROTECTED] Subject: [Ntop-dev] New ntop commit (author deri) Update of /export/home/ntop/ntop In directory jabber:/tmp/cvs-serv29515 Modified Files: util.c Log Message: Fix (hopefully) for the problem: Mar 4 10:34:14 tigger ntop[5834]: WARNING: releaseMutex() call with an UN-LOCKED mutex [rrdPlugin.c:1258] last unlock [pid 5837, pbuf.c:2528] I have moved the stateChangeMutex position in the lock/unlock/tryLockMutex. Basically the PthreadMutex structure should now be updated atomically and not as it used to be (I paste below the mail from Burton that clarifies the problem tackled by this patch). Many thanks to Burton for highlighting the problem. ==================================== From: "Burton M. Strauss III" <[EMAIL PROTECTED]> To: "Ntop-Dev" <[EMAIL PROTECTED]> Cc: "Luca Deri" <[EMAIL PROTECTED]> Subject: RE: WARNING: UN-LOCKED mutex (was RRD problems?) Date: Tue, 4 Mar 2003 17:24:19 -0600 OK, I've (I think) nailed it down. By dint of a number of traceEvent() messages, I was able to capture what's going on - of course, the very existence of the traceEvent() messages made the failure much, much rarer (anyone want to have 60M of log messages?) The good news is that it's not corrupting data. The bad news is there's no easy solution. What normally happens when you use a mutex is this: Thread A Thread B LOCK ... LOCK(blocks) UNLOCK (unblocks) ... or ... UNLOCK And whether Thread A continues immediately after the UNLOCK or thread B starts up, the mutex is correctly protecting the data used by the 1..n pcap threads and the rrd thread. I've verified this behavior in the code and via the traceEvent() messages. So our data is safe. If you look at the ntop instructions (util.c) at a more detailed level, you'll see that before and after the LOCK and UNLOCK calls, we set some data so we can report on the usage of the mutexes. What we actually have within ntop is a standard POSIX mutex with some external data linked to the pthread_mutex_t variable. Data that we're trying to treat as if it were part of the opaque mutex data: typedef struct pthreadMutex { pthread_mutex_t mutex; char isLocked, isInitialized; char lockFile[64]; int lockLine; pid_t lockPid; char unlockFile[64]; int unlockLine; pid_t unlockPid; u_int numLocks, numReleases; time_t lockTime; char maxLockedDurationUnlockFile[64]; int maxLockedDurationUnlockLine; int maxLockedDuration; char where[64]; char lockAttemptFile[64]; int lockAttemptLine; pid_t lockAttemptPid; } PthreadMutex; Now the lockAttemptxxxxx stuff can be stepped on, it's just for reporting ... if we really have a deadlock, it will almost certainly be set sometimes, so we can see the deadlock. If it gets stepped on once in a while, s'be'it. And realistically, the other data fields are used only for reporting also, so if they get stomped, well, it's not the end of the earth. But we do use the isInitialized and isLocked fields to generate our warning messages - and that's the problem. Luca tried to add a stateChangeMutex (global, across all mutexes and threads) to control changes to the state data. However, there's a window (race) between the UNLOCK and the LOCK/set/UNLOCK of this other mutex. Follow along: The first letter, A or B is the thread whose instruction is actually being executed. The # is the line number in util.c, in the current cvs (04March2003, 4pm CST). Time flows down. Assume a uni-processor machine for simplicity - it's just much more likely with SMP. Typically A and B are both pcap (packet capture) threads, but there's also a lock of the same mutex in the rrd dumpHosts logic. (We see the messages more with RRD running and on sites with multiple interfaces because there's more threads = more chances for the race to occur). When we pick up the game, thread A has the mutex locked, and thread B is waiting (for a packet to arrive) B Packet arrives... B Processes ... B calls _accessMutex B 0931 if(mutexId == NULL) { B 0938 if(!mutexId->isInitialized) { B 0949 myPid=getpid(); B 0950 if(mutexId->isLocked) { B 0960 strcpy(mutexId->lockAttemptFile, fileName); B 0961 mutexId->lockAttemptLine=fileLine; B 0962 mutexId->lockAttemptPid=myPid; B 0964 rc = pthread_mutex_lock(&(mutexId->mutex)); Since A has the mutex locked, B BLOCKS and A begins to process A Processes ... A calls _releaseMutex A 1125 if(mutexId == NULL) { A 1132 if(!mutexId->isInitialized) { A 1139 pthread_mutex_lock(&stateChangeMutex); A 1141 if(!mutexId->isLocked) { DOES NOT PRINT WARNING ... this is correct A 1152 rc = pthread_mutex_unlock(&(mutexId->mutex)); With the mutex unlocked, B's lock call will succeed and BOTH threads are now ready to run. Which one actually runs is left up to the implementation of pthreads for the OS. It can be deterministic or random or anything else. What often happens is that A continues running, because after all, it's the thread that has control of the processor and it's time slice probably hasn't expired. But it doesn't HAVE to be that way. Remember, threads can be separate OS processes or not. And the thread manager may (or may not) be a separate OS process. So there could be OS task switches involved. Or not. Version 1 - A continues to run... ========= A 1153 pthread_mutex_unlock(&stateChangeMutex); A 1155 if(rc != 0) A 1158 else { A 1177 } A 1180 pthread_mutex_lock(&stateChangeMutex); A 1181 mutexId->isLocked = 0; ... Whenever B picks up, it sets the add-on information: B 0966 mutexId->lockAttemptFile[0] = '\0'; B 0967 mutexId->lockAttemptLine=0; B 0968 mutexId->lockAttemptPid=(pid_t) 0; B 0970 if(rc != 0) B 0973 else { B 0980 mutexId->numLocks++; B 0981 pthread_mutex_lock(&stateChangeMutex); B 0982 mutexId->isLocked = 1; ... And we have the correct external state information (the mutex is locked, by B). The lock on stateChangeMutex seems to protect isLocked. But, what about version 2?? Version 2 ========= B starts running... (which is perfectly legal for an unlock to cause a task switch). B 0966 mutexId->lockAttemptFile[0] = '\0'; B 0967 mutexId->lockAttemptLine=0; B 0968 mutexId->lockAttemptPid=(pid_t) 0; B 0970 if(rc != 0) B 0973 else { B 0980 mutexId->numLocks++; B 0981 pthread_mutex_lock(&stateChangeMutex); And B is now blocked... So A starts running A 1153 pthread_mutex_unlock(&stateChangeMutex); And a task switch occurs (this is perfectly legal - ANY unblocked thread can start up at ANY time). B 0982 mutexId->isLocked = 1; ... B 0989 pthread_mutex_unlock(&stateChangeMutex); So far, so good, the mutex is locked and isLocked shows it. But, then A continues running... A 1155 if(rc != 0) A 1158 else { A 1177 } A 1180 pthread_mutex_lock(&stateChangeMutex); A 1181 mutexId->isLocked = 0; ... Oops, isLocked is 0. When A calls _releaseMutex, we'll get the error message... Since there's no way in POSIX to 'swap' a lock (i.e. atomically unlock the mutex while locking the stateChangeMutex), there's no easy way to fix this. We can't just hold stateChangeMutex across the whole of the _accessMutex / _releaseMutex, because it would deadlock. My first thought is to blow away all of the lock data, except maybe lockAttempt, since we only use it for reporting anyway. My second thought is that we can't be the first people to want to do this, and there must be something on the web showing how to do this safely... Anyone want to add any thoughts? -----Burton -----Original Message----- From: Burton M. Strauss III [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 04, 2003 11:11 AM To: [EMAIL PROTECTED] Cc: Luca Deri Subject: WARNING: UN-LOCKED mutex (was RRD problems?) It's not related to anything else, AFAIK. It's something Luca and I are aware of but really don't understand well enough to fix. There is a mutex, hostsHashMutex (myGlobals, notch), which should prevent pbuf from updating the data that rrd is in the middle of dumping. If you look at my version of the log message, you'll see that different pids (threads) are doing the unlocks: Mar 4 10:34:14 tigger ntop[5834]: WARNING: releaseMutex() call with an UN-LOCKED mutex [rrdPlugin.c:1258] last unlock [pid 5837, pbuf.c:2528] (gdb) info thread 9 Thread 57352 (LWP 5837) 0x420d3b2e in select () from /lib/i686/libc.so.6 ... 6 Thread 32773 (LWP 5834) 0x420ae561 in nanosleep () from /lib/i686/libc.so.6 (gdb) thread 6 [Switching to thread 6 (Thread 32773 (LWP 5834))]#0 0x420ae561 in nanosleep () from /lib/i686/libc.so.6 (gdb) info stack #0 0x420ae561 in nanosleep () from /lib/i686/libc.so.6 #1 0x4045983c in nanosleep () from /lib/i686/libpthread.so.0 #2 0x420ae481 in sleep () from /lib/i686/libc.so.6 #3 0x4011d3a9 in ntop_sleep (secs=300) at util.c:3012 #4 0x40665b09 in rrdMainLoop (notUsed=0x0) at rrdPlugin.c:1098 #5 0x40455941 in pthread_start_thread () from /lib/i686/libpthread.so.0 (gdb) thread 9 [Switching to thread 9 (Thread 57352 (LWP 5837))]#0 0x420d3b2e in select () from /lib/i686/libc.so.6 (gdb) info stack #0 0x420d3b2e in select () from /lib/i686/libc.so.6 #1 0x40152e40 in __JCR_LIST__ () from /usr/lib/libntop-2.1.58pre.so #2 0x40455941 in pthread_start_thread () from /lib/i686/libpthread.so.0 And from the log, Mar 3 21:49:03 tigger ntop[5837]: THREADMGMT: pcap dispatch thread started... 5834 is the rrd thread, 5837 is the pcap capture (void* pcapDispatch(void *_i) {}) rrd should be locking the mutex as it's dumping blocks of hosts. We free the mutex every so often so that we don't block pcap. But if pcap runs during the time rrd has the mutex, it should BLOCK. Instead it seems to keep running... >From man pthread_mutex_lock: pthread_mutex_lock locks the given mutex. If the mutex is currently unlocked, it becomes locked and owned by the calling thread, and pthread_mutex_lock returns immediately. If the mutex is already locked by another thread, pthread_mutex_lock suspends the calling thread until the mutex is unlocked. We know it's not a recursive lock (see man pthread_mutexattr_init), because pcap would hang. And besides that's not the default Linux behavior. What I'm coming to think is that pcap IS sleeping, but is woken up by the arrival of another packet. That's going to take some proving and then some tricky coding. POSIX threads + Signals is a real boundary condition... -----Burton -----Original Message----- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] Behalf Of [EMAIL PROTECTED] Sent: Friday, February 28, 2003 10:22 AM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED] Subject: RE: [Ntop-dev] RRD problems? That appeared to have solved the problem. Thanks. Enclosed is a curious output that seems to be related, although not a grievous at the previous "Errors".. What does this indicate? Anything I need to worry about? 28/Feb/2003 11:06:10 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1221] 28/Feb/2003 11:06:15 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1221] 28/Feb/2003 11:06:16 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1232] 28/Feb/2003 11:07:11 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1232] 28/Feb/2003 11:08:13 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1232] 28/Feb/2003 11:09:16 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1232] 28/Feb/2003 11:10:11 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1221] 28/Feb/2003 11:10:15 WARNING: releaseMutex() call with an UN-LOCKED mutex [pbuf.c:2528] last unlock [pid 16828, rrdPlugin.c:1232] 28/Feb/2003 11:10:16 WARNING: releaseMutex() call with an UN-LOCKED mutex [lastSeenPlugin.c:96] last unlock [pid 16826, ntop.c:646 -- J. Eric Josephson Director of Network and System Operations 978-720-2159 mailto:[EMAIL PROTECTED] _______________________________________________ Ntop-dev mailing list [EMAIL PROTECTED] http://listgateway.unipi.it/mailman/listinfo/ntop-dev _______________________________________________ Ntop-dev mailing list [EMAIL PROTECTED] http://listgateway.unipi.it/mailman/listinfo/ntop-dev
