Michal - a nice piece of work. I think you've finally put all the pieces together. I do think, however, that there are some bugs which should be reported to the various maintainers. And I'm still not happy with your proposed solution.
See my notes in-line... Where the heck did you find the documentation for all of this? I've looked numerous times and never came close... I guess I don't know the magic words to search for on the BSD sites...

A++

-----Burton

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
> On Behalf Of Meloun Michal
> Sent: Thursday, February 05, 2004 6:58 AM
> To: [EMAIL PROTECTED]
> Subject: [Ntop-dev] FreeBSD and pthreads
>
> Hi all,
> First, please, excuse my English - may English skill is vary bad, I know.

Nah, your English is fine. MUCH, much better than my zero words of Czech. If you really want, email me off-line and I'll do an edit for you, but it's really just the occasional missing adjective/adverb that us native speakers (oh, bother with that 'Queen's English' nonsense, all you Brits) had drummed into us over 12-20 years of schooling. Stuff like "I probably found real explanation" should be "I have probably found the real explanation". But don't sweat it - it certainly doesn't make the information any less useful, nor any less correct!

> I probably found real explanation for FreeBSD locking problem - the problem
> is common for all systems with a user-mode pthread library - and,
> unfortunately, it's not a bug - all works as declared (plus/minus). Also, I
> must say, my previous trace to pcap_open_live() is not in direct relation
> with the root of this problem. More than that, I can (probably) also explain
> the high packet drops and high CPU usage which are seen on *BSD platforms.
> Both issues are only manifestations of the same "base" problem.
> The main cause is the user-mode pthread library and its interaction with the
> bpf interface.

<rant>I've read and re-read this and it still, IMHO, comes down to the fact that xBSD has user-land threads (whether that's just FreeBSD, or OpenBSD, NetBSD, all or some of them, is irrelevant) and the way libpcap uses the /dev/bpfN stuff is defective.
That's a libpcap problem and should be reported as such - the whole dang idea of using libpcap is to hide all of the device- and OS-specific issues from ntop! libpcap already has a specialized FreeBSD module, why doesn't the fix go in there?????</rant>

> Some theory first:
> A user-mode pthread library implements threads in the user space of the
> process, without any special support from the kernel. The kernel has no
> knowledge about user threads. All thread operations (context switching,
> thread scheduling, etc.) are made in the user-mode library. This
> implementation has some major issues, but only one is important for us.

So to the kernel, there is ONE thread. It's either in run state or blocked. Why it's blocked isn't important (to the kernel). This means that all of the POSIX stuff is simulated. Which is perfectly acceptable to the POSIX standard - it's pretty loose about the specifics of which thread gets the signal and who gets woken when - having a control thread is perfectly legal.

> If any of the user threads blocks in a syscall, then the whole application
> (all threads) blocks too. This is a very important issue. To avoid this, the
> pthreads library contains wrappers over all blocking syscalls, and implements
> some mechanism which converts (potentially) blocking syscalls into
> nonblocking variants. But this mechanism is only a workaround in most cases.
> The only "ready for use" blocking syscalls are nanosleep() and select(),
> nothing more. These two syscalls are implemented properly (they don't block
> in the kernel and don't eat CPU actively). All other blocking syscalls are
> either converted into non-blocking variants which wait in an active cycle for
> completion - so the thread consumes 100% of CPU time - or passed directly to
> the kernel (if a nonblocking variant does not exist) - so the syscall blocks
> all other threads until it returns from the kernel.
What you call an 'active' process is typically referred to as polling. Google define:polling has this about 1/3 of the way down:

    The process of repeatedly testing until a condition becomes true.
    Polling can be inefficient if the time between tests is short compared
    with the time it will take for the condition to become true. A polling
    thread should sleep between consecutive tests in order to give other
    threads a chance to run. An alternative approach to polling is to
    arrange for an interrupt to be sent when the condition is true, or to
    use the wait and notify mechanism associated with threads.
    www.cs.ukc.ac.uk/people/staff/djb/book/glossary.html

But returning to your analysis, most calls are wrapped to convert them into polling calls. OK, that certainly explains the CPU usage (and incidentally why changing select() to poll() didn't change anything in my tests).

> Allow me to explain it on standard file operations.
>
> The code:
>
>     fd = open(...);
>     read(fd, ...);
>
> executed in the user-mode pthread library context works a little differently:
>
> fd = open(...) is called with the O_NONBLOCK flag added, so all operations on
> this fd are nonblocking (the flag is added in uthread_fd.c, using
> __sys_fcntl(fd, F_SETFL, entry->flags | O_NONBLOCK);).
>
> And the read(fd, ...) is converted into an active wait loop (code is stripped):
>
>     while ((ret = __sys_read(fd, buf, nbytes)) < 0) {
>         if (errno != EWOULDBLOCK)
>             break;
>     }
>
> So the reading thread actively waits for read completion.

Nice example. Crystal clear - and all of this is happening under the covers, so to speak, so when you port 'Standard Unix' code, you never see the differences except in the CPU usage...

> This is very important mainly in the interaction with the /dev/bpfxxx device
> driver (the pcap library uses this device driver). The behavior of this
> driver is special (different from regular files, pipes or sockets) in many
> cases.
> One important difference is: the O_NONBLOCK in open is ignored, and
> nonblocking mode must be set using ioctl(). So if read() is executed on a bpf
> device, it simply blocks all threads within the process until the read
> syscall returns back into user space.

I really don't understand /dev/bpfN. Other than a lot of complaints about 'not enough' devices, there's not much information. As a guess, this is a 'fake' device which allows you to read only those packets which pass the filter expression. So conceptually there's a /dev/sis0 (or whatever the NIC is called) that you can - in userland with the right permissions - read all packets from, and a /dev/bpfN that - also in userland - you can read only those that pass the filter. This means only the filter program has to be attached to the kernel, instead of the entire libpcap. Right or Wrong???

Interestingly, with the hints from your work, I was able to find this
http://www.ethereal.com/lists/ethereal-dev/199901/threads.html#00014, esp.
http://www.ethereal.com/lists/ethereal-dev/199901/msg00050.html
which is a not totally coherent discussion of the state of threads a few years ago, userland vs. kernel threads, etc. What makes it interesting is that they discuss the wrappers and make it seem that select() working at all with /dev/bpfN is more accidental than purposeful.

> Unfortunately, ntop uses pcap_dispatch() in the main packet reading thread.
> And pcap_dispatch() is implemented using this code (pcap-bpf.c):
>
>     if (p->cc == 0) {
>         cc = read(p->fd, (char *)p->buffer, p->bufsize);

So shouldn't this be wrapped, perhaps with the internal version of sched_yield()??? I mean, isn't this really the libpcap bug we're fighting?? If they properly converted this, it would fail with EAGAIN and you could wrap it thusly (remembering that read() returns -1 and sets errno, rather than returning EAGAIN):

    if (p->cc == 0) {
        do {
            cc = read(p->fd, (char *)p->buffer, p->bufsize);
            if (cc >= 0 || errno != EAGAIN)
                break;
            sched_yield();
        } while (1);
    }

????
> so then this call blocks all other ntop threads (including the web server in
> select), until the bpf device returns a filled buffer back to ntop. And this
> can be a very long time on a lightly loaded network.

Right... I understand it now. Explains why the web server seems to come up after a long while and even can be snappy. Also explains why Stanley saw some machines work and others hang - it's roughly dependent on enough traffic to keep this from 'hanging'.

> And worse than that, the ntop code calls sched_yield() at many places -> so
> the main packet thread is regularly scheduled - and the block occurs again
> and again.

Maybe / maybe not. Just because you give up the CPU doesn't mean you don't get it back - there's no guarantee. In fact, depending on whether sched_yield() is wrapped (so the remainder of the slice goes to another user-land thread) or not (returns to the kernel), you could see vastly different results. This probably explains why the --disable-schedyield flag helps. We also don't call it in 'a lot of places' - the calls are mostly in the middle of the purge process, so that we don't starve everything else. There's also a set of calls when we queue a packet instead of processing it directly; we then sched_yield() so maybe the dequeue gets a shot right then, instead of our attempting to process the next packet and blocking.

> The hang is hard (read never returns) if you have no traffic on the network
> (I'm not sure about signal delivery here). But, because ntop uses
> pcap_open_live() with a timeout, pcap_dispatch() returns at regular
> intervals, allowing slow processing of other threads.

Signals would be a POSIX issue - there's no guarantee which thread gets them. Practically, from your description, they all go to this thread controller process, and which one IT picks to wake up isn't documented. It could be one we want, or it could just go back to sleep...

> Unfortunately, this has one more issue (probably FreeBSD specific).
Really, I think ALL of this is FreeBSD specific (or at least userland-threads-OS specific). Since we don't run under NetBSD except single-threaded, and we have only one user under OpenBSD - whom I haven't heard from since a last 'it starts up and works (for a little while)' report - for ntop, userland threads == FreeBSD.

> The pcap_open_live() function uses the BIOCSRTIMEOUT ioctl to pass the
> timeout value down to the bpf driver.
> But the man page for bpf has this sentence:
>
>     BIOCGRTIMEOUT (struct timeval)
>         Set or get the read timeout parameter. The argument specifies
>         the length of time to wait before timing out on a read request.
>         This parameter is initialized to zero by open(2), indicating no
>         timeout.
>
> Note -> "This parameter is initialized to zero by open(2), indicating no
> timeout".

Here's where I think there is a bug in libpcap or in the thread wrapper. Either the wrapped-calls library or libpcap itself should be CONSISTENTLY converting these blocking calls into non-blocking + polling calls. If it's documented that the parameter isn't honored, then the #if FREEBSD code should be in one of those two libraries, not expecting our program to make a special call for this one variant case... WHY IS THIS ONE CALL DIFFERENT THAN ALL THE OTHERS??? If there's a reason for it, then it should be 'well known' and the libpcap code should make the ioctl() call. If there's no good reason for it, then the thread wrapper should be fixed. Either way, it's not ntop's bug, although we are going to have to code around it, just like the PR 53515 crud...

> And because fork() uses dup2() for file descriptor cloning, and dup2() on
> FreeBSD uses open(), fork() also clears the timeout value. This explains why
> the order of fork() and pcap_open_live() is important - in one case,
> pcap_dispatch() blocks in the kernel until it gets data, without a timeout;
> in the second case pcap_dispatch() has a 100ms timeout - so other threads
> can run.

This is fun.
The usual answer is that "POSIX threads and fork() don't co-exist" (the fact that everyone does it is conveniently forgotten). But if I'm understanding the actual 'hang', you should be able to construct a small failing program without a single fork() call. Just libpcap + POSIX. That might get attention!

> But, the bug is here in all cases - using a blocking read() in a user-mode
> pthread library is simply prohibited.
>
> Proposed solution:
>
> All changes are in pcapDispatch():
> - Use ioctl(myGlobals.device[i].pcapPtr->fd, BIOCSRTIMEOUT, ...) to restore
>   the timeout value.
> - Always set nonblocking mode for pcap.

Why? The existing --set-pcap-nonblock seems to work just fine. It sounds like we should just force the switch on if it's FreeBSD... What does adding the ioctl() do? It's got to be a poll()-type loop anyway - at least my version uses nanosleep() so it doesn't peg CPU usage at 100%! If you want to create a patch, I guess we can benchmark the two. I think it would make things clearer to me, vs. just words.

> - And (mainly) use select() before pcap_dispatch().

I still renew my objection - I don't know what this is going to do to ntop. If nothing else, it makes my planned post-3.0 thread watchdog impossible, since I would have to stop the packet capture, restart the web server and then restart packet capture. Nixing the whole idea of being able to restart a dead web server w/o impacting the counts we've already accumulated.

> I'm ready to answer any additional questions, or if anything needs to be more
> detailed - simply anything.
>
> Michal Meloun

_______________________________________________
Ntop-dev mailing list
[EMAIL PROTECTED]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev
