<text is stripped down and commented in-line>

> 
> Where the heck did you find the documentation for all of this?  I've looked
> numerous times and never came close...  I guess I don't know the magic words
> to search for in the BSD sites...

Mainly from the FreeBSD CVS site. And, of course, Google is always a nice source 
of a small number of relevant documents (and tons of noise) :)

pcap library -> http://www.freebsd.org/cgi/cvsweb.cgi/src/contrib/libpcap/
bpf driver   -> http://www.freebsd.org/cgi/cvsweb.cgi/src/sys/net/
pthreads     -> http://www.freebsd.org/cgi/cvsweb.cgi/src/lib/libc_r/uthread/

Note - all the stuff in the "contrib" directory is imported from other projects and 
its contents are frozen. Only reimports of new versions or security patches are 
allowed in this area.

> 
> > The main cause is the user-mode pthread library and its interaction with
> > the bpf interface.
> 
> <rant>I've read and re-read this and it still, IMHO, comes down to the fact
> that xBSD has user-land threads (whether that's just FreeBSD, OpenBSD,
> NetBSD all or some of them is irrelevant) and the way libpcap uses the
> /dev/bpfN stuff is defective.  That's a libpcap problem and should be
> reported as such - the whole dang idea of using libpcap is to hide all of
> the device and OS specific issues from ntop!  libpcap already has a
> specialized freebsd module, why doesn't the fix go in there?????</rant>

I cannot find anything FreeBSD-specific in libpcap, only a linux module. 
The handling of /dev/bpfN inside libpcap is correct for a standard single-threaded
program. But, of course, it isn't correct for a multithreaded environment.
It's hard to say whether this can be called a "bug" or whether it simply isn't
designed for multithreaded use. Note - the pcap library isn't thread safe at all. 

> 
> > Some theory first:
> > The user-mode pthread library implements threads in the user space of the
> > process, without any special support from the kernel. The kernel has no
> > knowledge about user threads. All thread operations (context switching,
> > thread scheduling etc.) are made in the user-mode library. This
> > implementation has some major issues, but only one matters for us.
> 
> So to the kernel, there is ONE thread.  It's either in run state or blocked.
> Why it's blocked isn't important (to the kernel).
> 
> This means that all of the POSIX stuff is simulated.  Which is perfectly
> acceptable to the POSIX standard - it's pretty loose about the specifics of
> which thread gets the signal and who gets woken when - having a control
> thread is perfectly legal.
> 

Yep, exactly.

> > If any of the user threads blocks in a syscall, then the whole application
> > (all threads) blocks too. This is a very important issue. To avoid this,
> > the pthreads library contains wrappers over all blocking syscalls, and
> > implements a mechanism which converts (potentially) blocking syscalls into
> > nonblocking variants. But this mechanism is only a workaround in most
> > cases. The only "ready for use" blocking syscalls are nanosleep() and
> > select(), nothing more. These two syscalls are implemented properly (they
> > don't block in the kernel and don't actively eat CPU). All the other
> > blocking syscalls are either converted into non-blocking variants, which
> > wait in an active cycle for completion - so the thread consumes 100% of
> > CPU time - or passed directly to the kernel (if a nonblocking variant
> > doesn't exist) - so the syscall blocks all other threads until it returns
> > from the kernel.
> 
> What you call an 'active' process is typically referred to as polling()
> Google define:polling has this about 1/3 of the way down:
> 
> The process of repeatedly testing until a condition becomes true. Polling
> can be inefficient if the time between tests is short compared with the time
> it will take for the condition to become true. A polling thread should sleep
> between consecutive tests in order to give other threads a chance to run. An
> alternative approach to polling is to arrange for an interrupt to be sent
> when the condition is true, or to use the wait and notify mechanism
> associated with threads.
> www.cs.ukc.ac.uk/people/staff/djb/book/glossary.html
> 

> But returning to your analysis, most calls are wrapped to convert them into
> polling calls. OK, that certainly explains the cpu usage (and incidentally
> why changing select() to poll() didn't change anything in my tests).
> 

But don't mix up the poll() syscall with the verb "polling" - the poll() syscall
is only a "modern" implementation of select() - the bit fields in select() are
really not optimal.

> > Allow me explain it on standard file operations.
> >
> > The code:
> >
> > fd = open(..);
> > read(fd, ?)
> >
> > executed in the user-mode pthread library context works a little differently:
> >
> > fd = open(?) is called with the O_NONBLOCK flag added, so all operations
> > on this fd are nonblocking (the flag is added in uthread_fd.c, using
> > __sys_fcntl(fd, F_SETFL, entry->flags | O_NONBLOCK);).
> >
> > And the read(fd,..) is converted into an active wait loop (code is stripped):
> >
> >  while ((ret = __sys_read(fd, buf, nbytes)) < 0) {
> >      if (errno != EWOULDBLOCK)
> >          break;
> >  }
> >
> > So the reading thread actively waits for read completion.
> 
> Nice example.  Crystal clear - and all of this is happening under the
> covers, so to speak, so when you port 'Standard Unix' code, you never see
> the differences except in the CPU usage...
> 
> > This is very important mainly for the interaction with the /dev/bpfxxx
> > device driver (the pcap library uses this device driver). The behavior of
> > this driver is special (unlike regular files, pipes or sockets) in many
> > cases. One (important) difference is: the O_NONBLOCK in open() is ignored,
> > and nonblocking mode must be set using ioctl(). So if read() is executed
> > on a bpf device, it simply blocks all threads within the process until the
> > read syscall returns back into user space.
> 
> I really don't understand /dev/bpfN.  Other than a lot of complaints about
> 'not enough' devices, there's not much information.
> 
> As a guess, this is a 'fake' device which allows you to read only those
> packets which pass the filter expression.  So conceptually there's a
> /dev/sis0 (or whatever the NIC is called) that you can - in userland with
> the right permissions - read all packets from, and a /dev/bpfN that - also
> in userland - you can read only those that pass the filter. This means only
> the filter program has to be attached to the kernel, instead of the entire
> libpcap.
> 
> Right or Wrong???

It's only partially right. A standard network driver (/dev/sis0) doesn't have a 
data interface available to the user. You cannot open(), close(), read() or 
write() from/to it. Instead, the data interface is connected to the socket 
network layer inside the kernel.

Unlike that, /dev/bpfN implements a filtered "sniffer/injector", which can be 
virtually connected to the wire. Moreover, it's possible to run multiple 
"clients" connected to the same wire.


> 
> 
> Interestingly, with the hints from your work, I was able to find this
> http://www.ethereal.com/lists/ethereal-dev/199901/threads.html#00014, esp.
> http://www.ethereal.com/lists/ethereal-dev/199901/msg00050.html which is a
> not totally coherent discussion of the state of threads a few years ago,
> userland vs. kernel threads, etc.  What makes it interesting is that they
> discuss the wrappers and make it seem that select() working at all with
> /dev/bpfN is more accidental than purposeful.
> 

Yes - I read this too; there are many references there to 
"select() doesn't work on *BSD systems", but that's not true on FreeBSD at all.


> > Unfortunately, ntop uses pcap_dispatch() in the main packet reading
> > thread. And pcap_dispatch() is implemented using this code (pcap-bpf.c):
> >     if (p->cc == 0) {
> >             cc = read(p->fd, (char *)p->buffer, p->bufsize);
> 
> So shouldn't this be wrapped, perhaps with the internal version of
> sched_yield()???  I mean isn't this really the libpcap bug we're fighting??
> If they properly converted this, it would return EAGAIN and you could wrap
> it thusly:
> 
>       if (p->cc == 0) {
>               do {cc = read(p->fd, (char *)p->buffer, p->bufsize);
>                               if (cc != EAGAIN) break;
>                              sched_yield()
>                          };
> 
> ????
> 

libpcap is still only a single-threaded library; anything like this needs 2 separate
libraries (like libc and libc_r). IMHO, this is not the right way.


> > so this call blocks all other ntop threads (including the web server in
> > select()), until the bpf device returns a filled buffer back to ntop. And
> > this can be a very long time on a lightly loaded network.
> 
> Right... I understand it now.  Explains why the web server seems to come up
> after a long while and even can be snappy.  Also explains why Stanley saw
> some machines work and others hang - it's roughly dependent on enough
> traffic to keep this from 'hanging'.
> 
> And worse than that, the ntop code calls
> sched_yield() in many places -> so the main packet thread is regularly
> scheduled - and the block occurs again and again.
> 
> Maybe / maybe not.  Just because you give up the CPU doesn't mean you don't
> get it back - there's no guarantee.  In fact, depending on whether
> sched_yield() is wrapped (so the remainder of the slice goes to another
> user-land thread) or not (returns to the kernel), you could see vastly
> different results.  This probably explains why the --disable-schedyield flag
> helps.
> 
> We also don't call it 'a lot of places', they're mostly in the middle of the
> purge process so that we don't starve everything else.  There's also a set of
> calls when we queue a packet instead of processing it directly and we then
> sched_yield() so maybe dequeue gets a shot right then instead of attempting
> to process the next packet and blocking.
> 

Ohh, no, no. It's a misunderstanding :) The sched_yield() usage is perfectly 
valid if threading works. It only magnifies the problem with the pcap_dispatch()
block on FreeBSD. The thread with pcap_dispatch() is scheduled more times -> 
more hangs.


> > The hang is hard (read never returns) if you have no traffic on the
> > network (I'm not sure about signal delivery here). But, because ntop uses
> > pcap_open_live() with a timeout, pcap_dispatch() returns at regular
> > intervals, allowing slow processing of the other threads.
> 
> Signals would be a POSIX issue - there's no guarantee which thread gets
> them.  Practically, from your description, they all go to this thread
> controller process and which one IT picks to wake up isn't documented.  It
> could be one we want, or it could just go back to sleep...
> 
> > Unfortunately, this has one more issue (probably FreeBSD specific).
> 
> Really, I think ALL of this is FreeBSD specific (or at least userland
> threads OS specific).  Since we don't run under NetBSD except single
> threaded and we have only one user under OpenBSD - whom I haven't heard from
> since a last 'it starts up and works (for a little while)' report, for ntop,
> userland threads == FreeBSD.

I disagree here -> OpenBSD has userland pthreads too.
Remember, the "extreme" slowdown is exhibited only if ntop is running in daemon
mode (it's due to the "feature/bug" below). I noticed this problem only after 2
months of ntop usage, for example. The 100 ms timeout (if ntop is not running as
a daemon) is fast enough for the web - but not for packet processing - so the
only exhibition of the problem is a higher packet drop rate.


> 
> > The pcap_open_live() function uses BIOCSRTIMEOUT ioctl to pass timeout
> > value down to bpf driver.
> > But the man page for bpf has this sentence:
> >    BIOCGRTIMEOUT  (struct timeval)
> >                     Set or get the read timeout parameter.
> >                     The argument specifies the length of time to
> > wait before
> >                     timing out on a read request.  This parameter
> > is initial-
> >                     ized to zero by open(2), indicating no timeout.
> >
> >
> > Note -> "This parameter is initialized to zero by open(2),
> > indicating no timeout".
> 
> Here's where I think there is a bug in libpcap or in the thread wrapper.
> 
> Either the wrapped calls library or libpcap itself should be CONSISTENTLY
> converting these blocking calls into non-blocking + polling calls.  if it's
> documented that the parameter isn't honored, then the #if FREEBSD code
> should be in one of those two libraries, not expecting our program to make a
> special call for this one variant case...

The question is whether this is possible - the pthread library knows nothing about
/dev/bpfN, and libpcap, at compile time, has zero knowledge about threading.



> 
> WHY IS THIS ONE CALL DIFFERENT THAN ALL THE OTHERS???  If there's a reason
> for it, then it should be 'well known' and the libpcap code should make the
> ioctl() call.  If there's no good reason for it, then the thread wrapper
> should be fixed.  Either way, it's not ntop's bug, although we are going to
> have to code around it, just like the PR 53515 crud...
> 
> > And because fork() uses dup2() for file descriptor cloning, and dup2()
> > on FreeBSD uses open(), fork() also clears the timeout value. This
> > explains why the order of fork() and pcap_open_live() is important - in
> > one case pcap_dispatch() blocks in the kernel without a timeout until it
> > gets data, in the other case pcap_dispatch() has a 100ms timeout - so the
> > other threads can run.
> 
> This is fun.  The usual answer is that "POSIX threads and fork() don't
> co-exist" (the fact that everyone does it is conveniently forgotten).  

It's another long story :)
In the current environment, IMHO, "POSIX threads and fork()" is a legal 
combination if:
 - the pthread library supports the pthread_atfork(), pthread_atexit().. calls
 - all used multithreaded libraries support it

Unfortunately, that's not the case on FreeBSD.


> But
> if I'm understanding the actual 'hang', you should be able to construct a
> small failing program without a single fork() call.  Just libpcap+POSIX.
> That might get attention!

Yes, true -> libpcap+POSIX hangs (just don't enable the timeout in pcap_open_live()).

> 
> > But the bug is there in all cases - using a blocking read() in a
> > user-mode pthread library is simply prohibited.
> >
> > Proposed solution:
> >
> > All changes are in pcapDispatch():
> >  - Use ioctl(myGlobals.device[i].pcapPtr ->fd, BIOCSRTIMEOUT, ...) to
> >    restore the timeout value.
> >  - Always set nonblocking mode for pcap
> 
> Why? The existing --set-pcap-nonblock seems to work just fine.  It sounds
> like we should just force the switch set on if it's freebsd...
> 
> What does adding ioctl() do?  It's got to be a poll() type loop anyway - at
> least my version uses nanosleep() so it doesn't peg cpu usage to 100%!

Yep, nanosleep() is properly wrapped by the pthread library - it doesn't eat 
CPU time and doesn't block other threads. 

> 
> If you want to create a patch, I guess we can benchmark the two.  I think it
> would make things clearer to me, vs. just words.

Sure, give me some time for this, please. 


Michal Meloun
_______________________________________________
Ntop-dev mailing list
[EMAIL PROTECTED]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev
