Michal - a nice piece of work. I think you've finally put all the pieces
together.  I do think, however, that there are some bugs which should be
reported to the various maintainers.  And I'm still not happy with your
proposed solution.

See my notes in-line...

Where the heck did you find the documentation for all of this?  I've looked
numerous times and never came close...  I guess I don't know the magic words
to search for on the BSD sites...

A++

-----Burton


> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Behalf Of Meloun Michal
> Sent: Thursday, February 05, 2004 6:58 AM
> To: [EMAIL PROTECTED]
> Subject: [Ntop-dev] FreeBSD and pthreads


> Hi all,
> First, please, excuse my English - may English skill is vary bad, I know.

Nah, your English is fine.  MUCH, much better than my zero words of Czech.

If you really want, email me offline and I'll do an edit for you, but it's
really just the occasional missing adjective/adverb that we native speakers
(oh, bother with that 'Queen's English' nonsense, all you Brits) had drummed
into us over 12-20 years of schooling.  Stuff like "I probably found real
explanation" should be "I have probably found the real explanation".  But
don't sweat it - it certainly doesn't make the information any less useful
or any less correct!

> I probably found real explanation for FreeBSD locking problem -
> the problem
> is common for all systems with user-mode pthread library - and
> unfortunately,
> it's not a bug - all works as declared (plus/minus). Also, I must say, my
> previous trace to pcap_open_live() is not in direct relation with
> root of this
> problem. More that, I can (probably) also explain high packet
> drops and high
> CPU usage, with are seen on *BSD platforms. Both issues are only
> manifestation
> of same "base" problem.

> The main cause is user-mode pthread library ant its interaction with bpf
> interface.

<rant>I've read and re-read this and it still, IMHO, comes down to the fact
that xBSD has user-land threads (whether that's just FreeBSD, OpenBSD,
NetBSD all or some of them is irrelevant) and the way libpcap uses the
/dev/bpfN stuff is defective.  That's a libpcap problem and should be
reported as such - the whole dang idea of using libpcap is to hide all of
the device and OS specific issues from ntop!  libpcap already has a
specialized freebsd module, why doesn't the fix go in there?????</rant>

> Some theory first:
> User-mode pthread library implement thread in user space of
> process, without
> any special support from kernel. The kernel has no knowledge about user
> threads.  All thread operation (context switching, thread scheduling etc)
> are made in user mode library. This implementation has some major issues,
> but only one is for us.

So to the kernel, there is ONE thread.  It's either in run state or blocked.
Why it's blocked isn't important (to the kernel).

This means that all of the POSIX stuff is simulated.  Which is perfectly
acceptable to the POSIX standard - it's pretty loose about the specifics of
which thread gets the signal and who gets woken when - having a control
thread is perfectly legal.

> If any of user threads blocks in syscall, then whole application
> (all threads)
> blocks too. This is very important issue. To avoid this, the
> pthreads library
> contains wrappers over all blocking syscalls, and implements some
> mechanism
> which converts (potential) blocking syscalls into nonblocking
> variants.  But,
> this mechanism is only workaround in most cases. The only "ready for use"
> blocking syscalls are nanosleep() and select(), nothing more. This two
> syscalls are implemented properly (doesn't block in kernel and
> doesn't eat CPU
> actively) The all other blocking syscals are converted into non-blocking
> variants, witch waits in active cycle for completion - so thread
> consumes 100%
> of CPU time), or passed directly to kernel (if nonblocking
> variant not exist)
> - so syscall block all other threads until return from kernel.

What you call an 'active' wait is typically referred to as polling.
Google define:polling has this about 1/3 of the way down:

The process of repeatedly testing until a condition becomes true. Polling
can be inefficient if the time between tests is short compared with the time
it will take for the condition to become true. A polling thread should sleep
between consecutive tests in order to give other threads a chance to run. An
alternative approach to polling is to arrange for an interrupt to be sent
when the condition is true, or to use the wait and notify mechanism
associated with threads.
www.cs.ukc.ac.uk/people/staff/djb/book/glossary.html

But returning to your analysis, most calls are wrapped to convert them into
polling calls.  OK, that certainly explains the CPU usage (and incidentally
why changing select() to poll() didn't change anything in my tests).

> Allow me explain it on standard file operations.
>
> The code:
>
> fd = open(...);
> read(fd, ...);
>
> executed in user-mode pthread library context  works little differently:
>
> fd = open(...) is called with O_NONBLOCK flag  added, so all
> operation on this
> fd are nonblocking ( flag is added in uthread_fd.c, using
> __sys_fcntl(fd, F_SETFL, entry->flags | O_NONBLOCK);).
>
> And the read(fd,..) is converted into active wait loop (code is striped) :
>
>  while ((ret = __sys_read(fd, buf, nbytes)) < 0) {
>    if (errno != EWOULDBLOCK)
>     break;
> }
>
> So the reading thread actively waits for read completion.

Nice example.  Crystal clear - and all of this is happening under the
covers, so to speak, so when you port 'Standard Unix' code, you never see
the differences except in the CPU usage...

> This is very important mainly with interaction with /dev/bpfxxx
> device driver
> (pcap library uses this device driver). The behavior of this
> drive is special
> (other that regular files, pipe or sockets) in many cases. One difference
> (important) is: the O_NONBLOCK in open is ignored, and
> nonblocking mode must
> be set using ioctl(). So if read() is executed on bpf device then simply
> block all threads within process until read syscall returns back into
> user space.

I really don't understand /dev/bpfN.  Other than a lot of complaints about
'not enough' devices, there's not much information.

As a guess, this is a 'fake' device which allows you to read only those
packets which pass the filter expression.  So conceptually there's a
/dev/sis0 (or whatever the NIC is called) that you can - in userland with
the right permissions - read all packets from, and a /dev/bpfN that - also
in userland - you can read only those that pass the filter.  This means only
the filter program has to be attached to the kernel, instead of the entire
libpcap.

Right or Wrong???


Interestingly, with the hints from your work, I was able to find this
http://www.ethereal.com/lists/ethereal-dev/199901/threads.html#00014, esp.
http://www.ethereal.com/lists/ethereal-dev/199901/msg00050.html which is a
not totally coherent discussion of the state of threads a few years ago,
userland vs. kernel threads, etc.  What makes it interesting is that they
discuss the wrappers and make it seem that select() working at all with
/dev/bpfN is more accidental than purposeful.

> Unfortunately,  ntop uses pcap_dispatch() in main packet reading thread.
> And pcap_dispatch() is implemented using this code (pcap-bpf.c):
>       if (p->cc == 0) {
>               cc = read(p->fd, (char *)p->buffer, p->bufsize);

So shouldn't this be wrapped, perhaps with the internal version of
sched_yield()???  I mean, isn't this really the libpcap bug we're fighting??
If they properly converted this, read() would fail with errno set to EAGAIN
and you could wrap it thusly:

        if (p->cc == 0) {
                do {
                        cc = read(p->fd, (char *)p->buffer, p->bufsize);
                        if (cc >= 0 || errno != EAGAIN)
                                break;
                        sched_yield();
                } while (1);
        }

????

> so then this call blocks all other ntop threads (including web
> server in select),
> until bpf device return filled buffer back to ntop. And this can be very
> long time, on lightly loaded network.

Right... I understand it now.  It explains why the web server seems to come
up only after a long while, yet can even be snappy.  It also explains why
Stanley saw some machines work and others hang - it's roughly dependent on
having enough traffic to keep this from 'hanging'.

>And worse that,  ntop code calls
> sched_yield() at many places -> so then main packed thread is regulary
> scheduled - and block occurs again and again.

Maybe / maybe not.  Just because you give up the CPU doesn't mean you don't
get it back - there's no guarantee.  In fact, depending on whether
sched_yield() is wrapped (so the remainder of the slice goes to another
user-land thread) or not (returns to the kernel), you could see vastly
different results.  This probably explains why the --disable-schedyield flag
helps.

We also don't call it in 'a lot of places'; the calls are mostly in the
middle of the purge process so that we don't starve everything else.  There's
also a set of calls when we queue a packet instead of processing it directly;
we then sched_yield() so maybe the dequeue gets a shot right then instead of
our attempting to process the next packet and blocking.

> The hang is hard (read never returns) if y have no traffic on network
> (im no sure about signals delivery here). But, because ntop uses
> pcap_open_live()
> with timeout, then  pcap_dispatch() returns at regular intervals,
> allowing
> slowly processing of other threads.

Signals would be a POSIX issue - there's no guarantee which thread gets
them.  Practically, from your description, they all go to this thread
controller process and which one IT picks to wake up isn't documented.  It
could be one we want, or it could just go back to sleep...

> Unfortunately, this have next one issue (probably FreeBSD specific).

Really, I think ALL of this is FreeBSD specific (or at least specific to
OSes with userland threads).  Since we don't run under NetBSD except single
threaded, and we have only one user under OpenBSD - whom I haven't heard from
since his last 'it starts up and works (for a little while)' report - for
ntop, userland threads == FreeBSD.

> The pcap_open_live() function uses BIOCSRTIMEOUT ioctl to pass timeout
> value down to bpf driver.
> But man page for bpf  have this sentence:
>    BIOCGRTIMEOUT  (struct timeval)
>                     Set or get the read timeout parameter.
>                     The argument specifies the length of time to
> wait before
>                     timing out on a read request.  This parameter
> is initial-
>                     ized to zero by open(2), indicating no timeout.
>
>
> Note -> "This parameter is initialized to zero by open(2),
> indicating no timeout".

Here's where I think there is a bug in libpcap or in the thread wrapper.

Either the wrapped-calls library or libpcap itself should be CONSISTENTLY
converting these blocking calls into non-blocking + polling calls.  If it's
documented that the parameter isn't honored, then the #if FREEBSD code
should be in one of those two libraries, not expecting our program to make a
special call for this one variant case...

WHY IS THIS ONE CALL DIFFERENT THAN ALL THE OTHERS???  If there's a reason
for it, then it should be 'well known' and the libpcap code should make the
ioctl() call.  If there's no good reason for it, then the thread wrapper
should be fixed.  Either way, it's not ntop's bug, although we are going to
have to code around it, just like the PR 53515 crud...

> And because fork() uses dup2() for file descriptor cloning, and dup2()
> on FreeBSD uses open(), then fork() also clears timeout value.
> This explain why
> is order of fork() and  pcap_open_live() important  - in one
> case, the pcap_dispatch()
> blocks in kernel until get data without timeout, in second case
> pcap_dispatch() has 100ms timeout - so other threads can runs.

This is fun.  The usual answer is that "POSIX threads and fork() don't
co-exist" (the fact that everyone does it is conveniently forgotten).  But
if I'm understanding the actual 'hang', you should be able to construct a
small failing program without a single fork() call.  Just libpcap+POSIX.
That might get attention!

> But, the bug is here in all cases - using of blocking read() in
> user-mode pthread
> library is simply prohibited.
>
> Proposed solution:
>
> All changes are in pcapDispatch()
>  - Use ioctl(myGlobals.device[i].pcapPtr ->fd, BIOCSRTIMEOUT,
> ...) for restore
>    timeout value.
>  - Always set nonblocking mode for pcap

Why?  The existing --set-pcap-nonblock seems to work just fine.  It sounds
like we should just force the switch on if it's FreeBSD...

What does adding the ioctl() do?  It's got to be a poll() type loop anyway -
at least my version uses nanosleep() so it doesn't peg CPU usage at 100%!

If you want to create a patch, I guess we can benchmark the two.  I think it
would make things clearer to me, vs. just words.

>  - And (and mainly) use select() before pcap_dispatch()

I still renew my objection - I don't know what this is going to do to ntop.
If nothing else, it makes my planned post 3.0 thread watchdog impossible,
since I would have to stop the packet capture, restart the web server and
then restart packet capture. Nixing the whole idea of being able to restart
a dead web server w/o impacting the counts we've already accumulated.

> I'm ready to answer to any addition question, or if anything
> needs be more
> detailed, simply anything.
>
> Michal Meloun

_______________________________________________
Ntop-dev mailing list
[EMAIL PROTECTED]
http://listgateway.unipi.it/mailman/listinfo/ntop-dev
