Hi,

I have some recursive nameservers running unbound on FreeBSD 7.2-STABLE #0: Wed Sep 2 13:37:17 CEST 2009, on a bunch of HP BL460c machines (bce interfaces).
These work OK.

During the process of migrating to 8.x, I've upgraded one of these machines to 8.0-STABLE #25: Tue Mar 9 18:15:34 CET 2010. (The dates indicate approximately when the source was checked out from cvsup.hu.freebsd.org; I don't know the exact revision.)

The first problem was that the machine occasionally lost network access for a few minutes. I could log in on the console and see that the processes involved in network I/O were in the "keglim" state, but I couldn't do any network I/O. This lasted for some minutes, then everything returned to normal. I could fix this issue by raising kern.ipc.nmbclusters to 51200 (double its default value); since then I haven't seen these blackouts.

But now the machine freezes. It can run for about a day, and then it just freezes. I can't even break into the debugger by sending an NMI to it.
top says:
last pid: 92428; load averages: 0.49, 0.40, 0.38 up 0+21:13:18 07:41:43
43 processes:  2 running, 38 sleeping, 1 zombie, 2 lock
CPU:  1.3% user,  0.0% nice,  1.3% system, 26.0% interrupt, 71.3% idle
Mem: 1682M Active, 99M Inact, 227M Wired, 5444K Cache, 44M Buf, 5899M Free
Swap:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE   C   TIME   WCPU COMMAND
45011 bind         4  49    0  1734M  1722M RUN     2  37:42 22.17% unbound
  712 bind         3  44    0 70892K 19904K uwait   0  71:07  3.86% python2.6

The common factor in these freezes seems to be the high interrupt CPU time. Normally, under load, the CPU times look like this:
CPU:  3.5% user,  0.0% nice,  1.8% system,  0.4% interrupt, 94.4% idle

I could observe one "freeze" where top remained running and everything was at 0% except interrupt, which was at exactly 25% (the machine has four cores), and another where I could save the following console output:
CPU:  0.0% user,  0.0% nice,  0.2% system, 50.0% interrupt, 49.8% idle
.......(partial, broken line)....32M  2423M *udp    1  50:16 10.89% unbound
  714 bind         3  44    0 70892K 26852K uwait   3   8:41  4.69% python2.6
61004 root         1  62    0 37428K 10876K *udp    1   0:00  1.56% python
  706 root         1  44    0  2696K   624K piperd  1   0:07  0.00% readproctit

Both unbound and python accept DNS requests. It seems that when interrupt is at 25%, only unbound is in the *udp state, whereas when it is at 50%, both programs are in that state.
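
To spell out why the exact 25% and 50% figures look suspicious to me, here is a small sketch of the arithmetic. It assumes that top's CPU line is an average over all four cores, so a percentage can be converted into an equivalent number of fully busy cores. The helper name busy_cores is only for illustration, and reading the result as "one core stuck per process in *udp" is my guess, not something I have verified.

```python
def busy_cores(aggregate_percent, ncpu):
    """Convert an aggregate CPU percentage (averaged over ncpu cores)
    into the equivalent number of fully busy cores."""
    return aggregate_percent / 100.0 * ncpu


NCPU = 4  # the machine has four cores

# Interrupt percentages taken from the top output quoted above:
# normal load, the two observed freezes, and the snapshot before a freeze.
for pct in (0.4, 25.0, 50.0, 26.0):
    print("%5.1f%% interrupt is roughly %.2f cores" % (pct, busy_cores(pct, NCPU)))
```

Under that assumption, 25% corresponds to one core's worth of interrupt time and 50% to two, which would match one and two processes being in the *udp state.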
