On Jan 28, 7:37pm, [email protected] (Tom Ivar Helbekkmo) wrote:
-- Subject: Re: NetBSD-current on amd64 with Dell PERC 4e/Di hangs under load

| Christos Zoulas <[email protected]> writes:
|
| > Can you boot with a single processor? Let's try to simplify the
| > workload.
|
| Should have thought of that myself. It's running with SMP disabled now
| (as a boot option; I haven't done anything to the BIOS configuration),
| and this is very interesting. I've got all my regular software running,
| plus a full system build with "-j 4", to make sure it's kept really
| busy, and it's showing up hangs.
|
| However: the hangs are short (5 to 10 seconds, typically, although I've
| seen almost 20 a couple of times), and occur at varying intervals,
| seemingly depending on how much disk access is going on: more often when
| more is being written to disk. Best of all, when it hangs, the system
| seems totally unresponsive, neither answering ICMP ECHOs nor echoing
| keypresses on the console, but it *is* accessing the disks! The disk
| lamps flicker, indicating that it's writing stuff, and then, presumably
| when it's gone through the outstanding writes, the machine continues to
| run other tasks.
| Here's a typical snapshot from the ping(1) I've got
| running on a window on my workstation:
|
| 64 bytes from 193.71.27.8: icmp_seq=1190 ttl=255 time=0.676274 ms
| 64 bytes from 193.71.27.8: icmp_seq=1191 ttl=255 time=0.684655 ms
| 64 bytes from 193.71.27.8: icmp_seq=1192 ttl=255 time=0.723203 ms
| 64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
| 64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
| 64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
| 64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
| 64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
| 64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
| 64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
| 64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
| 64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
| 64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
| 64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
| 64 bytes from 193.71.27.8: icmp_seq=1204 ttl=255 time=0.553925 ms
| 64 bytes from 193.71.27.8: icmp_seq=1205 ttl=255 time=0.555601 ms
|
| Now, when I'm in SMP mode, the disk lights do *not* flicker while it
| hangs, so we're dealing with a) something that causes the amr driver to
| periodically take over completely, probably while it's flushing dirty
| blocks to the disks, and b) something that causes this situation to lead
| to much longer (and possibly even permanent) hangs when running on
| multiple processors.
|
| Cool! :)
|
| I'm going to look long and hard at /sys/dev/pci/amr.c again, and see if
| I can figure out some good way to instrument it further -- but I hope
| you will try to understand why it seems to be stopping everything else
| while it's chugging through a bunch of outstanding disk operations --
| and maybe even why this would get it into such big trouble with SMP.

Excellent!
This sounds like a very interesting problem... I am being pulled
in every which direction right now, so I don't have much time to look
into it, but I'll try to do so over the weekend (look at amr.c).

Good luck!

christos
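
A side note on reading the ping trace quoted above: the round-trip times drop by roughly one second per sequence number, which is what you would expect if the echo requests piled up while the machine was unresponsive and were all answered at about the same moment. Below is a rough Python sketch that checks this. It is only an illustration: the helper names (parse_ping, find_stalls, summarize) are made up for this example, the 100 ms threshold is arbitrary, and it assumes ping was sending one request per second (the usual default, not stated in the message).

```python
import re

# A subset of the ping(1) lines from the quoted message.
SAMPLE = """\
64 bytes from 193.71.27.8: icmp_seq=1192 ttl=255 time=0.723203 ms
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
64 bytes from 193.71.27.8: icmp_seq=1204 ttl=255 time=0.553925 ms
"""

PING_LINE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def parse_ping(text):
    """Return a list of (seq, rtt_ms) tuples found in ping(1) output."""
    return [(int(seq), float(rtt)) for seq, rtt in PING_LINE.findall(text)]


def summarize(run, interval_s):
    """Summarize one run of consecutive slow replies."""
    # Estimated arrival time of each reply, measured from the moment
    # sequence 0 was sent. ASSUMPTION: one request every interval_s seconds.
    arrivals = [seq * interval_s + rtt / 1000.0 for seq, rtt in run]
    return {
        "first_seq": run[0][0],
        "last_seq": run[-1][0],
        # The oldest delayed request waited the longest, so its RTT is a
        # lower bound on how long the host was unresponsive.
        "stall_s": max(rtt for _, rtt in run) / 1000.0,
        # A small spread means the replies came back together.
        "arrival_spread_s": max(arrivals) - min(arrivals),
    }


def find_stalls(samples, threshold_ms=100.0, interval_s=1.0):
    """Group consecutive replies whose RTT exceeds threshold_ms."""
    stalls, run = [], []
    for seq, rtt in samples:
        if rtt > threshold_ms:
            run.append((seq, rtt))
        elif run:
            stalls.append(summarize(run, interval_s))
            run = []
    if run:  # trace ended in the middle of a stall
        stalls.append(summarize(run, interval_s))
    return stalls


if __name__ == "__main__":
    for stall in find_stalls(parse_ping(SAMPLE)):
        print(stall)
```

If the arrival spread comes out small compared with the stall length, that is consistent with a single stall during which nothing was processed, rather than a slow or lossy network path.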
