Christos Zoulas <[email protected]> writes:

> Can you boot with a single processor? Let's try to simplify the
> workload.

Should have thought of that myself. It's running with SMP disabled now (as a boot option; I haven't done anything to the BIOS configuration), and this is very interesting.

I've got all my regular software running, plus a full system build with "-j 4", to make sure it's kept really busy, and the hangs are showing up. However: the hangs are short (5 to 10 seconds, typically, although I've seen almost 20 a couple of times), and occur at varying intervals, seemingly depending on how much disk access is going on: more often when more is being written to disk.

Best of all, when it hangs, the system seems totally unresponsive, neither answering ICMP ECHOs nor echoing keypresses on the console, but it *is* accessing the disks! The disk lamps flicker, indicating that it's writing stuff, and then, presumably when it's gone through the outstanding writes, the machine continues to run other tasks.

Here's a typical snapshot from the ping(1) I've got running in a window on my workstation:

64 bytes from 193.71.27.8: icmp_seq=1190 ttl=255 time=0.676274 ms
64 bytes from 193.71.27.8: icmp_seq=1191 ttl=255 time=0.684655 ms
64 bytes from 193.71.27.8: icmp_seq=1192 ttl=255 time=0.723203 ms
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
64 bytes from 193.71.27.8: icmp_seq=1204 ttl=255 time=0.553925 ms
64 bytes from 193.71.27.8: icmp_seq=1205 ttl=255 time=0.555601 ms

Now, when I'm in SMP mode, the disk lights do *not* flicker while it hangs, so we're dealing with

a) something that causes the amr driver to periodically take over completely, probably while it's flushing dirty blocks to the disks, and

b) something that causes this situation to lead to much longer (and possibly even permanent) hangs when running on multiple processors.

Cool! :)

I'm going to look long and hard at /sys/dev/pci/amr.c again, and see if I can figure out some good way to instrument it further -- but I hope you will try to understand why it seems to be stopping everything else while it's chugging through a bunch of outstanding disk operations -- and maybe even why this would get it into such big trouble with SMP.

-tih
-- 
Popularity is the hallmark of mediocrity. --Niles Crane, "Frasier"
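
The ping trace above can also be read mechanically. The round-trip times from icmp_seq=1194 to 1202 drop by roughly 1000 ms per packet, which is what you would expect if the requests went out about one second apart (an assumption about the ping interval; the post doesn't state it) and all the replies were released at nearly the same moment once the machine woke up. Under that reading, the largest time in the run approximates how long the machine was unresponsive. Here is a minimal Python sketch for pulling such stalls out of a saved ping log. The function names and the 100 ms threshold are made up for illustration and are not part of any existing tool.

```python
import re

# Matches the fields of interest in lines such as
# "64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms"
PING_RE = re.compile(r"icmp_seq=(\d+) ttl=\d+ time=([\d.]+) ms")


def parse_ping(text):
    """Return a list of (icmp_seq, rtt_ms) tuples found in ping output."""
    return [(int(seq), float(rtt)) for seq, rtt in PING_RE.findall(text)]


def find_stalls(samples, threshold_ms=100.0):
    """Group consecutive samples whose RTT exceeds threshold_ms.

    Returns a list of (first_seq, last_seq, longest_rtt_ms) tuples.
    The threshold is an arbitrary illustrative value.
    """
    stalls = []
    current = []
    for seq, rtt in samples:
        if rtt > threshold_ms:
            current.append((seq, rtt))
        elif current:
            # A normal reply ends the current run of delayed ones.
            stalls.append((current[0][0], current[-1][0],
                           max(r for _, r in current)))
            current = []
    if current:
        # The log ended while replies were still delayed.
        stalls.append((current[0][0], current[-1][0],
                       max(r for _, r in current)))
    return stalls


# Excerpt of the trace quoted in the message.
SAMPLE = """\
64 bytes from 193.71.27.8: icmp_seq=1193 ttl=255 time=0.727393 ms
64 bytes from 193.71.27.8: icmp_seq=1194 ttl=255 time=8344.118699 ms
64 bytes from 193.71.27.8: icmp_seq=1195 ttl=255 time=7344.353790 ms
64 bytes from 193.71.27.8: icmp_seq=1196 ttl=255 time=6334.641699 ms
64 bytes from 193.71.27.8: icmp_seq=1197 ttl=255 time=5335.350267 ms
64 bytes from 193.71.27.8: icmp_seq=1198 ttl=255 time=4335.631450 ms
64 bytes from 193.71.27.8: icmp_seq=1199 ttl=255 time=3335.894195 ms
64 bytes from 193.71.27.8: icmp_seq=1200 ttl=255 time=2335.999395 ms
64 bytes from 193.71.27.8: icmp_seq=1201 ttl=255 time=1336.099567 ms
64 bytes from 193.71.27.8: icmp_seq=1202 ttl=255 time=336.195548 ms
64 bytes from 193.71.27.8: icmp_seq=1203 ttl=255 time=0.911755 ms
"""

if __name__ == "__main__":
    for first, last, longest in find_stalls(parse_ping(SAMPLE)):
        print(f"stall: icmp_seq {first}-{last}, about {longest / 1000:.1f} s")
```

On the excerpt this reports a single stall covering icmp_seq 1194 through 1202, with the longest delay a bit over 8.3 seconds, in line with the 5 to 10 second hangs described in the message. Running it over a longer log alongside disk activity figures might help show whether stall length tracks the amount of outstanding writes.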
