Eventually I stumbled on a way to keep my machines from restarting. It's not a great solution, but it saves me from having to deal with the failure on a daily basis. I think anyone else who is having this problem can do the same and it will work. Obviously this is not the right fix, but it works until we can get a real one.

First I made sure this was set in /etc/xen/xend-config.sxp:

    (dom0-cpus 0)

Then I pinned individual physical CPUs to specific domUs. Once pinned, the problem stops.

What does that mean? Well, Xen does this wacky thing where it creates virtual CPUs (VCPUs). Each domU has one of them by default (but you can have more), and Xen then schedules those VCPUs onto physical CPUs depending on need. So let's say you have four CPUs and a domU. That domU has one VCPU by default. That VCPU could be run by physical CPU 0, 1, 2, or 3, and the scheduler can move it from one to another at any moment. I read somewhere that this can be a performance hit, because of the extra context switching involved. I also read that it could cause some instability (!), so pinning the VCPUs so they don't move around seemed to solve this. The pinning does not stick across reboots, so it has to be done again if the system is rebooted, and it isn't really possible to set this in a startup script, at least I don't think so.

So how do you do this? If you look at 'xm vcpu-list' (which annoyingly isn't listed in 'xm help') you will see the CPU column populated with whichever CPU the scheduler happened to pick, and the CPU Affinity column saying 'any cpu' for every entry. This means any VCPU can be run on any physical CPU, and will be moved around depending on the scheduling. Once you pin things, each domU has a specific CPU affinity, so its VCPUs no longer 'travel' between physical CPUs, and magically the crash stops.
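
Since the pinning has to be redone after every reboot, here is a rough sketch of how the pin commands could be generated from the 'xm vcpu-list' output instead of being typed by hand. It relies only on the 'xm vcpu-pin <Domain> <VCPU> <CPUs>' syntax. The function name gen_pin_cmds and the placement policy (all Domain-0 VCPUs on CPU 0, domUs spread round-robin over the remaining CPUs) are assumptions made for illustration, and the number of physical CPUs has to be passed in by hand. It only prints commands, so nothing is changed until the output is reviewed and run.

```shell
#!/bin/sh
# Sketch: turn 'xm vcpu-list' output (read on stdin) into 'xm vcpu-pin'
# commands (written to stdout). Nothing is executed here.
#
# Assumed policy, for illustration only:
#   - every Domain-0 (ID 0) VCPU is pinned to physical CPU 0
#   - domUs are spread round-robin over CPUs 1..NCPUS-1
# gen_pin_cmds is a hypothetical helper name, not part of xm.
gen_pin_cmds() {
    ncpus="$1"   # number of physical CPUs, supplied by the caller
    awk -v ncpus="$ncpus" '
        NR == 1 { next }    # skip the "Name ID VCPU ..." header row
        $2 == 0 {           # Domain-0: always CPU 0
            printf "xm vcpu-pin %s %s 0\n", $2, $3
            next
        }
        {
            # Pick a CPU the first time a domain ID is seen, so that
            # all VCPUs of one domU end up on the same physical CPU.
            if (!($2 in cpu)) {
                cpu[$2] = (ncpus > 1) ? 1 + (n % (ncpus - 1)) : 0
                n++
            }
            printf "xm vcpu-pin %s %s %s\n", $2, $3, cpu[$2]
        }'
}

# Dry run, prints the commands for review:
#   xm vcpu-list | gen_pin_cmds 4
# Apply them once they look right:
#   xm vcpu-list | gen_pin_cmds 4 | sh
```

On a four-CPU box with one dom0 and three single-VCPU domUs this should produce one 'xm vcpu-pin' line per VCPU, with dom0 on CPU 0 and the domUs on CPUs 1, 2 and 3.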

So an example:

    r...@shoveler:~# xm vcpu-list
    Name        ID  VCPU  CPU  State  Time(s)   CPU Affinity
    Domain-0     0     0    1  -b-    283688.8  any cpu
    Domain-0     0     1    1  ---     39666.3  any cpu
    Domain-0     0     2    1  r--     49224.4  any cpu
    Domain-0     0     3    1  -b-     75591.1  any cpu
    kite         1     0    3  -b-     71411.8  any cpu
    murrelet     2     0    0  -b-    472222.2  any cpu
    test         3     0    0  r--    342182.3  any cpu

So we want to fix that final column using 'xm vcpu-pin' (also a command not listed in 'xm help'):

    Usage: xm vcpu-pin <Domain> <VCPU|all> <CPUs|all>

    Set which CPUs a VCPU can use.

    r...@shoveler:~# xm vcpu-pin 0 0 0
    r...@shoveler:~# xm vcpu-pin 0 1 0
    r...@shoveler:~# xm vcpu-pin 0 2 0
    r...@shoveler:~# xm vcpu-pin 0 3 0
    r...@shoveler:~# xm vcpu-pin 1 0 1
    r...@shoveler:~# xm vcpu-pin 2 0 2
    r...@shoveler:~# xm vcpu-pin 3 0 3
    r...@shoveler:~# xm vcpu-list
    Name        ID  VCPU  CPU  State  Time(s)   CPU Affinity
    Domain-0     0     0    1  -b-    283700.3  0
    Domain-0     0     1    1  r--     39669.6  0
    Domain-0     0     2    1  -b-     49227.4  0
    Domain-0     0     3    1  -b-     75596.2  0
    kite         1     0    3  -b-     71415.3  1
    murrelet     2     0    0  -b-    472237.8  2
    test         3     0    0  r--    342182.3  3

And voila, no more crashes... :P

micah