* Peter Lieven (p...@kamp.de) wrote:
> On 07.04.2015 at 21:01, Dr. David Alan Gilbert wrote:
> >* Peter Lieven (p...@kamp.de) wrote:
> >>On 07.04.2015 at 17:29, Dr. David Alan Gilbert wrote:
> >>>* Peter Lieven (p...@kamp.de) wrote:
> >>>>Hi David,
> >>>>
> >>>>On 07.04.2015 at 10:43, Dr. David Alan Gilbert wrote:
> >>>>>>>>Any particular workload or reproducer?
> >>>>>>>The workload is almost zero. I'm trying to figure out if there is a way to
> >>>>>>>trigger it.
> >>>>>>>
> >>>>>>>Maybe playing a role: the machine type is -M pc-1.2, and we set -kvmclock as a
> >>>>>>>CPU flag since kvmclock seemed to be quite buggy in 2.6.16...
> >>>>>>>
> >>>>>>>Exact cmdline is:
> >>>>>>>/usr/bin/qemu-2.2.1 -enable-kvm -M pc-1.2 -nodefaults -netdev
> >>>>>>>type=tap,id=guest2,script=no,downscript=no,ifname=tap2 -device
> >>>>>>>e1000,netdev=guest2,mac=52:54:00:ff:00:65 -drive
> >>>>>>>format=raw,file=iscsi://172.21.200.53/iqn.2001-05.com.equallogic:4-52aed6-88a7e99a4-d9e00040fdc509a3-XXX-hd0/0,if=ide,cache=writeback,aio=native
> >>>>>>> -serial null -parallel null -m 1024 -smp
> >>>>>>>2,sockets=1,cores=2,threads=1 -monitor tcp:0:4003,server,nowait -vnc
> >>>>>>>:3 -qmp tcp:0:3003,server,nowait -name 'XXX' -boot
> >>>>>>>order=c,once=dc,menu=off -drive
> >>>>>>>index=2,media=cdrom,if=ide,cache=unsafe,aio=native,readonly=on -k de
> >>>>>>>-incoming tcp:0:5003 -pidfile /var/run/qemu/vm-146.pid -mem-path
> >>>>>>>/hugepages -mem-prealloc -rtc base=utc -usb -usbdevice tablet
> >>>>>>>-no-hpet -vga cirrus -cpu qemu64,-kvmclock
> >>>>>>>
> >>>>>>>Exact kernel is:
> >>>>>>>2.6.16.46-0.12-smp (I think this is SLES10 or something).
> >>>>>>>
> >>>>>>>The machine does not hang. It seems only I/O is hanging: you can
> >>>>>>>type at the console or ping the system, but can no longer log in.
> >>>>>>>
> >>>>>>>Thank you,
> >>>>>>>Peter
> >>>>>>Interesting observation: migrating the vServer again seems to fix the
> >>>>>>problem (at least in the one case I could test just now).
> >>>>>>
> >>>>>>2.6.8-24-smp is also affected.
> >>>>>How often does it fail? You say 'sometimes'; is it 1/10 or 1/1000?
> >>>>It's more often than 1/10, I would say.
> >>>OK, that's not too bad; it's the 1/1000 cases that are really nasty to find.
> >>>In your setup, how easy would it be for you to try:
> >>> with either 2.1 or current head?
> >>> with a newer machine type?
> >>> without the cdrom?
> >>It's all possible. I can clone the system and try everything on my test
> >>systems. I hope it reproduces there.
> >Great. I think the order I would go in would be:
> > Try head - if it works we know we've already got the fix somewhere
> > Try 2.1 - if it works we know it's something we introduced between
> > 2.1 and 2.2.1
> > Try a newer machine type - because pc-1.2 probably isn't tested much
> > CDROM at the end.
>
> Update:
> - head -> not working
> - 2.1.3 -> not working
> - without CDROM -> not working
> - with head and no machine type specified -> not working
> - with -device isa-ide -> BIOS not booting harddisk
Well, at least it's consistent...

> Will now try 1.3.1 just to be sure.
>
> Any ideas how to debug the IDE state after migration and/or check if the
> issue is similar to the ATAPI IDE problem?

It's unlikely to be quite the same; most of the ATAPI problems were related
to ATAPI being quite separate and not saving much state.

The way I found the CDROM problems was to turn on most of the debugging in
the ide and bmdma code and, on a failed migration, try to see what the state
of any I/O was at the point it migrated.

One other thing to check: I've found that newer kernels recover better after
IDE problems, so on a newer guest kernel, are there any log warnings about
IDE problems, even if the guest otherwise appears happy?

Dave

> Thanks,
> Peter

-- 
Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
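The remark that a failure rate above 1/10 is "not too bad" while 1/1000 is "really nasty to find" can be made concrete with a little arithmetic. This is a minimal sketch, assuming each migration fails independently with a fixed probability p (an assumption; real hangs may depend on guest I/O timing). The chance of seeing no failure in n migrations is (1 - p)^n, so the number of runs needed to see at least one failure with a given confidence is ceil(ln(1 - confidence) / ln(1 - p)). The function name `runs_to_observe_failure` is hypothetical, for illustration only.

```python
import math


def runs_to_observe_failure(p: float, confidence: float = 0.95) -> int:
    """Smallest n such that P(at least one failure in n runs) >= confidence.

    Assumes (hypothetically) that every migration fails independently
    with the same probability p.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be in (0, 1)")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    # P(no failure in n runs) = (1 - p) ** n; solve (1 - p) ** n <= 1 - confidence.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


if __name__ == "__main__":
    for rate in (1 / 10, 1 / 1000):
        print(f"failure rate {rate:g}: {runs_to_observe_failure(rate)} migrations")
```

Under these assumptions, a 1/10 rate needs about 29 migrations for a 95% chance of hitting the bug, versus about 2995 for 1/1000. The same figure suggests how many clean migrations a candidate build (head, 2.1, another machine type) should survive before "working" is a convincing result.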