On Wed, Jul 24, 2013 at 11:42:49PM +0800, Liu Yuan wrote:
> On Wed, Jul 24, 2013 at 06:07:21PM +0900, MORITA Kazutaka wrote:
> > At Wed, 24 Jul 2013 16:28:30 +0800,
> > Liu Yuan wrote:
> > >
> > > On Wed, Jul 24, 2013 at 04:56:24PM +0900, MORITA Kazutaka wrote:
> > > > Currently, if a sheepdog server exits, all the connecting VMs need to
> > > > be restarted. This series implements a feature to reconnect the
> > > > server, and enables us to do online sheepdog upgrade and avoid
> > > > restarting VMs when sheepdog servers crash unexpectedly.
> > > >
> > >
> > > It doesn't work on my test. I tried start linux-0.2.img stored in sheepdog
> > > cluster and then
> > >
> > > 1. did some buffered writes
> > > 2. restart sheep that this QEMU VM connected to.
> > > 3. $ sync
> > >
> > > I got following error:
> > >
> > > $ ../qemu/x86_64-softmmu/qemu-system-x86_64 --enable-kvm -m 1024 -hda sheepdog:test
> > > qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
> > > qemu-system-x86_64: Failed to connect to socket: Connection refused
> > > qemu-system-x86_64: Failed to connect to socket: Connection refused
> > > qemu-system-x86_64: Failed to connect to socket: Connection refused
> > > qemu-system-x86_64: Failed to connect to socket: Connection refused
> > > qemu-system-x86_64: Failed to connect to socket: Connection refused
> > > ...repeat...
> > >
> > > QEMU version is master tip
> >
> > Your sheep daemon looks like unreachable from qemu. I tried the same
> > procedure, but couldn't reproduce it.
> >
> > Is the problem reproducible? Can you make sure that you can connect
> > to the sheep daemon from collie while the error message shows up?
> >
>
> Yesh. Well I try to repeat it with following process:
>
> 1. did some buffered write
> 2. kill the sheep
> 3. $ sync # at guest, now 'sync' hang for response
> 4. restart sheep
>
> After 4 'sync' still hangs until timeout with a message
> "hda:dma_timer_expiry: dma status == 0x21"
>
> Guest end up freeze.
>
> QEMU output is the same:
> qemu-system-x86_64: failed to get the header, Resource temporarily unavailable
> qemu-system-x86_64: Failed to connect to socket: Connection refused
> qemu-system-x86_64: Failed to connect to socket: Connection refused
> qemu-system-x86_64: Failed to connect to socket: Connection refused
> qemu-system-x86_64: Failed to connect to socket: Connection refused
>
> But notice, if I did restart sheep with guest doing nothing, your patch set
> work like a charm.

I have debugged it a bit. The problem is that at step 3, 'sync' invokes
add_aio_request() in the sheepdog driver, and add_aio_request() *succeeds*,
with the aio put on the inflight_aio_head list, *not* on the
failed_aio_head list. So in reconnect_to_sdog() we have no way to resend
the affected aio, and 'sync' waits forever.

Thanks
Yuan
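
P.S. To make the failure mode concrete, here is a small standalone model of it. This is only a sketch, not the driver code: the struct, the enum and every function name prefixed with model_ or demo_ are made up for illustration, and the resend_inflight switch is just one hypothetical direction for a fix, not something the patch set does. It shows a request whose send succeeded before the sheep died being skipped when reconnect resends only the failed requests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model; NOT the real QEMU sheepdog structures. */
enum aio_state { AIO_INFLIGHT, AIO_FAILED, AIO_DONE };

struct aio_req {
    int id;
    enum aio_state state;
};

/*
 * Models add_aio_request(): if the send succeeds the request is tracked
 * as inflight, otherwise it is tracked as failed.
 */
static void model_add_aio_request(struct aio_req *req, int send_ok)
{
    req->state = send_ok ? AIO_INFLIGHT : AIO_FAILED;
}

/*
 * Models the reconnect path. Failed requests are always resent (and then
 * complete against the restarted sheep). Inflight requests are resent only
 * if resend_inflight is non-zero. Returns how many requests never complete.
 */
static int model_reconnect(struct aio_req *reqs, size_t n, int resend_inflight)
{
    int stuck = 0;

    for (size_t i = 0; i < n; i++) {
        int resend = reqs[i].state == AIO_FAILED ||
                     (resend_inflight && reqs[i].state == AIO_INFLIGHT);

        if (resend) {
            reqs[i].state = AIO_DONE;
        }
        if (reqs[i].state != AIO_DONE) {
            stuck++; /* the reply was lost with the old connection */
        }
    }
    return stuck;
}

/* Two requests: one sent just before the sheep died, one whose send failed. */
static int demo_stuck_requests(int resend_inflight)
{
    struct aio_req reqs[2] = { { 1, AIO_DONE }, { 2, AIO_DONE } };

    model_add_aio_request(&reqs[0], 1); /* send ok, reply never arrives */
    model_add_aio_request(&reqs[1], 0); /* send failed outright */
    return model_reconnect(reqs, 2, resend_inflight);
}
```

With resend_inflight set to 0 the first request is never completed, which matches the hanging 'sync' in the guest. With it set to 1 nothing is left behind in this model.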