On 2017/1/13 21:41, Stefan Hajnoczi wrote:
> On Mon, Dec 05, 2016 at 04:34:59PM +0800, zhanghailiang wrote:
>> +Issue qmp command:
>> +  { 'execute': 'blockdev-add',
>> +    'arguments': {
>> +        'driver': 'replication',
>> +        'node-name': 'rep',
>> +        'mode': 'primary',
>> +        'shared-disk-id': 'primary_disk0',
>> +        'shared-disk': true,
>> +        'file': {
>> +            'driver': 'nbd',
>> +            'export': 'hidden_disk0',
>> +            'server': {
>> +                'type': 'inet',
>> +                'data': {
>> +                    'host': 'xxx.xxx.xxx.xxx',
>> +                    'port': 'yyy'
>> +                }
>> +            }
>
> block/nbd.c does not have good error handling and recovery in case
> there is a network issue. There are no reconnection attempts or
> timeouts that deal with a temporary loss of network connectivity.
>
> This is a general problem with block/nbd.c and not something to solve
> in this patch series. I'm just mentioning it because it may affect
> COLO replication.
>
> I'm sure these limitations in block/nbd.c can be fixed, but it will
> take some effort. Maybe block/sheepdog.c, net/socket.c, and other
> network code could also benefit from generic network connection
> recovery.
Hmm, good suggestion, but IMHO COLO is a little different from other
scenarios here: even if a reconnection method were implemented, it
would still need a mechanism to distinguish a temporary loss of network
connectivity from a genuinely broken connection.

I did a simple test: I ran ifconfig down on the network card used by
block replication. It seems that NBD in QEMU has no way to detect that
the connection has been broken. No error was reported, and COLO just
got stuck in vm_stop(), where it called aio_poll().

Thanks,
Hailiang
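
To make the "temporary loss versus really broken" distinction above concrete, here is a minimal sketch of one possible policy: keep retrying for a grace period, and only declare the connection broken once that period expires. This is not QEMU code and not how block/nbd.c behaves. The function name classify_outage, the try_connect callback, and the grace_seconds and retry_interval parameters are all hypothetical, chosen only for illustration. It also assumes the failure has already been noticed, which the ifconfig down test suggests is not the case today without some timeout or keepalive.

```python
import time
from typing import Callable


def classify_outage(try_connect: Callable[[], bool],
                    grace_seconds: float = 10.0,
                    retry_interval: float = 1.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Hypothetical policy: retry until a grace period expires.

    Returns "temporary" if try_connect() succeeds before the deadline,
    otherwise "broken". clock and sleep are injectable so the policy
    can be exercised without real waiting.
    """
    deadline = clock() + grace_seconds
    while True:
        if try_connect():
            # The peer came back within the grace period.
            return "temporary"
        if clock() >= deadline:
            # Grace period exhausted: treat the link as really broken,
            # e.g. so that a failover could be triggered.
            return "broken"
        sleep(retry_interval)
```

A caller would pass a try_connect callback that attempts to re-establish the connection and returns True on success. Choosing the grace period is the hard part: too short and a brief glitch triggers failover, too long and the guest stays stalled.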
Reviewed-by: Stefan Hajnoczi <[email protected]>
