On 12/27/2014 11:23 PM, Paolo Bonzini wrote:
>
>
> On 26/12/2014 04:31, Yang Hongyang wrote:
>> Please feel free to comment.
>> We want as many comments and as much feedback as possible, thanks in advance.
>
> Hi Yang,
>
> I think it's possible to build COLO block replication from many basic
> blocks that are already in QEMU.  The only new piece would be the disk
> buffer on the secondary.
>
>              virtio-blk       ||
>                  ^            ||                            .----------
>                  |            ||                            | Secondary
>             1 Quorum          ||                            '----------
>              /      \         ||
>             /        \        ||
>        Primary      2 NBD  ------->  2 NBD
>          disk       client    ||     server                  virtio-blk
>                               ||        ^                         ^
> --------.                     ||        |                         |
> Primary |                     ||  Secondary disk <--------- COLO buffer 3
> --------'                     ||                   backing
>
>
> 1) The disk on the primary is represented by a block device with two
> children, providing replication between a primary disk and the host that
> runs the secondary VM.  The read pattern patches for quorum
> (http://lists.gnu.org/archive/html/qemu-devel/2014-08/msg02381.html) can
> be used/extended to make the primary always read from the local disk
> instead of going through NBD.
>
> 2) The secondary disk receives writes from the primary VM through QEMU's
> embedded NBD server (speculative write-through).
>
> 3) The disk on the secondary is represented by a custom block device
> ("COLO buffer").  The disk buffer's backing image is the secondary disk,
> and the disk buffer uses bdrv_add_before_write_notifier to implement
> copy-on-write, similar to block/backup.c.
>
> 4) Checkpointing can use new bdrv_prepare_checkpoint and
> bdrv_do_checkpoint members in BlockDriver to discard the COLO buffer,
> similar to your patches (you did not explain why you do checkpointing in
> two steps).  Failover instead is done with bdrv_commit or can even be
> done without stopping the secondary (live commit, block/commit.c).
>
>
> The missing parts are:
>
> 1) NBD server on the backing image of the COLO buffer.  This means the
> backing image needs its own BlockBackend.  Apart from this, no new
> infrastructure is needed to receive writes on the secondary.

Backing image is always opened read-only. How can this limitation be
removed? Should we add an option to control it?

Thanks
Wen Congyang

>
> 2) Read pattern support for quorum needs to be extended for the needs of
> the COLO primary.  It may be simpler or faster to write a simple
> "replication" driver that writes to N children but always reads from the
> first.  But in any case initial tests can be done with the quorum
> driver, even without read pattern support.  Again, all the network
> infrastructure to replicate writes already exists in QEMU.
>
> 3) Of course the disk buffer itself.
>
> Paolo
>
>> Thanks,
>> Yang.
>>
>> Wen Congyang (1):
>>   PoC: Block replication for COLO
>>
>> Yang Hongyang (1):
>>   Block: Block replication design for COLO
>>
>>  block.c                   |  48 +++++++
>>  block/blkcolo.c           | 338 ++++++++++++++++++++++++++++++++++++++++++++++
>>  docs/blkcolo.txt          |  85 ++++++++++++
>>  include/block/block.h     |   6 +
>>  include/block/block_int.h |  21 +++
>>  5 files changed, 498 insertions(+)
>>  create mode 100644 block/blkcolo.c
>>  create mode 100644 docs/blkcolo.txt
>>
>
> .
>
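
For reference, the secondary-side behaviour described in points 3 and 4 of
the quoted proposal (copy-on-write into the COLO buffer, discard on
checkpoint, commit on failover) can be sketched as a toy in-memory model.
This is not QEMU code: ColoSecondary, the colo_* functions, NUM_SECTORS and
the one-byte-per-sector layout are hypothetical names and simplifications
chosen for illustration only. The proposal itself would hook the
copy-on-write step through bdrv_add_before_write_notifier rather than call
a function directly.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NUM_SECTORS 8  /* hypothetical toy disk size */

/* Toy model: one byte stands in for a whole sector. */
typedef struct {
    unsigned char disk[NUM_SECTORS]; /* secondary disk (the backing image) */
    unsigned char buf[NUM_SECTORS];  /* COLO buffer contents */
    bool in_buf[NUM_SECTORS];        /* sectors currently held by the buffer */
} ColoSecondary;

void colo_init(ColoSecondary *s)
{
    memset(s, 0, sizeof(*s));
}

/* Write from the primary, arriving over NBD, targets the secondary disk.
 * Before overwriting, the old data is saved into the buffer unless the
 * buffer already holds that sector (copy-on-write), so the secondary VM's
 * view of the disk does not change. */
void colo_primary_write(ColoSecondary *s, int sector, unsigned char data)
{
    assert(sector >= 0 && sector < NUM_SECTORS);
    if (!s->in_buf[sector]) {
        s->buf[sector] = s->disk[sector];
        s->in_buf[sector] = true;
    }
    s->disk[sector] = data;
}

/* Write from the secondary VM: lands only in the buffer. */
void colo_secondary_write(ColoSecondary *s, int sector, unsigned char data)
{
    assert(sector >= 0 && sector < NUM_SECTORS);
    s->buf[sector] = data;
    s->in_buf[sector] = true;
}

/* Read from the secondary VM: buffer first, backing image otherwise. */
unsigned char colo_secondary_read(const ColoSecondary *s, int sector)
{
    assert(sector >= 0 && sector < NUM_SECTORS);
    return s->in_buf[sector] ? s->buf[sector] : s->disk[sector];
}

/* Checkpoint: discard the buffer; the secondary VM now sees the disk as
 * written by the primary. */
void colo_checkpoint(ColoSecondary *s)
{
    memset(s->in_buf, 0, sizeof(s->in_buf));
}

/* Failover: commit the buffer into the backing image, making the secondary
 * VM's view the permanent disk content. */
void colo_failover(ColoSecondary *s)
{
    for (int i = 0; i < NUM_SECTORS; i++) {
        if (s->in_buf[i]) {
            s->disk[i] = s->buf[i];
            s->in_buf[i] = false;
        }
    }
}
```

In this model a primary write never changes what the secondary VM reads
until colo_checkpoint() runs, and colo_failover() rolls the backing image
back to the secondary VM's view, including any primary writes received
since the last checkpoint.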