On Mon, 11/09 17:29, Kevin Wolf wrote: > Am 09.11.2015 um 17:18 hat Paolo Bonzini geschrieben: > > > > > > On 09/11/2015 17:04, Kevin Wolf wrote: > > > Am 06.11.2015 um 11:22 hat Fam Zheng geschrieben: > > >> The "pnum < nb_sectors" condition in deciding whether to actually copy > > >> data is unnecessarily strict, and the qiov initialization is > > >> unnecessarily too, for both bdrv_aio_write_zeroes and bdrv_aio_discard > > >> branches. > > >> > > >> Reorganize mirror_iteration flow so that we: > > >> > > >> 1) Find the contiguous zero/discarded sectors with > > >> bdrv_get_block_status_above() before deciding what to do. We query > > >> s->buf_size sized blocks at a time. > > >> > > >> 2) If the sectors in question are zeroed/discarded and aligned to > > >> target cluster, issue zero write or discard accordingly. It's done > > >> in mirror_do_zero_or_discard, where we don't add buffer to qiov. > > >> > > >> 3) Otherwise, do the same loop as before in mirror_do_read. > > >> > > >> Signed-off-by: Fam Zheng <f...@redhat.com> > > > > > > I'm not sure where in the patch to comment on this, so I'll just do it > > > here right in the beginning. > > > > > > I'm concerned that we need to be more careful about races in this patch, > > > in particular regarding the bitmaps. I think the conditions for the two > > > bitmaps are: > > > > > > * Dirty bitmap: We must clear the bit after finding the next piece of > > > data to be mirrored, but before we yield after getting information > > > that we use for the decision which kind of operation we need. > > > > > > In other words, we need to clear the dirty bitmap bit before calling > > > bdrv_get_block_status_above(), because that's both the function that > > > retrieves information about the next chunk and also a function that > > > can yield. > > > > > > If after this point the data is written to, we need to mirror it > > > again. > > > > With Fam's patch, that's not trivial for two reasons: > > > > 1) bdrv_get_block_status_above() can return a smaller amount than what > > is asked. > > > > 2) the "read and write" case can handle s->granularity sectors per > > iteration (many of them can be coalesced, but still that's how the > > iteration works). > > > > The simplest solution is to perform the query with s->granularity size > > rather than s->buf_size. > > Then we end up with many small operations, that's not what we want. > > Why can't we mark up to s->buf_size dirty clusters as clean first, then > query the status, and mark all of those that we can't handle dirty > again?
Then we may end up marking more clusters as dirty than it should be. Because all bdrv_set_dirty() and bdrv_set_dirty_bitmap() callers are coroutine, we can introduce a CoMutex to let bitmap reader block bdrv_set_dirty and bdrv_set_dirty_bitmap. Fam > > Kevin > > > > * In-flight bitmap: We need to make sure that we never mirror the same > > > data twice at the same time as older data could overwrite newer data > > > then. > > > > > > Strictly speaking, it looks to me as if this meant we can delay > > > setting the bit until before we issue an AIO operation. It might be > > > more obviously correct to set it at the same time as the dirty bitmap > > > is cleared. > > > >