On 24.03.2014 10:18, Fam Zheng wrote:
> On Fri, 03/21 12:49, Peter Lieven wrote:
>> this patch introduces a new flag to indicate that we are going to sequentially
>> read from a file and do not plan to reread/reuse the data after it has been
>> read.
>>
>> The current use of this flag is to open the source(s) of a qemu-img convert
>> process. If a protocol from block/raw-posix.c is used posix_fadvise is utilized
>> to advise to the kernel that we are going to read sequentially from the
>> file and a POSIX_FADV_DONTNEED advise is issued after each write to indicate
>> that there is no advantage keeping the blocks in the buffers.
>>
>> Consider the following test case that was created to confirm the behaviour of
>> the new flag:
>>
>> A 10G logical volume was created and filled with random data.
>> Then the logical volume was exported via qemu-img convert to an iscsi target.
>> Before the export was started all caches of the linux kernel where dropped.
>>
>> Old behavior:
>> - The convert process took 3m45s and the buffer cache grew up to 9.67 GB close
>>   to the end of the conversion. After qemu-img terminated all the buffers were
>>   freed by the kernel.
>>
>> New behavior with the -N switch:
>> - The convert process took 3m43s and the buffer cache grew up to 15.48 MB close
>>   to the end with some small peaks up to 30 MB durine the conversion.
> s/durine/during/
>
> The patch looks OK, and I have no objection with this flag. But I'm still
> curious about the use case: Host page cache growing is not the real problem,
> I'm not fully persuaded by the commit message because I still don't know _what_
> useful cache would be dropped (if you don't empty the kernel cache before
> starting). I don't think all 9.67 GB buffer will be filled by data from this
> volume, so the question is how to measure the real, effective performance
> impact?

I ran this on an idle machine and indeed all of the 9.67 GB were buffered from the
qemu-img process. The problem is that the growing buffers eventually evict other
pages from the cache.

As for sharing: if a vServer's drive sits on an LVM logical volume and you take a
snapshot of it, then fadvise away the data read from the snapshot, I think pages
shared between the logical volume and its snapshot are dropped as well. However,
this all depends on how it is handled internally; maybe Markus has more evidence.
I personally would always disable the cache entirely for my vServers' hard drives.

In general I am happy with having a switch for this, just in case there are side
effects we don't see at this point.
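As for measuring which pages actually sit in the cache, here is a minimal
standalone sketch (not part of the patch, error handling kept short) that maps a
file and asks mincore() how much of it is resident, the way fincore-style tools do:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <file>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    struct stat st;
    if (fstat(fd, &st) < 0) {
        perror("fstat");
        return 1;
    }
    if (st.st_size == 0) {
        fprintf(stderr, "size not reported (empty file or block device?)\n");
        return 1;
    }
    long psz = sysconf(_SC_PAGESIZE);
    size_t pages = (st.st_size + psz - 1) / psz;
    /* PROT_NONE is enough: we only query residency, we never touch the pages. */
    void *map = mmap(NULL, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    unsigned char *vec = malloc(pages);
    if (!vec || mincore(map, st.st_size, vec) < 0) {
        perror("mincore");
        return 1;
    }
    size_t resident = 0;
    for (size_t i = 0; i < pages; i++) {
        resident += vec[i] & 1;   /* bit 0 set = page is in the page cache */
    }
    printf("%zu of %zu pages resident\n", resident, pages);
    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}

Run before and after a convert of a regular file this gives a quick check whether
the DONTNEED advice keeps the pages out of the cache; for a block device like the
LV in the test the size would have to come from the BLKGETSIZE64 ioctl instead of
fstat.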
>
>> Signed-off-by: Peter Lieven <p...@kamp.de>
>> ---
>> v1->v2: - added test example to commit msg
>>         - added -N knob to qemu-img
>>
>>  block/raw-posix.c     |   14 ++++++++++++++
>>  include/block/block.h |    1 +
>>  qemu-img-cmds.hx      |    4 ++--
>>  qemu-img.c            |   16 +++++++++++++---
>>  qemu-img.texi         |    9 ++++++++-
>>  5 files changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/block/raw-posix.c b/block/raw-posix.c
>> index 1688e16..08f7209 100644
>> --- a/block/raw-posix.c
>> +++ b/block/raw-posix.c
>> @@ -444,6 +444,13 @@ static int raw_open_common(BlockDriverState *bs, QDict *options,
>>      }
>>  #endif
>>
>> +#ifdef POSIX_FADV_SEQUENTIAL
>> +    if (bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +        !(bs->open_flags & BDRV_O_NOCACHE)) {
>> +        posix_fadvise(s->fd, 0, 0, POSIX_FADV_SEQUENTIAL);
>> +    }
>> +#endif
>> +
>>      ret = 0;
>>  fail:
>>      qemu_opts_del(opts);
>> @@ -913,6 +920,13 @@ static int aio_worker(void *arg)
>>              ret = aiocb->aio_nbytes;
>>          }
>>          if (ret == aiocb->aio_nbytes) {
>> +#ifdef POSIX_FADV_DONTNEED
>> +            if (aiocb->bs->open_flags & BDRV_O_SEQUENTIAL &&
>> +                !(aiocb->bs->open_flags & BDRV_O_NOCACHE)) {
>> +                posix_fadvise(aiocb->aio_fildes, aiocb->aio_offset,
>> +                              aiocb->aio_nbytes, POSIX_FADV_DONTNEED);
>> +            }
>> +#endif
> I'm not familiar with posix_fadvise, can we do this on the whole file in once
> in raw_open_common like POSIX_FADV_SEQUENTIAL?

We could do that, but the usage I have seen calls it on exactly the pages you want
dropped, and at least that seems to work well. As far as I understand,
POSIX_FADV_DONTNEED only drops pages that are already cached at the time of the
call and does not prevent future caching, so issuing it once at open time, before
anything has been read, would have no effect.
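For illustration, a minimal standalone sketch of the pattern (not QEMU code; the
file name "src.img" and the 1 MiB chunk size are just example values): advise
sequential access once when the file is opened, then drop each range right after
it has been read:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)

int main(void)
{
    static char buf[CHUNK];
    int fd = open("src.img", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Hint that we will read the file front to back; the kernel may
     * use a larger readahead window. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    off_t off = 0;
    ssize_t n;
    while ((n = pread(fd, buf, sizeof(buf), off)) > 0) {
        /* ... consume buf, e.g. write it to the destination ... */

        /* We will not reuse this range, so let the kernel drop its
         * pages instead of letting them displace other cached data. */
        posix_fadvise(fd, off, n, POSIX_FADV_DONTNEED);
        off += n;
    }

    close(fd);
    return 0;
}

This mirrors what the patch does in raw_open_common() and aio_worker().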
Peter