On Wed, Aug 27, 2014 at 9:43 AM, Chris Friesen
<chris.frie...@windriver.com> wrote:
> On 08/25/2014 03:50 PM, Chris Friesen wrote:
>
>> I think I might have a glimmering of what's going on. Someone please
>> correct me if I get something wrong.
>>
>> I think that VIRTIO_PCI_QUEUE_MAX doesn't really mean anything with
>> respect to max inflight operations, and neither does virtio-blk calling
>> virtio_add_queue() with a queue size of 128.
>>
>> I think what's happening is that virtio_blk_handle_output() spins,
>> pulling data off the 128-entry queue and calling
>> virtio_blk_handle_request(). At this point that queue entry can be
>> reused, so the queue size isn't really relevant.
>>
>> In virtio_blk_handle_write() we add the request to a MultiReqBuffer and
>> every 32 writes we'll call virtio_submit_multiwrite() which calls down
>> into bdrv_aio_multiwrite(). That tries to merge requests and then for
>> each resulting request calls bdrv_aio_writev() which ends up calling
>> qemu_rbd_aio_writev(), which calls rbd_start_aio().
>>
>> rbd_start_aio() allocates a buffer and converts from iovec to a single
>> buffer. This buffer stays allocated until the request is acked, which
>> is where the bulk of the memory overhead with rbd is coming from (has
>> anyone considered adding iovec support to rbd to avoid this extra copy?).
>>
>> The only limit I see in the whole call chain from
>> virtio_blk_handle_request() on down is the call to
>> bdrv_io_limits_intercept() in bdrv_co_do_writev(). However, that
>> doesn't provide any limit on the absolute number of inflight operations,
>> only on operations/sec. If the ceph server cluster can't keep up with
>> the aggregate load, then the number of inflight operations can still
>> grow indefinitely.
>>
>> Chris
>
>
> I was a bit concerned that I'd need to extend the IO throttling code to
> support a limit on total inflight bytes, but it doesn't look like that
> will be necessary.
>
> It seems that using mallopt() to set the trim/mmap thresholds to 128K is
> enough to minimize the increase in RSS and also drop it back down after
> an I/O burst. For now this looks like it should be sufficient for our
> purposes.
>
> I'm actually a bit surprised I didn't have to go lower, but it seems to
> work for both "dd" and dbench testcases so we'll give it a try.
>
> Chris
>
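For reference, the mallopt() tuning Chris describes above amounts to
something like the sketch below. The 128K figures are the ones from his
mail; wrapping them in a helper called early in main() is just an
illustration, not his actual change.

/* Sketch only: the 128K thresholds are the values from the mail above;
 * calling this once early in main() is an assumption, not the actual
 * tested change. */
#include <malloc.h>
#include <stdio.h>

static void limit_heap_growth(void)
{
    /* Trim the top of the heap back to the kernel once more than 128K
     * of free space accumulates there, instead of keeping it in RSS. */
    mallopt(M_TRIM_THRESHOLD, 128 * 1024);

    /* Serve allocations of 128K and larger via mmap() so they are fully
     * unmapped on free(), letting RSS drop back after an I/O burst. */
    mallopt(M_MMAP_THRESHOLD, 128 * 1024);
}

int main(void)
{
    limit_heap_growth();
    printf("malloc thresholds set to 128K\n");
    /* ... normal startup would continue here ... */
    return 0;
}

The same thresholds can also be set without rebuilding anything via
glibc's MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_ environment
variables, which may be more convenient for testing.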
Bumping this... For now, we occasionally suffer from an unlimited cache growth issue which can be observed on all post-1.4 versions of qemu with the rbd backend in writeback mode and a certain pattern of guest operations. The issue is confirmed for virtio and can be re-triggered by issuing an excessive number of write requests while the guest fails to collect the returned acks from the emulator's cache in a timely manner. Since most applications behave correctly, the OOM issue is very rare (and we developed an ugly workaround for such situations long ago). If anybody is interested in fixing this, I can send a prepared image for reproduction, or instructions to make one, whichever is preferable. Thanks!
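To make the shape of a possible fix concrete: what the quoted analysis
says is missing is a cap on total inflight bytes, i.e. something that
blocks new submissions once too much write data is queued towards rbd and
releases them as acks come back. Below is a standalone toy sketch of that
accounting; it is not QEMU or librbd code, all names in it are made up,
and a real patch would have to be coroutine/AIO-friendly rather than
blocking a thread.

/* Standalone illustration only: cap the total bytes in flight towards the
 * backend, blocking submitters at the limit and waking them on completion.
 * All identifiers here are hypothetical. */
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t inflight_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  inflight_cv   = PTHREAD_COND_INITIALIZER;
static size_t inflight_bytes;
static const size_t INFLIGHT_LIMIT = 32u << 20;   /* e.g. 32 MiB cap */

/* Called before handing a write to the backend (e.g. before the
 * rbd_start_aio() step in the call chain quoted above). */
void inflight_bytes_acquire(size_t len)
{
    pthread_mutex_lock(&inflight_lock);
    while (inflight_bytes + len > INFLIGHT_LIMIT)
        pthread_cond_wait(&inflight_cv, &inflight_lock);
    inflight_bytes += len;
    pthread_mutex_unlock(&inflight_lock);
}

/* Called from the completion (ack) callback. */
void inflight_bytes_release(size_t len)
{
    pthread_mutex_lock(&inflight_lock);
    inflight_bytes -= len;
    pthread_cond_broadcast(&inflight_cv);
    pthread_mutex_unlock(&inflight_lock);
}

int main(void)
{
    /* Toy usage: account for a 1 MiB write and its completion. */
    inflight_bytes_acquire(1 << 20);
    /* ... submit to the backend here ... */
    inflight_bytes_release(1 << 20);
    return 0;
}

In qemu itself this kind of accounting would presumably live near the
block throttling code that bdrv_io_limits_intercept() already hooks into,
rather than as a separate helper like this.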