On 07.03.2018 at 10:47, Stefan Hajnoczi wrote:
> On Wed, Mar 7, 2018 at 7:55 AM, Peter Lieven <p...@kamp.de> wrote:
>> On 06.03.2018 at 17:35, Peter Lieven wrote:
>>> On 06.03.2018 at 17:07, Stefan Hajnoczi wrote:
>>>> On Mon, Mar 05, 2018 at 02:52:16PM +0000, Dr. David Alan Gilbert wrote:
>>>>> * Peter Lieven (p...@kamp.de) wrote:
>>>>>> On 05.03.2018 at 12:45, Stefan Hajnoczi wrote:
>>>>>>> On Thu, Feb 22, 2018 at 12:13:50PM +0100, Peter Lieven wrote:
>>>>>>>> I stumbled across the MAX_INFLIGHT_IO constant that was introduced
>>>>>>>> in 2015 and was curious why 512 MB was chosen as the readahead
>>>>>>>> value. The background is that I found the source VM becomes very
>>>>>>>> unresponsive I/O-wise while the initial 512 MB are read, and it
>>>>>>>> furthermore seems to stay unresponsive if we choose a high
>>>>>>>> migration speed and have fast storage on the destination VM.
>>>>>>>>
>>>>>>>> In our environment I modified this value to 16 MB, which seems to
>>>>>>>> work much more smoothly. I wonder if we should make this a
>>>>>>>> user-configurable value, or at least define a different rate limit
>>>>>>>> for the block transfer in the bulk stage?
>>>>>>> I don't know if benchmarks were run when choosing the value. From
>>>>>>> the commit description it sounds like the main purpose was to limit
>>>>>>> the amount of memory that can be consumed.
>>>>>>>
>>>>>>> 16 MB also fulfills that criterion :), but why is the source VM
>>>>>>> more responsive with a lower value?
>>>>>>>
>>>>>>> Perhaps the issue is queue depth on the storage device - the block
>>>>>>> migration code enqueues up to 512 MB worth of reads, and guest I/O
>>>>>>> has to wait?
>>>>>> That is my guess. Especially if the destination storage is faster,
>>>>>> we basically always have 512 I/Os in flight on the source storage.
>>>>>>
>>>>>> Does anyone mind if we reduce that value to 16 MB, or do we need a
>>>>>> better mechanism?
>>>>> We've got migration parameters these days; you could connect it to
>>>>> one of those fairly easily, I think.
>>>>> Try: grep -i 'cpu[-_]throttle[-_]initial' for an example of one
>>>>> that's already there.
>>>>> Then you can set it to whatever you like.
>>>> It would be nice to solve the performance problem without adding a
>>>> tuneable.
>>>>
>>>> On the other hand, QEMU has no idea what the queue depth of the
>>>> device is. Therefore it cannot prioritize guest I/O over block
>>>> migration I/O.
>>>>
>>>> 512 parallel requests is much too high. Most parallel I/O
>>>> benchmarking is done at a queue depth of 32-64.
>>>>
>>>> I think that 16 parallel requests is a reasonable maximum for a
>>>> background job.
>>>>
>>>> We need to be clear though that the purpose of this change is
>>>> unrelated to the original 512 MB memory footprint goal. It just
>>>> happens to touch the same constant, but the goal is now to submit at
>>>> most 16 I/O requests in parallel to avoid monopolizing the I/O
>>>> device.
>>> I think we should really look at this.
>>> The variables that control whether we stay in the while loop are
>>> incremented and decremented in the following places:
>>>
>>> mig_save_device_dirty:
>>> mig_save_device_bulk:
>>>     block_mig_state.submitted++;
>>>
>>> blk_mig_read_cb:
>>>     block_mig_state.submitted--;
>>>     block_mig_state.read_done++;
>>>
>>> flush_blks:
>>>     block_mig_state.read_done--;
>>>
>>> The condition of the while loop is:
>>>
>>> ((block_mig_state.submitted + block_mig_state.read_done) * BLOCK_SIZE <
>>>      qemu_file_get_rate_limit(f) &&
>>>  (block_mig_state.submitted + block_mig_state.read_done) <
>>>      MAX_INFLIGHT_IO)
>>>
>>> First, I wonder whether we ever reach the rate limit at all, because
>>> we put the read buffers onto f AFTER we exit the while loop?
>>>
>>> And even if we reach the limit, we constantly maintain 512 I/Os in
>>> parallel, because we immediately decrement read_done when we put the
>>> buffers onto f in flush_blks. In the next iteration of the while loop
>>> we then read again until we have 512 in-flight I/Os.
>>>
>>> And shouldn't we have a time limit on how long we stay in the while
>>> loop? I think we artificially delay sending data to f?
>> Thinking about it for a while, I would propose the following:
>>
>> a) rename MAX_INFLIGHT_IO to MAX_IO_BUFFERS
>> b) add MAX_PARALLEL_IO with a value of 16
>> c) compare qemu_file_get_rate_limit only with block_mig_state.read_done
>>
>> This would yield the following condition for the while loop:
>>
>> (block_mig_state.read_done * BLOCK_SIZE < qemu_file_get_rate_limit(f) &&
>>  (block_mig_state.submitted + block_mig_state.read_done) < MAX_IO_BUFFERS &&
>>  block_mig_state.submitted < MAX_PARALLEL_IO)
>>
>> Does that sound like a plan?
> That sounds good to me.
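To make the steady-state behaviour described above concrete, here is a
toy standalone simulation of the old loop condition. This is an
illustration only, not the actual QEMU code: it assumes a 1 MB
BLOCK_SIZE (consistent with 512 buffers corresponding to the 512 MB
mentioned above), an effectively unlimited rate limit as with a high
migration speed, and storage fast enough that every read completes
before the next loop iteration.

    #include <stdio.h>

    #define BLOCK_SIZE       (1 << 20)   /* assumed 1 MB chunk: 512 buffers = 512 MB */
    #define MAX_INFLIGHT_IO  512

    static int submitted;   /* reads issued, completion pending      */
    static int read_done;   /* reads completed, not yet flushed to f */

    int main(void)
    {
        long long rate_limit = 1LL << 60;   /* "high migration speed": never hit */

        for (int iter = 0; iter < 3; iter++) {
            /* the old while condition */
            while ((long long)(submitted + read_done) * BLOCK_SIZE < rate_limit &&
                   submitted + read_done < MAX_INFLIGHT_IO) {
                submitted++;                /* mig_save_device_bulk() */
            }
            printf("iteration %d: %d reads in flight on the source device\n",
                   iter, submitted);
            read_done += submitted;         /* blk_mig_read_cb(): fast storage
                                               completes everything */
            submitted = 0;
            read_done = 0;                  /* flush_blks() drains read_done
                                               immediately */
        }
        return 0;
    }

It prints 512 in-flight reads in every iteration: because flush_blks
immediately drains read_done, nothing ever holds the next refill back.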
I will prepare patches for this. A rough sketch of the proposed
condition follows below.
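Roughly along these lines (untested sketch, not the actual QEMU source:
the state is cut down to the two counters discussed above, and
blk_mig_may_submit() is just an illustrative helper name for the new
while condition):

    #include <stdbool.h>
    #include <stdint.h>

    #define BLOCK_SIZE       (1 << 20)   /* assumed 1 MB chunk size */
    #define MAX_IO_BUFFERS   512         /* renamed from MAX_INFLIGHT_IO:
                                            caps buffered memory */
    #define MAX_PARALLEL_IO  16          /* new: caps reads in flight on the
                                            source device */

    typedef struct {
        int submitted;   /* reads issued, completion pending      */
        int read_done;   /* reads completed, not yet flushed to f */
    } BlkMigState;

    /* Proposed condition: keep reading while
     *  - only completed-but-unsent data counts against the rate limit,
     *  - in-flight plus completed buffers stay below MAX_IO_BUFFERS,
     *  - at most MAX_PARALLEL_IO reads are in flight at once. */
    static bool blk_mig_may_submit(const BlkMigState *s, int64_t rate_limit)
    {
        return (int64_t)s->read_done * BLOCK_SIZE < rate_limit &&
               s->submitted + s->read_done < MAX_IO_BUFFERS &&
               s->submitted < MAX_PARALLEL_IO;
    }

This keeps the memory cap and the device-queue cap as separate limits,
matching Stefan's point that the 512 MB footprint goal and the
parallelism goal are unrelated.

Peter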