On 01.11.18 13:17, Vladimir Sementsov-Ogievskiy wrote:
> 27.09.2018 20:35, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Memory allocation may become less efficient for the encrypted case.
>>> It's the price for the upcoming asynchronous scheme.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
>>> ---
>>>   block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>   1 file changed, 64 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 65e3ca00e2..5e7f2ee318 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1808,6 +1808,67 @@ out:
>>>       return ret;
>>>   }
>>>
>>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>> +                                               uint64_t file_cluster_offset,
>>> +                                               uint64_t offset,
>>> +                                               uint64_t bytes,
>>> +                                               QEMUIOVector *qiov,
>>> +                                               uint64_t qiov_offset)
>>> +{
>>> +    int ret;
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    void *crypt_buf = NULL;
>>> +    QEMUIOVector hd_qiov;
>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>> +
>>> +    if ((file_cluster_offset & 511) != 0) {
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>
>> So you're not just re-allocating the bounce buffer for every single
>> call, but also this I/O vector.  Hm.
>>
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>> +
>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>
>> 1. Why did you remove the comment?
>>
>> 2. The check for whether crypt_buf was successfully allocated is
>> missing.
>>
>> 3. Do you have any benchmarks for encrypted images?  Having to
>> allocate the bounce buffer for potentially every single cluster sounds
>> rather bad to me.
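(For concreteness, the missing check from point 2 could look roughly
like this. This is only a sketch; the "fail" label and its cleanup path
are assumptions about the rest of the patch, not code from it:)

    crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
    if (crypt_buf == NULL) {
        /* Bounce buffer allocation failed; fail the request with
         * -ENOMEM instead of dereferencing NULL later. */
        ret = -ENOMEM;
        goto fail;  /* assumed to free hd_qiov and return ret */
    }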
>
> Hmm, about this.
>
> Now we have compress threads to do several compress operations in
> parallel, and I plan to do the same for decompression, encryption and
> decryption. So we'll definitely need several cluster-size buffers to do
> all these operations. How many? The first thought is just as many as
> the maximum number of threads (or twice as many, to have in and out
> buffers for compression). But that's wrong; consider, for example,
> encryption on write:
>
> 1. get a free thread for encryption (this may yield, waiting for a
> thread to become free, if the maximum thread number has been reached)
> 2. get a buffer (per thread)
> 3. the thread handles encryption
> a. free the thread here?
> 4. write the encrypted data to the underlying file
> b. free the thread here?
>
> Of course, we want to free the thread as soon as possible, in "a.",
> not in "b.". And this means that we should free the buffer in "a." as
> well, so we would have to copy the data into a locally allocated
> buffer, or, better, just use a local buffer from the beginning, so that
> the thread does not own a buffer at all.
>
> So, what options do we have?
>
> 1. The simplest thing: just allocate buffers for each request locally,
> like it is already done for compression.
>     pros: simple
>     cons: allocating/freeing every time may hurt performance, as you
>     noticed
>
> 2. Keep a pool of such buffers, so that a freed buffer is just queued
> onto a list of free buffers (see the sketch after the results below).
>     pros:
>       reduces the number of real alloc/free calls
>       we can limit the pool's maximum size
>       we can allocate new buffers on demand instead of the whole pool
>       at start
>     cons:
>       hmm, this looks like a custom allocator. Is it really needed? Is
>       it really faster than just using the system allocator and calling
>       alloc/free every time we need a buffer?
>
> 3. Or should we just use g_slice_alloc instead of reinventing it in
> "2."?
>
> So, I've decided to run some tests, and here are the results (memcpy
> is for 32K; all the other tests allocate 64K in a loop; list_alloc is a
> simple implementation of "2."):
>
> nothing:        200000000 it / 0.301361 sec = 663655696.202532 it/s
> memcpy:           2000000 it / 1.633015 sec =   1224728.554970 it/s
> g_malloc:       200000000 it / 5.346707 sec =  37406202.540530 it/s
> g_slice_alloc:  200000000 it / 7.812036 sec =  25601520.402792 it/s
> list_alloc:     200000000 it / 5.432771 sec =  36813626.268630 it/s
> posix_memalign:  20000000 it / 1.025543 sec =  19501864.376084 it/s
> aligned_alloc:   20000000 it / 1.035639 sec =  19311752.149739 it/s
>
> So you see that, yes, list_alloc is about twice as fast as
> posix_memalign, but on the other hand a simple memcpy is a lot slower
> than any of them (16 times slower than posix_memalign), and I am not
> even talking about real disk I/O, which will be slower still. So,
> should we care about allocation at all? What do you think?
>
> The test is attached.
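The list_alloc variant measured above is essentially a fixed-size
free-list pool. A minimal sketch of the idea (with illustrative names
and no locking; this is not the attached test program):

    #include <stdlib.h>

    /* Fixed-size free-list buffer pool: freed buffers are pushed onto
     * a singly-linked list and reused, so the system allocator is only
     * called when the list is empty. */
    typedef struct BufNode {
        struct BufNode *next;
    } BufNode;

    typedef struct BufPool {
        BufNode *free_list; /* freed buffers, linked through their first bytes */
        size_t buf_size;    /* fixed buffer size, must be >= sizeof(BufNode) */
    } BufPool;

    static void *pool_alloc(BufPool *pool)
    {
        if (pool->free_list != NULL) {
            BufNode *node = pool->free_list;
            pool->free_list = node->next;
            return node;
        }
        /* Grow on demand instead of preallocating the whole pool. */
        return malloc(pool->buf_size);
    }

    static void pool_free(BufPool *pool, void *buf)
    {
        /* No real free() call: just push the buffer back onto the list. */
        BufNode *node = buf;
        node->next = pool->free_list;
        pool->free_list = node;
    }

Capping the pool's size would just mean counting the queued buffers in
pool_free() and calling free() once a limit is exceeded.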
Agreed.  Thanks for testing.

I am still a bit worried about the maximum memory usage.  With 16
workers, having bounce buffers can mean up to 32 MB of memory usage with
the default cluster size (and 1 GB with 2 MB clusters).

Then again, it should be easy to limit memory usage even without a
custom pool (just some variable that says how much memory there is
left).  Also, it'd be OK to do that on top.
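Such a limit could be as simple as a shared counter. A sketch using
POSIX threads primitives for illustration (QEMU proper would use its
coroutine primitives instead, and the 32 MB budget is just the figure
from the example above):

    #include <pthread.h>
    #include <stddef.h>

    /* Remaining bounce-buffer budget, e.g. 32 MB as computed above. */
    static size_t mem_left = 32 * 1024 * 1024;
    static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t mem_freed = PTHREAD_COND_INITIALIZER;

    /* Block until 'bytes' of the budget are available, then take them. */
    static void mem_acquire(size_t bytes)
    {
        pthread_mutex_lock(&mem_lock);
        while (mem_left < bytes) {
            pthread_cond_wait(&mem_freed, &mem_lock);
        }
        mem_left -= bytes;
        pthread_mutex_unlock(&mem_lock);
    }

    /* Return 'bytes' to the budget and wake any waiters. */
    static void mem_release(size_t bytes)
    {
        pthread_mutex_lock(&mem_lock);
        mem_left += bytes;
        pthread_cond_broadcast(&mem_freed);
        pthread_mutex_unlock(&mem_lock);
    }

A request would call mem_acquire(bytes) before allocating its bounce
buffer and mem_release(bytes) after freeing it.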
Max
