On 01.11.18 13:17, Vladimir Sementsov-Ogievskiy wrote:
> 27.09.2018 20:35, Max Reitz wrote:
>> On 07.08.18 19:43, Vladimir Sementsov-Ogievskiy wrote:
>>> Memory allocation may become less efficient for the encrypted case.
>>> It's the price for the upcoming asynchronous scheme.
>>>
>>> Signed-off-by: Vladimir Sementsov-Ogievskiy <[email protected]>
>>> ---
>>>   block/qcow2.c | 114 ++++++++++++++++++++++++++++++++--------------------------
>>>   1 file changed, 64 insertions(+), 50 deletions(-)
>>>
>>> diff --git a/block/qcow2.c b/block/qcow2.c
>>> index 65e3ca00e2..5e7f2ee318 100644
>>> --- a/block/qcow2.c
>>> +++ b/block/qcow2.c
>>> @@ -1808,6 +1808,67 @@ out:
>>>       return ret;
>>>   }
>>>
>>> +static coroutine_fn int qcow2_co_preadv_normal(BlockDriverState *bs,
>>> +                                               uint64_t file_cluster_offset,
>>> +                                               uint64_t offset,
>>> +                                               uint64_t bytes,
>>> +                                               QEMUIOVector *qiov,
>>> +                                               uint64_t qiov_offset)
>>> +{
>>> +    int ret;
>>> +    BDRVQcow2State *s = bs->opaque;
>>> +    void *crypt_buf = NULL;
>>> +    QEMUIOVector hd_qiov;
>>> +    int offset_in_cluster = offset_into_cluster(s, offset);
>>> +
>>> +    if ((file_cluster_offset & 511) != 0) {
>>> +        return -EIO;
>>> +    }
>>> +
>>> +    qemu_iovec_init(&hd_qiov, qiov->niov);
>>
>> So you're not just re-allocating the bounce buffer for every single
>> call, but also this I/O vector.  Hm.
>>
>>> +    if (bs->encrypted) {
>>> +        assert(s->crypto);
>>> +        assert(bytes <= QCOW_MAX_CRYPT_CLUSTERS * s->cluster_size);
>>> +
>>> +        crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
>>
>> 1. Why did you remove the comment?
>>
>> 2. The check for whether crypt_buf was successfully allocated is
>> missing.
>>
>> 3. Do you have any benchmarks for encrypted images?  Having to
>> allocate the bounce buffer for potentially every single cluster sounds
>> rather bad to me.
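(For concreteness, the missing check from point 2 could look roughly
like this. This is only a sketch; the "fail" label and its cleanup path
are assumptions about the rest of the patch, not code from it:)

    crypt_buf = qemu_try_blockalign(bs->file->bs, bytes);
    if (crypt_buf == NULL) {
        /* Bounce buffer allocation failed; fail the request with
         * -ENOMEM instead of dereferencing NULL later. */
        ret = -ENOMEM;
        goto fail;  /* assumed to free hd_qiov and return ret */
    }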
>
> Hmm, about this.
>
> Now we have compress threads to do several compress operations in
> parallel, and I plan to do the same for decompression, encryption and
> decryption. So we'll definitely need several cluster-size buffers to do
> all these operations. How many? The first thought is just as many as
> the maximum number of threads (or twice as many, to have in and out
> buffers for compression). But that's wrong; consider, for example,
> encryption on write:
>
> 1. get a free thread for encryption (this may yield, waiting for a
> thread to become free, if the maximum thread number has been reached)
> 2. get a buffer (per thread)
> 3. the thread handles encryption
> a. free the thread here?
> 4. write the encrypted data to the underlying file
> b. free the thread here?
>
> Of course, we want to free the thread as soon as possible, in "a.",
> not in "b.". And this means that we should free the buffer in "a." as
> well, so we would have to copy the data into a locally allocated
> buffer, or, better, just use a local buffer from the beginning, so that
> the thread does not own a buffer at all.
>
> So, what options do we have?
>
> 1. The simplest thing: just allocate buffers for each request locally,
> like it is already done for compression.
>     pros: simple
>     cons: allocating/freeing every time may hurt performance, as you
>     noticed
>
> 2. Keep a pool of such buffers, so that a freed buffer is just queued
> onto a list of free buffers (see the sketch after the results below).
>     pros:
>       reduces the number of real alloc/free calls
>       we can limit the pool's maximum size
>       we can allocate new buffers on demand instead of the whole pool
>       at start
>     cons:
>       hmm, this looks like a custom allocator. Is it really needed? Is
>       it really faster than just using the system allocator and calling
>       alloc/free every time we need a buffer?
>
> 3. Or should we just use g_slice_alloc instead of reinventing it in
> "2."?
>
> So, I've decided to run some tests, and here are the results (memcpy
> is for 32K; all the other tests allocate 64K in a loop; list_alloc is a
> simple implementation of "2."):
>
> nothing:        200000000 it / 0.301361 sec = 663655696.202532 it/s
> memcpy:           2000000 it / 1.633015 sec =   1224728.554970 it/s
> g_malloc:       200000000 it / 5.346707 sec =  37406202.540530 it/s
> g_slice_alloc:  200000000 it / 7.812036 sec =  25601520.402792 it/s
> list_alloc:     200000000 it / 5.432771 sec =  36813626.268630 it/s
> posix_memalign:  20000000 it / 1.025543 sec =  19501864.376084 it/s
> aligned_alloc:   20000000 it / 1.035639 sec =  19311752.149739 it/s
>
> So you see that, yes, list_alloc is about twice as fast as
> posix_memalign, but on the other hand a simple memcpy is a lot slower
> than any of them (16 times slower than posix_memalign), and I am not
> even talking about real disk I/O, which will be slower still. So,
> should we care about allocation at all? What do you think?
>
> The test is attached.
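The list_alloc variant measured above is essentially a fixed-size
free-list pool. A minimal sketch of the idea (with illustrative names
and no locking; this is not the attached test program):

    #include <stdlib.h>

    /* Fixed-size free-list buffer pool: freed buffers are pushed onto
     * a singly-linked list and reused, so the system allocator is only
     * called when the list is empty. */
    typedef struct BufNode {
        struct BufNode *next;
    } BufNode;

    typedef struct BufPool {
        BufNode *free_list; /* freed buffers, linked through their first bytes */
        size_t buf_size;    /* fixed buffer size, must be >= sizeof(BufNode) */
    } BufPool;

    static void *pool_alloc(BufPool *pool)
    {
        if (pool->free_list != NULL) {
            BufNode *node = pool->free_list;
            pool->free_list = node->next;
            return node;
        }
        /* Grow on demand instead of preallocating the whole pool. */
        return malloc(pool->buf_size);
    }

    static void pool_free(BufPool *pool, void *buf)
    {
        /* No real free() call: just push the buffer back onto the list. */
        BufNode *node = buf;
        node->next = pool->free_list;
        pool->free_list = node;
    }

Capping the pool's size would just mean counting the queued buffers in
pool_free() and calling free() once a limit is exceeded.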
Agreed.  Thanks for testing.

I am still a bit worried about the maximum memory usage.  With 16
workers, having bounce buffers can mean up to 32 MB of memory usage with
the default cluster size (and 1 GB with 2 MB clusters).

Then again, it should be easy to limit memory usage even without a
custom pool (just some variable that says how much memory there is
left).  Also, it'd be OK to do that on top.
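Such a limit could be as simple as a shared counter. A sketch using
POSIX threads primitives for illustration (QEMU proper would use its
coroutine primitives instead, and the 32 MB budget is just the figure
from the example above):

    #include <pthread.h>
    #include <stddef.h>

    /* Remaining bounce-buffer budget, e.g. 32 MB as computed above. */
    static size_t mem_left = 32 * 1024 * 1024;
    static pthread_mutex_t mem_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t mem_freed = PTHREAD_COND_INITIALIZER;

    /* Block until 'bytes' of the budget are available, then take them. */
    static void mem_acquire(size_t bytes)
    {
        pthread_mutex_lock(&mem_lock);
        while (mem_left < bytes) {
            pthread_cond_wait(&mem_freed, &mem_lock);
        }
        mem_left -= bytes;
        pthread_mutex_unlock(&mem_lock);
    }

    /* Return 'bytes' to the budget and wake any waiters. */
    static void mem_release(size_t bytes)
    {
        pthread_mutex_lock(&mem_lock);
        mem_left += bytes;
        pthread_cond_broadcast(&mem_freed);
        pthread_mutex_unlock(&mem_lock);
    }

A request would call mem_acquire(bytes) before allocating its bounce
buffer and mem_release(bytes) after freeing it.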
Max
