On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all VM images in our product,
>>>>>> it's the policy. The only exception is the backup target qcow2 image
>>>>>> for compressed backup, because compressed backup is extremely slow
>>>>>> with O_DIRECT (due to unaligned writes). The customer complains that
>>>>>> backup produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow. Backup has several async workers, which write
>>>>>> simultaneously, so in both cases we have to track host cluster
>>>>>> filling (before dropping the cache corresponding to the cluster).
>>>>>> So, if we have to track anyway, let's try to implement the cache.
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously. That's why
>>>> this all is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I'm much less
>>>>> excited...
>>>>>
>>>>>> The idea is simple: cache small unaligned writes and flush the
>>>>>> cluster when it is filled.
>>>>>>
>>>>>> The performance result is very good (the numbers in the table are
>>>>>> the time, in seconds, of a compressed backup of a 1000M disk filled
>>>>>> with ones):
>>>>>
>>>>> "Filled with ones" really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> ---------------  -----------  -----------
>>>>>>                  backup(old)  backup(new)
>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>                               -99%
>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>                               -5%
>>>>>> ---------------  -----------  -----------
>>>>>>
>>>>>> So, we have a benefit even for cached mode! And the fastest thing
>>>>>> is O_DIRECT with the newly implemented cache. So, I suggest
>>>>>> enabling the new cache by default (which is done by the series).
>>>>>
>>>>> First, I'm not sure how O_DIRECT really is relevant, because I
>>>>> don't really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide not to use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache. So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, the performance comparison is relevant to show that O_DIRECT
>>>> as-is is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
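
To make the cache idea from the cover letter above concrete, here is a
minimal sketch of such a cache under heavily simplifying assumptions: a
single writer submitting chunks densely and in allocation order, a fixed
64K host cluster size, and no error handling. All names (ClusterCache,
cluster_cache_write() and so on) are made up for illustration; this is
not the code from the series, which additionally has to track the
out-of-order completions of the async backup workers.

/* Hypothetical sketch, not the code from the series. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)

typedef struct ClusterCache {
    int fd;                     /* image file descriptor */
    int64_t cluster_offset;     /* host offset of the cached cluster */
    size_t filled;              /* bytes accumulated in the current cluster */
    uint8_t buf[CLUSTER_SIZE];  /* cluster-sized bounce buffer */
} ClusterCache;

/* Write out the cached cluster as one aligned, cluster-sized request. */
static int cluster_cache_flush(ClusterCache *c)
{
    if (pwrite(c->fd, c->buf, CLUSTER_SIZE, c->cluster_offset) < 0) {
        return -1;
    }
    c->cluster_offset += CLUSTER_SIZE;
    c->filled = 0;
    return 0;
}

/*
 * Cache one compressed chunk. 'offset' is the host offset handed out by
 * the allocator; chunks are assumed to arrive densely and in order, so a
 * chunk that crosses a cluster boundary first fills (and flushes) the
 * current cluster and the rest lands at the start of the next one.
 */
static int cluster_cache_write(ClusterCache *c, int64_t offset,
                               const void *data, size_t len)
{
    size_t in_cluster = offset - c->cluster_offset;

    while (len > 0) {
        size_t n = CLUSTER_SIZE - in_cluster;
        if (n > len) {
            n = len;
        }
        memcpy(c->buf + in_cluster, data, n);
        c->filled += n;
        if (c->filled == CLUSTER_SIZE && cluster_cache_flush(c) < 0) {
            return -1;
        }
        data = (const uint8_t *)data + n;
        offset += n;
        len -= n;
        in_cluster = 0;
    }
    return 0;
}

Flushing only whole clusters is what makes this play well with O_DIRECT:
each cluster goes out as one aligned, cluster-sized pwrite(). A real
O_DIRECT user would also need a properly aligned buffer (e.g. from
posix_memalign()).
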
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn't make much
>>>>> sense for compressed images, qemu's format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes, I thought in this way, enabling the cache by default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache? You
>>>>> described that that's the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much. How much
>>>>> is the difference really? (I don't know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio. The customer reported that most of RAM
>>>> is polluted by QEMU's cache, and we use O_DIRECT for everything
>>>> except for the target of compressed backup. Still, the pollution may
>>>> relate to several backups, and of course it is simple enough to drop
>>>> the cache after each backup. But I think that even one backup of a
>>>> 16T disk may pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there. I was
>>> referring to whether this series improves the page cache pollution.
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL). You discarded that idea based on the fact
>>>>> that implementing a cache in qemu would be simple, but it isn't,
>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still go through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren't usually sufficiently
>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>
>> Yes, allocation is sequential. But writes are not. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually, comparing
>> cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster,
>> which is significant (still, yes, it would be good to check it on a
>> more real case than all-ones).
>>
>>>
>>>> So I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple to use fadvise as an option in my cache: instead
>>>> of caching data and writing when the cluster is filled, we can
>>>> instead mark the cluster POSIX_FADV_DONTNEED.
>>>>
>>>>>
>>>>> (I don't remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache? I.e., on read, everything is flushed, so we don't
>>>>> have to deal with that. I don't think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I'd really want this code to be
>>>>> simpler. I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and I do it only to make
>>>> it possible to enable the cache by default. Still, the read function
>>>> is really simple, and I don't think that dropping it will simplify
>>>> the code significantly.
>>>
>>> That's too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> Checked that if I mark the whole file with FADV_SEQUENTIAL, the cache
> is not dropped.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     int i;
>     char mb[1024 * 1024];
>     int open_flags = O_RDWR | O_CREAT | O_EXCL;
>     int fadv_flags = 0;
>     int fadv_final_flags = 0;
>     int len = 1024 * 1024;
>     bool do_fsync = false;
>
>     for (i = 1; i < argc - 1; i++) {
>         const char *arg = argv[i];
>
>         if (!strcmp(arg, "direct")) {
>             open_flags |= O_DIRECT;
>         } else if (!strcmp(arg, "seq")) {
>             fadv_flags = POSIX_FADV_SEQUENTIAL;
>         } else if (!strcmp(arg, "dontneed")) {
>             fadv_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "final-dontneed")) {
>             fadv_final_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "fsync")) {
>             do_fsync = true;
>         } else {
>             fprintf(stderr, "unknown: %s\n", arg);
>             return 1;
>         }
>     }
>
>     fd = open(argv[argc - 1], open_flags, 0644);
>     if (fd < 0) {
>         fprintf(stderr, "failed to open\n");
>         return 1;
>     }
>
>     if (fadv_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>     }
>     for (i = 0; i < 100; i++) {
>         write(fd, mb, len);
>     }
>     if (fadv_final_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>     }
>     if (do_fsync) {
>         fsync(fd);
>     }
>
>     close(fd);
> }
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>  RES PAGES  SIZE FILE
>  36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>  RES PAGES  SIZE FILE
>  36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>  RES PAGES  SIZE FILE
>   0B     0    0B x
>
> Backup-generated pagecache is effectively trash: it will never be used.
> And it's bad that it can displace other, useful pagecache. So the best
> thing for backup is direct mode + the proposed cache.

I think that the original intention of Max is about POSIX_FADV_DONTNEED.
One should call this fadvise for each just fully written cluster. Though
this is a bit error-prone and, from a performance point of view, it will
be slower.
This call could be slow, and thus it should be delegated to a coroutine.
We have thought about this, but the amount of work to implement it, even
if it seems lower, is not really lower, and the result would not be as
great.

Den
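
For comparison, here is a minimal sketch of that fadvise alternative,
again with made-up names and not QEMU code: account for how many bytes
of each host cluster have been written and, once a cluster is complete,
ask the kernel to drop exactly that range. As noted above, the
posix_fadvise() call may block, so in QEMU it would have to be delegated
(coroutine or worker thread); the sketch just shows it synchronously.

/* Hypothetical sketch of per-cluster POSIX_FADV_DONTNEED. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE (64 * 1024)
#define MAX_CLUSTERS 4096              /* enough for this illustration */

static size_t filled[MAX_CLUSTERS];    /* bytes written per host cluster */

/* Call after each successful compressed write of [offset, offset + len). */
static void drop_finished_clusters(int fd, int64_t offset, size_t len)
{
    while (len > 0) {
        int64_t cluster = offset / CLUSTER_SIZE;
        size_t in_cluster = offset % CLUSTER_SIZE;
        size_t n = CLUSTER_SIZE - in_cluster;
        if (n > len) {
            n = len;
        }

        filled[cluster] += n;
        if (filled[cluster] == CLUSTER_SIZE) {
            /*
             * The cluster is complete and backup will never read it
             * back, so its pagecache is pure pollution. On Linux,
             * DONTNEED only drops clean pages, which is consistent with
             * the "final-dontneed" numbers above (36M stayed resident),
             * so in practice this would need to be combined with
             * writeback (sync_file_range()/fdatasync()) and offloaded,
             * as discussed.
             */
            posix_fadvise(fd, cluster * CLUSTER_SIZE, CLUSTER_SIZE,
                          POSIX_FADV_DONTNEED);
        }

        offset += n;
        len -= n;
    }
}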