On 2/9/21 9:36 PM, Vladimir Sementsov-Ogievskiy wrote:
> 09.02.2021 19:39, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 17:47, Max Reitz wrote:
>>> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>>>> 09.02.2021 16:25, Max Reitz wrote:
>>>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>>>> Hi all!
>>>>>>
>>>>>> I know, I have several series waiting for a resend, but I had to
>>>>>> switch to another task spawned from our customer's bug.
>>>>>>
>>>>>> Original problem: we use O_DIRECT for all VM images in our product,
>>>>>> it's the policy. The only exception is the backup target qcow2 image
>>>>>> for compressed backup, because compressed backup is extremely slow
>>>>>> with O_DIRECT (due to unaligned writes). The customer complains that
>>>>>> backup produces a lot of pagecache.
>>>>>>
>>>>>> So we can either implement some internal cache or use fadvise
>>>>>> somehow. Backup has several async workers, which write
>>>>>> simultaneously, so in both cases we have to track host cluster
>>>>>> filling (before dropping the cache corresponding to the cluster).
>>>>>> So, if we have to track anyway, let's try to implement the cache.
>>>>>
>>>>> I wanted to be excited here, because that sounds like it would be
>>>>> very easy to implement caching. Like, just keep the cluster at
>>>>> free_byte_offset cached until the cluster it points to changes,
>>>>> then flush the cluster.
>>>>
>>>> The problem is that chunks are written asynchronously. That's why
>>>> this all is not so easy.
>>>>
>>>>>
>>>>> But then I see like 900 new lines of code, and I'm much less
>>>>> excited...
>>>>>
>>>>>> The idea is simple: cache small unaligned writes and flush the
>>>>>> cluster when it is filled.
>>>>>>
>>>>>> The performance result is very good (the numbers in the table are
>>>>>> the time, in seconds, of a compressed backup of a 1000M disk filled
>>>>>> with ones):
>>>>>
>>>>> "Filled with ones" really is an edge case, though.
>>>>
>>>> Yes, I think, all clusters are compressed to rather small chunks :)
>>>>
>>>>>
>>>>>> ---------------  -----------  -----------
>>>>>>                  backup(old)  backup(new)
>>>>>> ssd:hdd(direct)  3e+02        4.4
>>>>>>                               -99%
>>>>>> ssd:hdd(cached)  5.7          5.4
>>>>>>                               -5%
>>>>>> ---------------  -----------  -----------
>>>>>>
>>>>>> So, we have a benefit even for cached mode! And the fastest thing
>>>>>> is O_DIRECT with the newly implemented cache. So, I suggest
>>>>>> enabling the new cache by default (which is done by the series).
>>>>>
>>>>> First, I'm not sure how O_DIRECT really is relevant, because I
>>>>> don't really see the point for writing compressed images.
>>>>
>>>> compressed backup is a point
>>>
>>> (Perhaps irrelevant, but just to be clear:) I meant the point of
>>> using O_DIRECT, which one can decide not to use for backup targets
>>> (as you have done already).
>>>
>>>>> Second, I find it a bit cheating if you say there is a huge
>>>>> improvement for the no-cache case, when actually, well, you just
>>>>> added a cache. So the no-cache case just became faster because
>>>>> there is a cache now.
>>>>
>>>> Still, the performance comparison is relevant to show that O_DIRECT
>>>> as-is is unusable for compressed backup.
>>>
>>> (Again, perhaps irrelevant, but:) Yes, but my first point was
>>> exactly whether O_DIRECT is even relevant for writing compressed
>>> images.
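
To make the cache idea from the cover letter above concrete, here is a
minimal sketch of such a cache under heavily simplifying assumptions: a
single writer submitting chunks densely and in allocation order, a fixed
64K host cluster size, and no error handling. All names (ClusterCache,
cluster_cache_write() and so on) are made up for illustration; this is
not the code from the series, which additionally has to track the
out-of-order completions of the async backup workers.

/* Hypothetical sketch, not the code from the series. */
#include <stdint.h>
#include <string.h>
#include <unistd.h>

#define CLUSTER_SIZE (64 * 1024)

typedef struct ClusterCache {
    int fd;                     /* image file descriptor */
    int64_t cluster_offset;     /* host offset of the cached cluster */
    size_t filled;              /* bytes accumulated in the current cluster */
    uint8_t buf[CLUSTER_SIZE];  /* cluster-sized bounce buffer */
} ClusterCache;

/* Write out the cached cluster as one aligned, cluster-sized request. */
static int cluster_cache_flush(ClusterCache *c)
{
    if (pwrite(c->fd, c->buf, CLUSTER_SIZE, c->cluster_offset) < 0) {
        return -1;
    }
    c->cluster_offset += CLUSTER_SIZE;
    c->filled = 0;
    return 0;
}

/*
 * Cache one compressed chunk. 'offset' is the host offset handed out by
 * the allocator; chunks are assumed to arrive densely and in order, so a
 * chunk that crosses a cluster boundary first fills (and flushes) the
 * current cluster and the rest lands at the start of the next one.
 */
static int cluster_cache_write(ClusterCache *c, int64_t offset,
                               const void *data, size_t len)
{
    size_t in_cluster = offset - c->cluster_offset;

    while (len > 0) {
        size_t n = CLUSTER_SIZE - in_cluster;
        if (n > len) {
            n = len;
        }
        memcpy(c->buf + in_cluster, data, n);
        c->filled += n;
        if (c->filled == CLUSTER_SIZE && cluster_cache_flush(c) < 0) {
            return -1;
        }
        data = (const uint8_t *)data + n;
        offset += n;
        len -= n;
        in_cluster = 0;
    }
    return 0;
}

Flushing only whole clusters is what makes this play well with O_DIRECT:
each cluster goes out as one aligned, cluster-sized pwrite(). A real
O_DIRECT user would also need a properly aligned buffer (e.g. from
posix_memalign()).
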
>>>
>>>>> Well, I suppose I could follow that if O_DIRECT doesn't make much
>>>>> sense for compressed images, qemu's format drivers are free to
>>>>> introduce some caching (because technically the cache.direct
>>>>> option only applies to the protocol driver) for collecting
>>>>> compressed writes.
>>>>
>>>> Yes, I thought in this way, enabling the cache by default.
>>>>
>>>>> That conclusion makes both of my complaints kind of moot.
>>>>>
>>>>> *shrug*
>>>>>
>>>>> Third, what is the real-world impact on the page cache? You
>>>>> described that that's the reason why you need the cache in qemu,
>>>>> because otherwise the page cache is polluted too much. How much
>>>>> is the difference really? (I don't know how good the compression
>>>>> ratio is for real-world images.)
>>>>
>>>> Hm. I don't know the ratio. The customer reported that most of RAM
>>>> is polluted by QEMU's cache, and we use O_DIRECT for everything
>>>> except for the target of compressed backup. Still, the pollution may
>>>> relate to several backups, and of course it is simple enough to drop
>>>> the cache after each backup. But I think that even one backup of a
>>>> 16T disk may pollute RAM enough.
>>>
>>> Oh, sorry, I just realized I had a brain fart there. I was
>>> referring to whether this series improves the page cache pollution.
>>> But obviously it will if it allows you to re-enable O_DIRECT.
>>>
>>>>> Related to that, I remember a long time ago we had some discussion
>>>>> about letting qemu-img convert set a special cache mode for the
>>>>> target image that would make Linux drop everything before the last
>>>>> offset written (i.e., I suppose fadvise() with
>>>>> POSIX_FADV_SEQUENTIAL). You discarded that idea based on the fact
>>>>> that implementing a cache in qemu would be simple, but it isn't,
>>>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>>>> advantage of using that would be that we could reuse it for
>>>>> non-compressed images that are written by backup or qemu-img
>>>>> convert.)
>>>>
>>>> The problem is that writes are async. And therefore, not sequential.
>>>
>>> In theory, yes, but all compressed writes still go through
>>> qcow2_alloc_bytes() right before submitting the write, so I wonder
>>> whether in practice the writes aren't usually sufficiently
>>> sequential to make POSIX_FADV_SEQUENTIAL work fine.
>>
>> Yes, allocation is sequential. But writes are not. Reasonable, I
>> should at least bench it. So we should set POSIX_FADV_SEQUENTIAL for
>> the whole backup target before the backup? Will try. Still, I expect
>> that my cache will show better performance anyway. Actually, comparing
>> cached (by pagecache) vs my cache we have 5.7 -> 4.4, i.e. 20% faster,
>> which is significant (still, yes, it would be good to check it on a
>> more real case than all-ones).
>>
>>>
>>>> So I have to track the writes and wait until the whole cluster is
>>>> filled. It's simple to use fadvise as an option in my cache: instead
>>>> of caching data and writing when the cluster is filled, we can
>>>> instead mark the cluster POSIX_FADV_DONTNEED.
>>>>
>>>>>
>>>>> (I don't remember why that qemu-img discussion died back then.)
>>>>>
>>>>>
>>>>> Fourth, regarding the code, would it be simpler if it were a pure
>>>>> write cache? I.e., on read, everything is flushed, so we don't
>>>>> have to deal with that. I don't think there are many valid cases
>>>>> where a compressed image is both written to and read from at the
>>>>> same time. (Just asking, because I'd really want this code to be
>>>>> simpler. I can imagine that reading from the cache is the least
>>>>> bit of complexity, but perhaps...)
>>>>>
>>>>
>>>> Hm. I really didn't want to support reads, and I do it only to make
>>>> it possible to enable the cache by default. Still, the read function
>>>> is really simple, and I don't think that dropping it will simplify
>>>> the code significantly.
>>>
>>> That's too bad.
>>>
>>> So the only question I have left is what POSIX_FADV_SEQUENTIAL
>>> actually would do in practice.
>>
>> will check.
>>
>
> Checked that if I mark the whole file with FADV_SEQUENTIAL, the cache
> is not dropped.
>
> Test:
> [root@kvm fadvise]# cat a.c
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <getopt.h>
> #include <string.h>
> #include <stdbool.h>
>
> int main(int argc, char *argv[])
> {
>     int fd;
>     int i;
>     char mb[1024 * 1024];
>     int open_flags = O_RDWR | O_CREAT | O_EXCL;
>     int fadv_flags = 0;
>     int fadv_final_flags = 0;
>     int len = 1024 * 1024;
>     bool do_fsync = false;
>
>     for (i = 1; i < argc - 1; i++) {
>         const char *arg = argv[i];
>
>         if (!strcmp(arg, "direct")) {
>             open_flags |= O_DIRECT;
>         } else if (!strcmp(arg, "seq")) {
>             fadv_flags = POSIX_FADV_SEQUENTIAL;
>         } else if (!strcmp(arg, "dontneed")) {
>             fadv_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "final-dontneed")) {
>             fadv_final_flags = POSIX_FADV_DONTNEED;
>         } else if (!strcmp(arg, "fsync")) {
>             do_fsync = true;
>         } else {
>             fprintf(stderr, "unknown: %s\n", arg);
>             return 1;
>         }
>     }
>
>     fd = open(argv[argc - 1], open_flags, 0644);
>     if (fd < 0) {
>         fprintf(stderr, "failed to open\n");
>         return 1;
>     }
>
>     if (fadv_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_flags);
>     }
>     for (i = 0; i < 100; i++) {
>         write(fd, mb, len);
>     }
>     if (fadv_final_flags) {
>         posix_fadvise(fd, 0, 100 * 1024 * 1024, fadv_final_flags);
>     }
>     if (do_fsync) {
>         fsync(fd);
>     }
>
>     close(fd);
> }
>
> [root@kvm fadvise]# gcc a.c
> [root@kvm fadvise]# rm -f x; ./a.out seq x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed x; fincore x
>  RES PAGES  SIZE FILE
>  36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out seq fsync x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out dontneed fsync x; fincore x
>  RES PAGES  SIZE FILE
> 100M 25600  100M x
> [root@kvm fadvise]# rm -f x; ./a.out final-dontneed fsync x; fincore x
>  RES PAGES  SIZE FILE
>  36M  9216  100M x
> [root@kvm fadvise]# rm -f x; ./a.out direct x; fincore x
>  RES PAGES  SIZE FILE
>   0B     0    0B x
>
> Backup-generated pagecache is effectively trash: it will never be used.
> And it's bad that it can displace other, useful pagecache. So the best
> thing for backup is direct mode + the proposed cache.

I think that the original intention of Max is about POSIX_FADV_DONTNEED.
One should call this fadvise for each just fully written cluster. Though
this is a bit error-prone and, from a performance point of view, it will
be slower.
This call could be slow, and thus it should be delegated to a coroutine.
We have thought about this, but the amount of work to implement it, even
if it seems lower, is not really lower, and the result would not be as
great.

Den
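
For comparison, here is a minimal sketch of that fadvise alternative,
again with made-up names and not QEMU code: account for how many bytes
of each host cluster have been written and, once a cluster is complete,
ask the kernel to drop exactly that range. As noted above, the
posix_fadvise() call may block, so in QEMU it would have to be delegated
(coroutine or worker thread); the sketch just shows it synchronously.

/* Hypothetical sketch of per-cluster POSIX_FADV_DONTNEED. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>

#define CLUSTER_SIZE (64 * 1024)
#define MAX_CLUSTERS 4096              /* enough for this illustration */

static size_t filled[MAX_CLUSTERS];    /* bytes written per host cluster */

/* Call after each successful compressed write of [offset, offset + len). */
static void drop_finished_clusters(int fd, int64_t offset, size_t len)
{
    while (len > 0) {
        int64_t cluster = offset / CLUSTER_SIZE;
        size_t in_cluster = offset % CLUSTER_SIZE;
        size_t n = CLUSTER_SIZE - in_cluster;
        if (n > len) {
            n = len;
        }

        filled[cluster] += n;
        if (filled[cluster] == CLUSTER_SIZE) {
            /*
             * The cluster is complete and backup will never read it
             * back, so its pagecache is pure pollution. On Linux,
             * DONTNEED only drops clean pages, which is consistent with
             * the "final-dontneed" numbers above (36M stayed resident),
             * so in practice this would need to be combined with
             * writeback (sync_file_range()/fdatasync()) and offloaded,
             * as discussed.
             */
            posix_fadvise(fd, cluster * CLUSTER_SIZE, CLUSTER_SIZE,
                          POSIX_FADV_DONTNEED);
        }

        offset += n;
        len -= n;
    }
}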