On 2/9/21 5:47 PM, Max Reitz wrote:
> On 09.02.21 15:10, Vladimir Sementsov-Ogievskiy wrote:
>> 09.02.2021 16:25, Max Reitz wrote:
>>> On 29.01.21 17:50, Vladimir Sementsov-Ogievskiy wrote:
>>>> Hi all!
>>>>
>>>> I know I have several series waiting for a resend, but I had to
>>>> switch to another task spawned from our customer's bug.
>>>>
>>>> Original problem: we use O_DIRECT for all VM images in our product,
>>>> it's the policy. The only exclusion is the backup target qcow2 image
>>>> for compressed backup, because compressed backup is extremely slow
>>>> with O_DIRECT (due to unaligned writes). The customer complains that
>>>> backup produces a lot of page cache.
>>>>
>>>> So we can either implement some internal cache or use fadvise
>>>> somehow. Backup has several async workers which write
>>>> simultaneously, so either way we have to track host cluster filling
>>>> (before dropping the cache corresponding to the cluster). So, if we
>>>> have to track anyway, let's try to implement the cache.
>>>
>>> I wanted to be excited here, because that sounds like it would be
>>> very easy to implement caching. Like, just keep the cluster at
>>> free_byte_offset cached until the cluster it points to changes, then
>>> flush the cluster.
>>
>> The problem is that chunks are written asynchronously. That's why
>> this all is not so easy.
>>
>>> But then I see like 900 new lines of code, and I’m much less
>>> excited...
>>>
>>>> The idea is simple: cache small unaligned writes and flush the
>>>> cluster when it is filled.
>>>>
>>>> The performance result is very good (the numbers in the table are
>>>> times, in seconds, of a compressed backup of a 1000M disk filled
>>>> with ones):
>>>
>>> “Filled with ones” really is an edge case, though.
>>
>> Yes, I think all clusters are compressed to rather small chunks :)
>>
>>>> ---------------  -----------  -----------
>>>>                  backup(old)  backup(new)
>>>> ssd:hdd(direct)  3e+02        4.4
>>>>                               -99%
>>>> ssd:hdd(cached)  5.7          5.4
>>>>                               -5%
>>>> ---------------  -----------  -----------
>>>>
>>>> So, we have a benefit even for the cached mode! And the fastest
>>>> thing is O_DIRECT with the newly implemented cache. So, I suggest
>>>> enabling the new cache by default (which is done by the series).
>>>
>>> First, I’m not sure how O_DIRECT really is relevant, because I don’t
>>> really see the point for writing compressed images.
>>
>> Compressed backup is a point.
>
> (Perhaps irrelevant, but just to be clear:) I meant the point of using
> O_DIRECT, which one can decide to not use for backup targets (as you
> have done already).
>
>>> Second, I find it a bit cheating if you say there is a huge
>>> improvement for the no-cache case, when actually, well, you just
>>> added a cache. So the no-cache case just became faster because
>>> there is a cache now.
>>
>> Still, the performance comparison is relevant to show that O_DIRECT
>> as-is is unusable for compressed backup.
>
> (Again, perhaps irrelevant, but:) Yes, but my first point was exactly
> whether O_DIRECT is even relevant for writing compressed images.
>
>>> Well, I suppose I could follow that if O_DIRECT doesn’t make much
>>> sense for compressed images, qemu’s format drivers are free to
>>> introduce some caching (because technically the cache.direct option
>>> only applies to the protocol driver) for collecting compressed
>>> writes.
>>
>> Yes, I thought in this way, enabling the cache by default.
>>
>>> That conclusion makes both of my complaints kind of moot.
>>>
>>> *shrug*
>>>
>>> Third, what is the real-world impact on the page cache?
>>> You described that that’s the reason why you need the cache in
>>> qemu, because otherwise the page cache is polluted too much. How
>>> much is the difference really? (I don’t know how good the
>>> compression ratio is for real-world images.)
>>
>> Hm, I don't know the ratio. The customer reported that most of RAM is
>> polluted by QEMU's cache, and we use O_DIRECT for everything except
>> for the target of compressed backup. Still, the pollution may relate
>> to several backups, and of course it is simple enough to drop the
>> cache after each backup. But I think that even one backup of a 16T
>> disk may pollute RAM enough.
>
> Oh, sorry, I just realized I had a brain fart there. I was referring
> to whether this series improves the page cache pollution. But
> obviously it will if it allows you to re-enable O_DIRECT.
>
>>> Related to that, I remember a long time ago we had some discussion
>>> about letting qemu-img convert set a special cache mode for the
>>> target image that would make Linux drop everything before the last
>>> offset written (i.e., I suppose fadvise() with
>>> POSIX_FADV_SEQUENTIAL). You discarded that idea based on the fact
>>> that implementing a cache in qemu would be simple, but it isn’t,
>>> really. What would the impact of POSIX_FADV_SEQUENTIAL be? (One
>>> advantage of using that would be that we could reuse it for
>>> non-compressed images that are written by backup or qemu-img
>>> convert.)
>>
>> The problem is that writes are async, and therefore not sequential.
>
> In theory, yes, but all compressed writes still go through
> qcow2_alloc_bytes() right before submitting the write, so I wonder
> whether in practice the writes aren’t usually sufficiently sequential
> to make POSIX_FADV_SEQUENTIAL work fine.
>
>> So I have to track the writes and wait until the whole cluster is
>> filled. It is simple to support fadvise as an option in my cache:
>> instead of caching the data and writing it out when the cluster is
>> filled, we can instead mark the cluster POSIX_FADV_DONTNEED.
>>
>>> (I don’t remember why that qemu-img discussion died back then.)
>>>
>>> Fourth, regarding the code, would it be simpler if it were a pure
>>> write cache? I.e., on read, everything is flushed, so we don’t have
>>> to deal with that. I don’t think there are many valid cases where a
>>> compressed image is both written to and read from at the same time.
>>> (Just asking, because I’d really want this code to be simpler. I
>>> can imagine that reading from the cache is the least bit of
>>> complexity, but perhaps...)
>>
>> Hm. I really didn't want to support reads, and did it only to make it
>> possible to enable the cache by default. Still, the read function is
>> really simple, and I don't think that dropping it will simplify the
>> code significantly.
>
> That’s too bad.
>
> So the only question I have left is what POSIX_FADV_SEQUENTIAL
> actually would do in practice.
>
> (But even then, the premise just doesn’t motivate me sufficiently
> yet...)

POSIX_FADV_SEQUENTIAL influences the amount of read-ahead only.
int generic_fadvise(struct file *file, loff_t offset, loff_t len,
                    int advice)
{
        .....
        case POSIX_FADV_NORMAL:
                file->f_ra.ra_pages = bdi->ra_pages;
                spin_lock(&file->f_lock);
                file->f_mode &= ~FMODE_RANDOM;
                spin_unlock(&file->f_lock);
                break;
        case POSIX_FADV_SEQUENTIAL:
                file->f_ra.ra_pages = bdi->ra_pages * 2;
                spin_lock(&file->f_lock);
                file->f_mode &= ~FMODE_RANDOM;
                spin_unlock(&file->f_lock);
                break;

Thus the only difference from the usual NORMAL mode would be a doubled
amount of read-ahead pages.

Den
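
For reference, here is a minimal sketch of the POSIX_FADV_DONTNEED
variant Vladimir describes above (a hypothetical helper with made-up
names, not code from the series): once the job knows a host cluster of
the backup target has been completely filled, write it back and drop
its pages, so the page cache does not keep growing even without
O_DIRECT.

/*
 * Sketch only: drop the page cache of one host cluster of the backup
 * target after the backup job has completely filled it.
 * POSIX_FADV_DONTNEED skips dirty pages, so the range is written back
 * first; sync_file_range() is Linux-specific and limits the writeback
 * to this cluster instead of flushing the whole file.
 */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>

static int drop_filled_cluster(int fd, off_t cluster_offset,
                               off_t cluster_size)
{
        int ret;

        if (sync_file_range(fd, cluster_offset, cluster_size,
                            SYNC_FILE_RANGE_WAIT_BEFORE |
                            SYNC_FILE_RANGE_WRITE |
                            SYNC_FILE_RANGE_WAIT_AFTER) < 0) {
                return -errno;
        }

        /* The pages are clean now, so the kernel may actually evict them. */
        ret = posix_fadvise(fd, cluster_offset, cluster_size,
                            POSIX_FADV_DONTNEED);
        return ret ? -ret : 0;  /* posix_fadvise() returns the error number */
}

Either way, the cluster-fill tracking from the series is still needed
to know when it is safe to call such a helper, which is exactly the
point Vladimir makes above.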