On Wed, Apr 27, 2022 at 12:41 PM Andreas Gruenbacher <agrue...@redhat.com> wrote:
>
> I wonder if this could be documented in the read and write manual
> pages. Or would that be asking too much?
I don't think it would be asking too much, since it's basically just describing what Linux has always done in all the major filesystems.

Eg look at filemap_read(), which is basically the canonical read function, and note how it doesn't take a single lock at that level.

We *do* have synchronization at a page level, though, ie we've always had that page-level "uptodate" bit, of course (ok, so "always" isn't true - back in the distant past it was the 'struct buffer_head' that was the synchronization point).

That said, even that is not synchronizing against "new writes", but only against "new creations" (which may, of course, be writers, but is equally likely to be just reading the contents from disk).

That said:

 (a) different filesystems can and will do different things. Not all filesystems use filemap_read() at all, and even the ones that do often have their own wrappers. Such wrappers *can* do extra serialization, and have their own rules.

     But ext4 does not, for example (see ext4_file_read_iter()). And as mentioned, I *think* XFS honors that old POSIX rule for historical reasons.

 (b) we do have *different* locking. For example, these days we actually serialize properly on the file->f_pos, which means that a certain *class* of read/write things are atomic wrt each other, because we actually hold that f_pos lock over the whole operation, and so if you do file reads and writes using the same file descriptor, they'll be disjoint.

     That, btw, hasn't always been true. If you had multiple threads using the same file pointer, I think we used to get basically random results. So we have actually strengthened our locking in this area, and made it much better.

     But note how even if you have the same file descriptor open, and then do pread/pwrite, those can and will happen concurrently.
And mmap accesses and modifications are obviously *always* concurrent, even if the fault itself - but not the accesses - might end up being serialized due to some filesystem locking implementation detail.

End result: the exact serialization is complex, depends on the filesystem, and is just not really something that should be described or even relied on (eg that f_pos serialization is something we do properly now, but didn't necessarily do in the past, so ..)

Is it then worth pointing out one odd POSIX rule that basically nobody but some very low-level filesystem people have ever heard about, and that no version of Linux has ever conformed to in the main default filesystems, and that no user has ever cared about?

                Linus