On Mon, May 2, 2022 at 8:32 PM Linus Torvalds <torva...@linux-foundation.org> wrote: > On Thu, Apr 28, 2022 at 10:39 AM Andreas Gruenbacher <agrue...@redhat.com> > wrote: > > > > Yes, but note that it's gfs2_file_buffered_write() that fails. When > > the pagefault_disable/enable() around iomap_file_buffered_write() is > > removed, the corruption goes away. > > I looked some more at this on and off, and ended up even more confused. > > For some reason, I'd mostly looked at the read case, because I had > mis-read some of your emails and thought it was the buffered reads > that caused problems. > > Then I went back more carefully, and realized you had always said > gfs2_file_buffered_write() was where the issues happened, and looked > at that path more, and that confused me even *MORE*. > > Because that case has always done the copy from user space with page > faults disabled, because of the traditional deadlock with reading from > user space while holding the page lock on the target page cache page. > > So that is not really about the new deadlock with filesystem locks, > that was fixed by 00bfe02f4796 ("gfs2: Fix mmap + page fault deadlocks > for buffered I/O"). > > So now that I'm looking at the right function (maybe) I'm going "huh", > because it's none of the complex cases that would seem to fail, it's > literally just the fault_in_iov_iter_readable() that we've always done > in iomap_write_iter() that presumably starts failing. > > But *that* old code seems bogus too. It's doing > > if (unlikely(fault_in_iov_iter_readable(i, bytes) == bytes)) { > status = -EFAULT; > break; > } > > which on the face of it is sane: it's saying "if we can't fault in any > bytes, then stop trying". > > And it's good, and correct, but it does leave one case open. > > Because what if the result is "we can fault things in _partially_"? > > The code blithely goes on and tries to do the whole 'bytes' range _anyway_. > > Now, with a bug-free filesystem, this really shouldn't matter, since > the later copy_page_from_iter_atomic() thing should then DTRT anyway, > but this does mean that one fundamental thing that that commit > 00bfe02f4796 changed is that it basically disabled that > fault_in_iov_iter_readable() that *used* to fault in the whole range, > and now potentially only faults in a small area. > > That, in turn, means that in practice it *used* to do "write_end()" > with a fully successful range, ie when it did that > > status = a_ops->write_end(file, mapping, pos, bytes, copied, > page, fsdata); > > then "bytes" and "copied" were the same. > > But now that commit 00bfe02f4796 added the "disable_pagefault()" > around the whole thing, fault_in_iov_iter_readable() will easily fail > half-way instead of bringing the next page in, and then that > ->write_begin() to ->write_end() sequence will see the copy in the > middle failing half-way too, and you'll have that write_end() > condition with the write _partially_ succeeding. > > Which is the complex case for write_end() that you practically > speaking never saw before (it *could* happen with a race with swap-out > or similar, but it was not really something you could trigger in real > life. > > And I suspect this is what bites you with gfs2 > > To *test* that hypothesis, how about you try this attached patch? The > generic_perform_write() function in mm/filemap.c has the same exact > pattern, but as mentioned, a filesystem really needs to be able to > handle the partial write_end() case, so it's not a *bug* in that code, > but it migth be triggering a bug in gfs2. > > And gfs2 only uses the iomap_write_iter() case, I think. So that's the > only case this attached patch changes. > > Again - I think the unpatched iomap_write_iter() code is fine, but I > think it may be what then triggers the real bug in gfs2. So this patch > is not wrong per se, but this patch is basically a "hide the problem" > patch, and it would be very interesting to hear if it does indeed fix > your test-case.
We still get data corruption with the patch applied. The WARN_ON_ONCE(!bytes) doesn't trigger. As an additional experiment, I've added code to check the iterator position that iomap_file_buffered_write() returns, and it's all looking good as well: an iov_iter_advance(orig_from, written) from the original position always gets us to the same iterator. This points at gfs2 getting things wrong after a short write, for example, marking a page / folio uptodate that isn't. But the uptodate handling happens at the iomap layer, so this doesn't leave me with an immediate suspect. We're on filesystems with block size == page size, so none of the struct iomap_page uptodata handling should be involved, either. > Because that would pinpoint exactly what the bug is. > > I'm adding Christoph and Darrick as iomap maintainers here to the > participants (and Dave Chinner in case he's also the temporary > maintainer because Darrick is doing reviews) not because they > necessarily care, but just because this test-patch obviously involves > the iomap code. > > NOTE! This patch is entirely untested. I also didn't actually yet go > look at what gfs2 does when 'bytes' and 'copied' are different. But > since I finally think I figured out what might be going on, I decided > I'd send this out sooner rather than later. > > Because this is the first thing that makes me go "Aaahh.. This might > explain it". > > Linus Thanks, Andreas