Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> A cache that isn't actually consistent is a _bug_. You're being
> obsequious. And any time this has come up in previous discussions
> (including at LSF), that was never up for debate, the only question has
> been whether it was even possible to practically fix it.

That is not my impression. But again, if you think it is useful, go
ahead and sell people on the idea. But please prepare a series that
includes the rationale, performance tradeoffs and real-life implications
for it. And do it on the existing code that people use and not just your
shiny new thing.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Fri, May 26, 2023 at 01:10:43AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > > I'm absolutely not in favour of adding workarounds for these kinds
> > > > of locking problems to the core kernel. I already feel bad for
> > > > allowing the small workaround in iomap for btrfs, as just fixing
> > > > the locking back then would have avoided massive ratholing.
> > >
> > > Please let me know when those btrfs changes are in a presentable shape ...
> >
> > I would also be curious to know what btrfs needs and what the approach
> > is there.
>
> btrfs has the extent locked, where "extent locked" is a somewhat magic
> range lock that actually includes different lock bits. It does so
> because it clears the page writeback bit when the data made it to the
> media, but before the metadata required to find it is committed, and
> the extent lock prevents it from trying to do a readpage on something
> that has actually very recently been written back but not fully
> committed. Once btrfs is changed to only clear the page writeback bit
> once the write is fully committed, like in other file systems, this
> extra level of locking can go away, and there are no more locks in the
> readpage path that are also taken by the direct I/O code. With that, a
> lot of code in btrfs working around this can go away, including the
> no-fault direct I/O code.

Wow, yeah, that is not how that is supposed to work...
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Fri, May 26, 2023 at 01:06:46AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> > A cache that isn't actually consistent is a _bug_. You're being
> > obsequious. And any time this has come up in previous discussions
> > (including at LSF), that was never up for debate, the only question has
> > been whether it was even possible to practically fix it.
>
> That is not my impression. But again, if you think it is useful,
> go ahead and sell people on the idea. But please prepare a series
> that includes the rationale, performance tradeoffs and real-life
> implications for it. And do it on the existing code that people use
> and not just your shiny new thing.

When I'm ready to lift this to the VFS level I will; it should simplify
locking overall, and it'll be one less thing for people to worry about
(i.e. the fact that even _readahead_ can pull in pages a dio is
invalidating is a really nice footgun if not worked around).

Right now though I've got more than enough on my plate just trying to
finally get bcachefs merged. I'm happy to explain what this is for, but
I'm not ready for additional headaches or projects yet.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > I'm absolutely not in favour of adding workarounds for these kinds
> > > of locking problems to the core kernel. I already feel bad for
> > > allowing the small workaround in iomap for btrfs, as just fixing the
> > > locking back then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.

btrfs has the extent locked, where "extent locked" is a somewhat magic
range lock that actually includes different lock bits. It does so
because it clears the page writeback bit when the data made it to the
media, but before the metadata required to find it is committed, and the
extent lock prevents it from trying to do a readpage on something that
has actually very recently been written back but not fully committed.
Once btrfs is changed to only clear the page writeback bit once the
write is fully committed, like in other file systems, this extra level
of locking can go away, and there are no more locks in the readpage path
that are also taken by the direct I/O code. With that, a lot of code in
btrfs working around this can go away, including the no-fault direct I/O
code.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
>
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
>
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.

Ah, gotcha. Same basic issue then: dio -> page fault means you're
taking two DLM locks simultaneously, so without lock ordering you get
deadlock.

So that means promoting this to the VFS won't be useful to you (because
the VFS lock will be a local lock, not your DLM lock). Do you have
trylock() and a defined lock ordering you can check for your glocks
(inode number)?

Here's the new and expanded commit message, in case my approach is of
interest to you:

commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet
Date: Wed Oct 16 15:03:50 2019 -0400

sched: Add task_struct->faults_disabled_mapping

There has been a long-standing page cache coherency bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to.
For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.
However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads cache disk block mappings (!) and attach them to
the pagecache, and bcachefs attaches disk reservations to pagecache
pages.

This issue has been hard to fix, because:

- we need to add a lock (henceforth called pagecache_add_lock), which
  would be held for the duration of the direct IO
- page faults add pages to the page cache, and thus need to take the
  same lock
- dio -> gup -> page fault thus can deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical this adds a field to task_struct.

Then the full method is:

- in the dio path, when we take the first pagecache_add_lock, note the
  mapping in task_struct
- in the page fault handler, if faults_disabled_mapping is set, we
  check if it's the same mapping as the one we're taking a page fault
  for, and if so return an error.

Then we check lock ordering: if there's a lock ordering violation and
trylock fails, we'll have to cycle the locks and return an error that
tells the DIO path to retry: faults_disabled_mapping is also used for
signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive
with each other - we'll need a new type of lock to pair with this
approach.

Signed-off-by: Kent Overstreet
Cc: Jan Kara
Cc: Darrick J. Wong
Cc: linux-fsde...@vger.kernel.org
Cc: Andreas Grünbacher
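To make the deadlock-avoidance method in the commit message above concrete, here is a minimal userspace model of it. This is an illustration, not the bcachefs code: `struct task`, `dio_lock_mapping()` and `fault_in_mapping()` are hypothetical stand-ins for task_struct, the dio path, and the fault handler, and `mapping` is just an opaque pointer standing in for a struct address_space *.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct. */
struct task {
	const void *faults_disabled_mapping;
};

/* dio path: note the mapping, then take pagecache_add_lock on it */
static void dio_lock_mapping(struct task *t, const void *mapping)
{
	t->faults_disabled_mapping = mapping;
	/* pagecache_add_lock(mapping) would be taken here */
}

static void dio_unlock_mapping(struct task *t)
{
	/* pagecache_add_lock would be released here */
	t->faults_disabled_mapping = NULL;
}

/*
 * Page fault handler: if this task already holds pagecache_add_lock on
 * the mapping it is faulting on, fail instead of deadlocking; the dio
 * path sees the error, drops its locks, and retries.
 */
static int fault_in_mapping(struct task *t, const void *mapping)
{
	if (t->faults_disabled_mapping == mapping)
		return -1;	/* would self-deadlock: tell dio to retry */
	/* otherwise: take pagecache_add_lock, add the page, proceed */
	return 0;
}
```

The model only shows the AA case (dio buffer mmaped from the file being written); the lock-ordering check and trylock for faults on a *different* mapping described above are omitted.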
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
Am Fr., 26. Mai 2023 um 01:20 Uhr schrieb Kent Overstreet : > On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote: > > Am Di., 23. Mai 2023 um 18:28 Uhr schrieb Christoph Hellwig > > : > > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote: > > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now > > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25 > > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a > > > > different > > > > way (by prefaulting pages from the iter before grabbing the problematic > > > > lock and then disabling page faults for the iomap_dio_rw() call). I > > > > guess > > > > we should somehow unify these schemes so that we don't have two > > > > mechanisms > > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC. > > > > > > > > Also good that you've written a fstest for this, that is definitely a > > > > useful > > > > addition, although I suspect GFS2 guys added a test for this not so long > > > > ago when testing their stuff. Maybe they have a pointer handy? > > > > > > generic/708 is the btrfs version of this. > > > > > > But I think all of the file systems that have this deadlock are actually > > > fundamentally broken because they have a mess up locking hierarchy > > > where page faults take the same lock that is held over the the direct I/ > > > operation. And the right thing is to fix this. I have work in progress > > > for btrfs, and something similar should apply to gfs2, with the added > > > complication that it probably means a revision to their network > > > protocol. > > > > We do disable page faults, and there can be deadlocks in page fault > > handlers while no page faults are allowed. 
> > > > I'm roughly aware of the locking hierarchy that other filesystems use, > > and that's something we want to avoid because of two reasons: (1) it > > would be an incompatible change, and (2) we want to avoid cluster-wide > > locking operations as much as possible because they are very slow. > > > > These kinds of locking conflicts are so rare in practice that the > > theoretical inefficiency of having to retry the operation doesn't > > matter. > > Would you be willing to expand on that? I'm wondering if this would > simplify things for gfs2, but you mention locking heirarchy being an > incompatible change - how does that work? Oh, it's just that gfs2 uses one dlm lock per inode to control access to that inode. In the code, this is called the "inode glock" --- glocks being an abstraction above dlm locks --- but it boils down to dlm locks in the end. An additional layer of locking will only work correctly if all cluster nodes use the new locks consistently, so old cluster nodes will become incompatible. Those kinds of changes are hard. But the additional lock taking would also hurt performance, forever, and I'd really like to avoid taking that hit. It may not be obvious to everyone, but allowing page faults during reads and writes (i.e., while holding dlm locks) can lead to distributed deadlocks. We cannot just pretend to be a local filesystem. Thanks, Andreas > > > I'm absolutely not in favour to add workarounds for thes kind of locking > > > problems to the core kernel. I already feel bad for allowing the > > > small workaround in iomap for btrfs, as just fixing the locking back > > > then would have avoid massive ratholing. > > > > Please let me know when those btrfs changes are in a presentable shape ... > > I would also be curious to know what btrfs needs and what the approach > is there.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote: > Am Di., 23. Mai 2023 um 18:28 Uhr schrieb Christoph Hellwig > : > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote: > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25 > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different > > > way (by prefaulting pages from the iter before grabbing the problematic > > > lock and then disabling page faults for the iomap_dio_rw() call). I guess > > > we should somehow unify these schemes so that we don't have two mechanisms > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC. > > > > > > Also good that you've written a fstest for this, that is definitely a > > > useful > > > addition, although I suspect GFS2 guys added a test for this not so long > > > ago when testing their stuff. Maybe they have a pointer handy? > > > > generic/708 is the btrfs version of this. > > > > But I think all of the file systems that have this deadlock are actually > > fundamentally broken because they have a mess up locking hierarchy > > where page faults take the same lock that is held over the the direct I/ > > operation. And the right thing is to fix this. I have work in progress > > for btrfs, and something similar should apply to gfs2, with the added > > complication that it probably means a revision to their network > > protocol. > > We do disable page faults, and there can be deadlocks in page fault > handlers while no page faults are allowed. > > I'm roughly aware of the locking hierarchy that other filesystems use, > and that's something we want to avoid because of two reasons: (1) it > would be an incompatible change, and (2) we want to avoid cluster-wide > locking operations as much as possible because they are very slow. 
> These kinds of locking conflicts are so rare in practice that the
> theoretical inefficiency of having to retry the operation doesn't
> matter.

Would you be willing to expand on that? I'm wondering if this would
simplify things for gfs2, but you mention the locking hierarchy being an
incompatible change - how does that work?

> > I'm absolutely not in favour of adding workarounds for these kinds
> > of locking problems to the core kernel. I already feel bad for
> > allowing the small workaround in iomap for btrfs, as just fixing the
> > locking back then would have avoided massive ratholing.
>
> Please let me know when those btrfs changes are in a presentable shape ...

I would also be curious to know what btrfs needs and what the approach
is there.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
Am Do., 25. Mai 2023 um 10:56 Uhr schrieb Jan Kara : > On Tue 23-05-23 12:49:06, Kent Overstreet wrote: > > > > No, that's definitely handled (and you can see it in the code I linked), > > > > and I wrote a torture test for fstests as well. > > > > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25 > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different > > > way (by prefaulting pages from the iter before grabbing the problematic > > > lock and then disabling page faults for the iomap_dio_rw() call). I guess > > > we should somehow unify these schemes so that we don't have two mechanisms > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC. > > > > Oof, that sounds a bit sketchy. What happens if the dio call passes in > > an address from the same address space? > > If we submit direct IO that uses mapped file F at offset O as a buffer for > direct IO from file F, offset O, it will currently livelock in an > indefinite retry loop. It should rather return error or fall back to > buffered IO. But that should be fixable. Andreas? Yes, I guess. Thanks for the heads-up. Andreas > But if the buffer and direct IO range does not overlap, it will just > happily work - iomap_dio_rw() invalidates only the range direct IO is done > to. > > > What happens if we race with the pages we faulted in being evicted? > > We fault them in again and retry. > > > > Also good that you've written a fstest for this, that is definitely a > > > useful > > > addition, although I suspect GFS2 guys added a test for this not so long > > > ago when testing their stuff. Maybe they have a pointer handy? > > > > More tests more good. 
> > > > So if we want to lift this scheme to the VFS layer, we'd start by > > replacing the lock you added (grepping for it, the name escapes me) with > > a different type of lock - two_state_shared_lock in my code, it's like a > > rw lock except writers don't exclude other writers. That way the DIO > > path can use it without singlethreading writes to a single file. > > Yes, I've noticed that you are introducing in bcachefs a lock with very > similar semantics to mapping->invalidate_lock, just with this special lock > type. What I'm kind of worried about with two_state_shared_lock as > implemented in bcachefs is the fairness. AFAICS so far if someone is e.g. > heavily faulting pages on a file, direct IO to that file can be starved > indefinitely. That is IMHO not a good thing and I would not like to use > this type of lock in VFS until this problem is resolved. But it should be > fixable e.g. by introducing some kind of deadline for a waiter after which > it will block acquisitions of the other lock state. > > Honza > -- > Jan Kara > SUSE Labs, CR
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 18:28, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've
> > now remembered that GFS2 has dealt with the same deadlocks -
> > b01b2d72da25 ("gfs2: Fix mmap + page fault deadlocks for direct I/O")
> > - in a different way (by prefaulting pages from the iter before
> > grabbing the problematic lock and then disabling page faults for the
> > iomap_dio_rw() call). I guess we should somehow unify these schemes
> > so that we don't have two mechanisms for avoiding exactly the same
> > deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written an fstest for this, that is definitely
> > a useful addition, although I suspect the GFS2 guys added a test for
> > this not so long ago when testing their stuff. Maybe they have a
> > pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are
> actually fundamentally broken because they have a messed-up locking
> hierarchy where page faults take the same lock that is held over the
> direct I/O operation. And the right thing is to fix this. I have work
> in progress for btrfs, and something similar should apply to gfs2,
> with the added complication that it probably means a revision to their
> network protocol.

We do disable page faults, and there can be deadlocks in page fault
handlers while no page faults are allowed.

I'm roughly aware of the locking hierarchy that other filesystems use,
and that's something we want to avoid for two reasons: (1) it would be
an incompatible change, and (2) we want to avoid cluster-wide locking
operations as much as possible because they are very slow.

These kinds of locking conflicts are so rare in practice that the
theoretical inefficiency of having to retry the operation doesn't
matter.
> I'm absolutely not in favour of adding workarounds for these kinds of
> locking problems to the core kernel. I already feel bad for allowing
> the small workaround in iomap for btrfs, as just fixing the locking
> back then would have avoided massive ratholing.

Please let me know when those btrfs changes are in a presentable shape ...

Thanks,
Andreas
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 15:37, Jan Kara wrote:
> On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > > From: Kent Overstreet
> > > >
> > > > This is used by bcachefs to fix a page cache coherency issue with
> > > > O_DIRECT writes.
> > > >
> > > > Also relevant: mapping->invalidate_lock, see below.
> > > >
> > > > O_DIRECT writes (and other filesystem operations that modify file
> > > > data while bypassing the page cache) need to shoot down ranges of
> > > > the page cache - and additionally, need locking to prevent those
> > > > pages from being pulled back in.
> > > >
> > > > But O_DIRECT writes invoke the page fault handler (via
> > > > get_user_pages), and the page fault handler will need to take that
> > > > same lock - this is a classic recursive deadlock if userspace has
> > > > mmaped the file they're DIO writing to and uses those pages for
> > > > the buffer to write from, and it's a lock ordering deadlock in
> > > > general.
> > > >
> > > > Thus we need a way to signal from the dio code to the page fault
> > > > handler when we are already holding the pagecache add lock on an
> > > > address space - this patch just adds a member to task_struct for
> > > > this purpose. For now only bcachefs is implementing this locking,
> > > > though it may be moved out of bcachefs and made available to
> > > > other filesystems in the future.
> > >
> > > It would be nice to have at least a link to the code that's
> > > actually using the field you are adding.
> >
> > Bit of a trick to link to a _later_ patch in the series from a commit
> > message, but...
> >
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
>
> Thanks and I'm sorry for the delay.
> > > > Also I think we were already through this discussion [1] and we ended up > > > agreeing that your scheme actually solves only the AA deadlock but a > > > malicious userspace can easily create AB BA deadlock by running direct IO > > > to file A using mapped file B as a buffer *and* direct IO to file B using > > > mapped file A as a buffer. > > > > No, that's definitely handled (and you can see it in the code I linked), > > and I wrote a torture test for fstests as well. > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25 > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different > way (by prefaulting pages from the iter before grabbing the problematic > lock and then disabling page faults for the iomap_dio_rw() call). I guess > we should somehow unify these schemes so that we don't have two mechanisms > for avoiding exactly the same deadlock. Adding GFS2 guys to CC. > > Also good that you've written a fstest for this, that is definitely a useful > addition, although I suspect GFS2 guys added a test for this not so long > ago when testing their stuff. Maybe they have a pointer handy? Ah yes, that's xfstests commit d3cbdabf ("generic: Test page faults during read and write"). Thanks, Andreas > Honza > -- > Jan Kara > SUSE Labs, CR
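The prefault-and-retry scheme referenced above (b01b2d72da25) can be sketched as a rough userspace model. This is an illustration under stated assumptions, not the actual gfs2 code: `fault_in_iov()` stands in for fault_in_iov_iter_readable(), the `resident[]` flags model whether a buffer page is currently in memory, and the glock/pagefault_disable() steps are only marked in comments.

```c
#include <assert.h>
#include <stdbool.h>

#define NPAGES 4

/* Model of a user buffer whose pages may or may not be resident. */
struct buf {
	bool resident[NPAGES];
};

/* Stand-in for fault_in_iov_iter_readable(): make the pages resident. */
static void fault_in_iov(struct buf *b)
{
	for (int i = 0; i < NPAGES; i++)
		b->resident[i] = true;
}

/*
 * The transfer with page faults disabled: it stops at the first
 * non-resident page and returns a short count instead of faulting
 * while the problematic lock is held.
 */
static int dio_rw_nofault(const struct buf *b)
{
	int done = 0;

	while (done < NPAGES && b->resident[done])
		done++;
	return done;
}

/*
 * The retry loop: prefault outside the lock, disable faults for the
 * transfer, and retry if pages were evicted in between.
 */
static int dio_rw(struct buf *b)
{
	int done = 0;

	while (done < NPAGES) {
		fault_in_iov(b);	/* outside the glock */
		/* take glock, pagefault_disable() */
		done = dio_rw_nofault(b);
		/* pagefault_enable(), drop glock */
	}
	return done;
}
```

The livelock Jan describes arises when the transfer itself invalidates the pages just faulted in (buffer overlapping the dio range), so the count never advances; that case needs an error return or a buffered-IO fallback rather than this unconditional retry.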
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote: > On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote: > > > Well, it seems like you are talking about something else than the > > > existing cases in gfs2 and btrfs, that is you want full consistency > > > between direct I/O and buffered I/O. That's something nothing in the > > > kernel has ever provided, so I'd be curious why you think you need it > > > and want different semantics from everyone else? > > > > Because I like code that is correct. > > Well, start with explaining your definition of correctness, why everyone > else is "not correct", an how you can help fixing this correctness > problem in the existing kernel. Thanks for your cooperation! BTW, if you wanted a more serious answer, just asking for the commit message to be expanded would be a better way to ask...
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Thu, May 25, 2023 at 10:47:31AM +0200, Jan Kara wrote:
> If we submit direct IO that uses mapped file F at offset O as a buffer
> for direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return an error or fall back
> to buffered IO. But that should be fixable. Andreas?
>
> But if the buffer and direct IO range do not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is
> done to.

*nod*

Readahead triggered from the page fault path is another consideration.
No idea how that interacts with the gfs2 method; IIRC there's a hack in
the page fault path that says somewhere "we may be getting called via
gup(), don't invoke readahead". We could potentially kill that hack if
we lifted this to the VFS layer.

> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written an fstest for this, that is
> > > definitely a useful addition, although I suspect the GFS2 guys
> > > added a test for this not so long ago when testing their stuff.
> > > Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me)
> > with a different type of lock - two_state_shared_lock in my code,
> > it's like a rw lock except writers don't exclude other writers. That
> > way the DIO path can use it without singlethreading writes to a
> > single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with
> very similar semantics to mapping->invalidate_lock, just with this
> special lock type. What I'm kind of worried about with
> two_state_shared_lock as implemented in bcachefs is the fairness.
> AFAICS so far if someone is e.g. heavily faulting pages on a file,
> direct IO to that file can be starved indefinitely.
> That is IMHO not a good thing and I would not like to use this type of
> lock in VFS until this problem is resolved. But it should be fixable
> e.g. by introducing some kind of deadline for a waiter after which it
> will block acquisitions of the other lock state.

Yeah, my two_state_shared_lock is definitely at the quick-and-dirty
prototype level; the implementation would need work. Lockdep support
would be another hard requirement.

The deadline might be a good idea, OTOH it'd want tuning. Maybe
something like what rwsem does, where we block new read acquirers if
there's a writer waiting, would work.
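For reference, the lock semantics under discussion - any number of holders may share one state (buffered IO / page faults) or the other (DIO), but the two states exclude each other - can be modeled minimally like this. It is a single-threaded sketch of the counting rule only; the real two_state_shared_lock needs atomics, a waitqueue, and the fairness and lockdep work discussed above, and the names here are illustrative.

```c
#include <assert.h>

/*
 * v > 0 means state BUFFERED is held v times; v < 0 means state DIO is
 * held -v times.  Holders of the same state share; opposite states
 * conflict.
 */
enum { TS_BUFFERED, TS_DIO };

struct two_state_lock {
	long v;
};

static int two_state_trylock(struct two_state_lock *l, int state)
{
	if (state == TS_BUFFERED) {
		if (l->v < 0)
			return 0;	/* DIO side held: conflict */
		l->v++;
	} else {
		if (l->v > 0)
			return 0;	/* buffered side held: conflict */
		l->v--;
	}
	return 1;
}

static void two_state_unlock(struct two_state_lock *l, int state)
{
	if (state == TS_BUFFERED)
		l->v--;
	else
		l->v++;
}
```

The fairness problem is visible in the model: as long as the counter stays positive, trylocks for the DIO state keep failing, so a steady stream of buffered holders starves DIO indefinitely unless some deadline or writer-waiting rule blocks new same-state acquisitions.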
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O. That's something nothing in
> > > the kernel has ever provided, so I'd be curious why you think you
> > > need it and want different semantics from everyone else?
> >
> > Because I like code that is correct.
>
> Well, start with explaining your definition of correctness, why
> everyone else is "not correct", and how you can help fix this
> correctness problem in the existing kernel. Thanks for your
> cooperation!

A cache that isn't actually consistent is a _bug_. You're being
obsequious. And any time this has come up in previous discussions
(including at LSF), that was never up for debate, the only question has
been whether it was even possible to practically fix it.

The DIO code recognizes cache incoherency as something to be avoided by
shooting down the page cache both at the beginning of the IO _and again
at the end_. That's the kind of obvious hackery for a race condition
that we would like to avoid.

Regarding the consequences of this kind of bug: stale data exposed to
userspace, possibly stale data overwriting a write we acked, and, worse,
any filesystem state that hangs off the page cache being inconsistent
with the data on disk.

And look, we've been over all this before, so I don't see what this adds
to the discussion.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > Well, it seems like you are talking about something else than the
> > existing cases in gfs2 and btrfs, that is you want full consistency
> > between direct I/O and buffered I/O. That's something nothing in the
> > kernel has ever provided, so I'd be curious why you think you need it
> > and want different semantics from everyone else?
>
> Because I like code that is correct.

Well, start with explaining your definition of correctness, why everyone
else is "not correct", and how you can help fix this correctness problem
in the existing kernel. Thanks for your cooperation!
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue 23-05-23 12:49:06, Kent Overstreet wrote: > > > No, that's definitely handled (and you can see it in the code I linked), > > > and I wrote a torture test for fstests as well. > > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25 > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different > > way (by prefaulting pages from the iter before grabbing the problematic > > lock and then disabling page faults for the iomap_dio_rw() call). I guess > > we should somehow unify these schemes so that we don't have two mechanisms > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC. > > Oof, that sounds a bit sketchy. What happens if the dio call passes in > an address from the same address space? If we submit direct IO that uses mapped file F at offset O as a buffer for direct IO from file F, offset O, it will currently livelock in an indefinite retry loop. It should rather return error or fall back to buffered IO. But that should be fixable. Andreas? But if the buffer and direct IO range does not overlap, it will just happily work - iomap_dio_rw() invalidates only the range direct IO is done to. > What happens if we race with the pages we faulted in being evicted? We fault them in again and retry. > > Also good that you've written a fstest for this, that is definitely a useful > > addition, although I suspect GFS2 guys added a test for this not so long > > ago when testing their stuff. Maybe they have a pointer handy? > > More tests more good. > > So if we want to lift this scheme to the VFS layer, we'd start by > replacing the lock you added (grepping for it, the name escapes me) with > a different type of lock - two_state_shared_lock in my code, it's like a > rw lock except writers don't exclude other writers. That way the DIO > path can use it without singlethreading writes to a single file. 
Yes, I've noticed that you are introducing in bcachefs a lock with very
similar semantics to mapping->invalidate_lock, just with this special
lock type. What I'm kind of worried about with two_state_shared_lock as
implemented in bcachefs is fairness. AFAICS, as things stand, if someone
is e.g. heavily faulting pages on a file, direct IO to that file can be
starved indefinitely. That is IMHO not a good thing and I would not like
to use this type of lock in the VFS until this problem is resolved. But
it should be fixable, e.g. by introducing some kind of deadline for a
waiter after which it will block acquisitions of the other lock state.

								Honza
--
Jan Kara
SUSE Labs, CR
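[Editorial note: for readers unfamiliar with the lock type under discussion, here is a minimal userspace sketch of two_state_shared_lock semantics - holders of the same state share the lock, holders of opposite states exclude each other (unlike an rwlock, "writers" don't exclude other "writers"). This is an illustration built on pthreads, not the bcachefs implementation, and it deliberately exhibits the unfairness Jan describes: while one state keeps arriving, waiters for the other state can starve.]

```c
#include <assert.h>
#include <pthread.h>

/* Two-state shared lock: holders of state 0 and state 1 exclude each
 * other, but any number of holders of the *same* state may hold the
 * lock concurrently. */
struct two_state_lock {
	pthread_mutex_t	mtx;
	pthread_cond_t	cond;
	int		state;	/* state of the current holders */
	int		count;	/* number of current holders */
};

void ts_lock(struct two_state_lock *l, int state)
{
	pthread_mutex_lock(&l->mtx);
	/* Wait until the lock is free or already held in our state.
	 * Note the fairness problem: if our state is continuously
	 * re-acquired, waiters for the other state never get in. */
	while (l->count && l->state != state)
		pthread_cond_wait(&l->cond, &l->mtx);
	l->state = state;
	l->count++;
	pthread_mutex_unlock(&l->mtx);
}

void ts_unlock(struct two_state_lock *l)
{
	pthread_mutex_lock(&l->mtx);
	if (!--l->count)
		pthread_cond_broadcast(&l->cond);
	pthread_mutex_unlock(&l->mtx);
}
```

In the buffered-vs-DIO use being discussed, one state would be "page faults adding pages to the cache" and the other "direct IO shooting pages down"; the deadline-based anti-starvation Jan suggests would add a flag that, once a waiter has waited too long, blocks new acquisitions of the opposing state.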
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 11:43:32PM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> > No, this is fundamentally because userspace controls the ordering of
> > locking: the buffer passed to dio can point into any address space.
> > You can't solve this by changing the locking hierarchy.
> >
> > If you want to be able to have locking around adding things to the
> > pagecache so that things that bypass the pagecache can prevent
> > inconsistencies (and we do, the big one is fcollapse), and if you want
> > dio to be able to use that same locking (because otherwise dio will
> > also cause page cache inconsistency), this is the way to do it.
>
> Well, it seems like you are talking about something else than the
> existing cases in gfs2 and btrfs; that is, you want full consistency
> between direct I/O and buffered I/O. That's something nothing in the
> kernel has ever provided, so I'd be curious why you think you need it
> and want different semantics from everyone else?

Because I like code that is correct.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> No, this is fundamentally because userspace controls the ordering of
> locking: the buffer passed to dio can point into any address space.
> You can't solve this by changing the locking hierarchy.
>
> If you want to be able to have locking around adding things to the
> pagecache so that things that bypass the pagecache can prevent
> inconsistencies (and we do, the big one is fcollapse), and if you want
> dio to be able to use that same locking (because otherwise dio will
> also cause page cache inconsistency), this is the way to do it.

Well, it seems like you are talking about something else than the
existing cases in gfs2 and btrfs; that is, you want full consistency
between direct I/O and buffered I/O. That's something nothing in the
kernel has ever provided, so I'd be curious why you think you need it
and want different semantics from everyone else?
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written a fstest for this, that is definitely a
> useful addition, although I suspect GFS2 guys added a test for this not
> so long ago when testing their stuff. Maybe they have a pointer handy?

generic/708 is the btrfs version of this.

But I think all of the file systems that have this deadlock are actually
fundamentally broken, because they have a messed-up locking hierarchy
where page faults take the same lock that is held over the direct I/O
operation. And the right thing is to fix this. I have work in progress
for btrfs, and something similar should apply to gfs2, with the added
complication that it probably means a revision to their network
protocol.

I'm absolutely not in favour of adding workarounds for these kinds of
locking problems to the core kernel. I already feel bad for allowing the
small workaround in iomap for btrfs, as just fixing the locking back
then would have avoided massive ratholing.
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Oof, that sounds a bit sketchy. What happens if the dio call passes in
an address from the same address space? What happens if we race with the
pages we faulted in being evicted?

> Also good that you've written a fstest for this, that is definitely a
> useful addition, although I suspect GFS2 guys added a test for this not
> so long ago when testing their stuff. Maybe they have a pointer handy?

More tests more good.

So if we want to lift this scheme to the VFS layer, we'd start by
replacing the lock you added (grepping for it, the name escapes me) with
a different type of lock - two_state_shared_lock in my code; it's like a
rw lock except writers don't exclude other writers. That way the DIO
path can use it without singlethreading writes to a single file.
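[Editorial note: the gfs2 scheme quoted above - prefault the user buffer, then run the direct IO with page faults disabled, and retry if a prefaulted page got evicted in between - can be sketched as the following userspace simulation. All names here are illustrative stand-ins, not the actual kernel functions; the retry cap is added so the sketch cannot livelock the way the overlapping-buffer case discussed in this thread can.]

```c
#include <assert.h>

/* Simulation of the prefault-and-retry scheme from b01b2d72da25. */

#define EFAULT 14

static int page_resident;
static int evict_once;	/* test knob: simulate reclaim racing with us */

/* Stand-in for fault_in_iov_iter_readable(): make the buffer resident. */
static void prefault_buffer(void)
{
	page_resident = 1;
}

/* Stand-in for iomap_dio_rw() run under pagefault_disable(): it cannot
 * fault pages in itself, so it fails with -EFAULT if the prefaulted
 * page was evicted in the meantime. */
static int dio_rw_nofault(void)
{
	if (evict_once) {
		evict_once = 0;
		page_resident = 0;	/* the race: page reclaimed */
	}
	return page_resident ? 0 : -EFAULT;
}

static int dio_write(void)
{
	int ret, retries = 0;

	do {
		prefault_buffer();	/* prefault before taking the lock */
		/* take inode lock, pagefault_disable() ... */
		ret = dio_rw_nofault();
		/* ... pagefault_enable(), drop inode lock */
	} while (ret == -EFAULT && ++retries < 10);

	return ret;
}
```

This makes Jan's answer above concrete: losing the race with eviction just means another trip around the loop - but if the buffer overlaps the range the direct IO itself invalidates, every iteration re-evicts the page and the loop never terminates without a cap or a fallback.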
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Tue, May 23, 2023 at 09:21:56AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a
> > useful addition, although I suspect GFS2 guys added a test for this not
> > so long ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken, because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation. And the right thing is to fix this. I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

No, this is fundamentally because userspace controls the ordering of
locking: the buffer passed to dio can point into any address space. You
can't solve this by changing the locking hierarchy.

If you want to be able to have locking around adding things to the
pagecache so that things that bypass the pagecache can prevent
inconsistencies (and we do, the big one is fcollapse), and if you want
dio to be able to use that same locking (because otherwise dio will also
cause page cache inconsistency), this is the way to do it.
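[Editorial note: Kent's claim that no lock hierarchy can help can be made concrete with a small simulation. The dio path locks the target inode, then faulting in the user buffer locks whichever inode backs the buffer's mapping - so userspace picks the acquisition order, and the mirror-image call produces the reverse order. The names dio_write/take_lock below are made up for illustration, not kernel APIs.]

```c
#include <assert.h>
#include <string.h>

/* Each "inode" has a pagecache-add lock. We record acquisition order
 * instead of actually locking, to show who controls the ordering. */
struct inode { const char *name; };

static char order[16];

static void take_lock(struct inode *i)
{
	strcat(order, i->name);
}

/* Direct IO write to 'target' from a buffer that is an mmap of
 * 'buf_backing': lock the target, then the page fault on the buffer
 * locks the backing inode. */
static void dio_write(struct inode *target, struct inode *buf_backing)
{
	take_lock(target);	/* dio path: target's pagecache lock */
	take_lock(buf_backing);	/* get_user_pages(): buffer's lock */
}
```

Two processes issuing the two mirror-image calls concurrently take the same pair of locks in opposite orders - the classic AB-BA deadlock - and no fixed ranking of inodes fixes it, because userspace chooses which inode plays which role.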
Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > From: Kent Overstreet
> > >
> > > This is used by bcachefs to fix a page cache coherency issue with
> > > O_DIRECT writes.
> > >
> > > Also relevant: mapping->invalidate_lock, see below.
> > >
> > > O_DIRECT writes (and other filesystem operations that modify file
> > > data while bypassing the page cache) need to shoot down ranges of
> > > the page cache - and additionally, need locking to prevent those
> > > pages from being pulled back in.
> > >
> > > But O_DIRECT writes invoke the page fault handler (via
> > > get_user_pages), and the page fault handler will need to take that
> > > same lock - this is a classic recursive deadlock if userspace has
> > > mmaped the file they're DIO writing to and uses those pages for the
> > > buffer to write from, and it's a lock ordering deadlock in general.
> > >
> > > Thus we need a way to signal from the dio code to the page fault
> > > handler when we are already holding the pagecache add lock on an
> > > address space - this patch just adds a member to task_struct for
> > > this purpose. For now only bcachefs is implementing this locking,
> > > though it may be moved out of bcachefs and made available to other
> > > filesystems in the future.
> >
> > It would be nice to have at least a link to the code that's actually
> > using the field you are adding.
>
> Bit of a trick to link to a _later_ patch in the series from a commit
> message, but...
>
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

Thanks, and I'm sorry for the delay.
> > Also I think we were already through this discussion [1] and we ended
> > up agreeing that your scheme actually solves only the AA deadlock, but
> > a malicious userspace can easily create an AB-BA deadlock by running
> > direct IO to file A using mapped file B as a buffer *and* direct IO to
> > file B using mapped file A as a buffer.
>
> No, that's definitely handled (and you can see it in the code I linked),
> and I wrote a torture test for fstests as well.

I've checked the code and AFAICT it is all indeed handled. BTW, I've now
remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
way (by prefaulting pages from the iter before grabbing the problematic
lock and then disabling page faults for the iomap_dio_rw() call). I guess
we should somehow unify these schemes so that we don't have two mechanisms
for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Also good that you've written a fstest for this, that is definitely a
useful addition, although I suspect GFS2 guys added a test for this not
so long ago when testing their stuff. Maybe they have a pointer handy?

								Honza
--
Jan Kara
SUSE Labs, CR
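[Editorial note: the mechanism of the patch under discussion - a task_struct member recording which mapping's pagecache-add lock the task already holds, checked by the page fault handler so it can bail out instead of self-deadlocking - can be sketched as the following userspace simulation. The struct and function names are illustrative stand-ins for the kernel code, not the bcachefs implementation.]

```c
#include <assert.h>
#include <stddef.h>

struct address_space { int locked; };

/* Stand-in for task_struct with the field the patch adds. */
struct task {
	struct address_space *faults_disabled_mapping;
} current_task;

/* Page fault path: if this task already holds the pagecache-add lock
 * on the mapping being faulted, taking it again would deadlock, so
 * refuse and let the caller fall back. */
static int fault_in_page(struct address_space *mapping)
{
	if (current_task.faults_disabled_mapping == mapping)
		return -1;	/* would self-deadlock: bail out */
	return 0;		/* normal fault path takes the lock */
}

/* Direct IO path: lock the target mapping, record it in the task,
 * then fault in the user buffer (as get_user_pages would). */
static int dio_write(struct address_space *mapping,
		     struct address_space *buf_mapping)
{
	int ret;

	mapping->locked = 1;
	current_task.faults_disabled_mapping = mapping;

	ret = fault_in_page(buf_mapping);

	current_task.faults_disabled_mapping = NULL;
	mapping->locked = 0;
	return ret;
}
```

The AA case (buffer mmapped from the file being written) is detected by the pointer comparison; the AB-BA case between two files needs the additional ordering discipline discussed earlier in the thread, which the per-task field alone does not provide.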