Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-26 Thread Christoph Hellwig
On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> A cache that isn't actually consistent is a _bug_. You're being
> obsequious. And any time this has come up in previous discussions
> (including at LSF), that was never up for debate, the only question has
> been whether it was even possible to practically fix it.

That is not my impression.  But again, if you think it is useful,
go ahead and sell people on the idea.  But please prepare a series
that includes the rationale, performance tradeoffs and real-life
implications for it.  And do it on the existing code that people use
and not just your shiny new thing.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-26 Thread Kent Overstreet
On Fri, May 26, 2023 at 01:10:43AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > > I'm absolutely not in favour of adding workarounds for these kinds of
> > > > locking problems to the core kernel.  I already feel bad for allowing
> > > > the small workaround in iomap for btrfs, as just fixing the locking
> > > > back then would have avoided massive ratholing.
> > > 
> > > Please let me know when those btrfs changes are in a presentable shape ...
> > 
> > I would also be curious to know what btrfs needs and what the approach
> > is there.
> 
> btrfs has the extent locked, where "extent locked" is a somewhat magic
> range lock that actually includes different lock bits.  It does so
> because it clears the page writeback bit when the data made it to the
> media, but before the metadata required to find it is committed, and
> the extent lock prevents it from trying to do a readpage on something
> that has actually very recently been written back but not fully
> committed.  Once btrfs is changed to only clear the page writeback bit
> once the write is fully committed, as in other file systems, this extra
> level of locking can go away, and there are no more locks in the
> readpage path that are also taken by the direct I/O code.  With that
> a lot of code in btrfs working around this can go away, including the
> no fault direct I/O code.

wow, yeah, that is not how that is supposed to work...



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-26 Thread Kent Overstreet
On Fri, May 26, 2023 at 01:06:46AM -0700, Christoph Hellwig wrote:
> On Thu, May 25, 2023 at 04:50:39PM -0400, Kent Overstreet wrote:
> > A cache that isn't actually consistent is a _bug_. You're being
> > obsequious. And any time this has come up in previous discussions
> > (including at LSF), that was never up for debate, the only question has
> > been whether it was even possible to practically fix it.
> 
> That is not my impression.  But again, if you think it is useful,
> go ahead and sell people on the idea.  But please prepare a series
> that includes the rationale, performance tradeoffs and real-life
> implications for it.  And do it on the existing code that people use
> and not just your shiny new thing.

When I'm ready to lift this to the VFS level I will; it should simplify
locking overall and it'll be one less thing for people to worry about.

(i.e. the fact that even _readahead_ can pull in pages a dio is
invalidating is a really nice footgun if not worked around).

Right now though I've got more than enough on my plate just trying to
finally get bcachefs merged, I'm happy to explain what this is for but
I'm not ready for additional headaches or projects yet.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-26 Thread Christoph Hellwig
On Thu, May 25, 2023 at 07:20:46PM -0400, Kent Overstreet wrote:
> > > I'm absolutely not in favour of adding workarounds for these kinds of
> > > locking problems to the core kernel.  I already feel bad for allowing
> > > the small workaround in iomap for btrfs, as just fixing the locking
> > > back then would have avoided massive ratholing.
> > 
> > Please let me know when those btrfs changes are in a presentable shape ...
> 
> I would also be curious to know what btrfs needs and what the approach
> is there.

btrfs has the extent locked, where "extent locked" is a somewhat magic
range lock that actually includes different lock bits.  It does so
because it clears the page writeback bit when the data made it to the
media, but before the metadata required to find it is committed, and
the extent lock prevents it from trying to do a readpage on something
that has actually very recently been written back but not fully
committed.  Once btrfs is changed to only clear the page writeback bit
once the write is fully committed, as in other file systems, this extra
level of locking can go away, and there are no more locks in the
readpage path that are also taken by the direct I/O code.  With that
a lot of code in btrfs working around this can go away, including the
no fault direct I/O code.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Kent Overstreet
On Fri, May 26, 2023 at 02:05:26AM +0200, Andreas Grünbacher wrote:
> Oh, it's just that gfs2 uses one dlm lock per inode to control access
> to that inode. In the code, this is called the "inode glock" ---
> glocks being an abstraction above dlm locks --- but it boils down to
> dlm locks in the end. An additional layer of locking will only work
> correctly if all cluster nodes use the new locks consistently, so old
> cluster nodes will become incompatible. Those kinds of changes are
> hard.
> 
> But the additional lock taking would also hurt performance, forever,
> and I'd really like to avoid taking that hit.
> 
> It may not be obvious to everyone, but allowing page faults during
> reads and writes (i.e., while holding dlm locks) can lead to
> distributed deadlocks. We cannot just pretend to be a local
> filesystem.

Ah, gotcha. Same basic issue then, dio -> page fault means you're taking
two DLM locks simultaneously, so without lock ordering you get deadlock.

So that means promoting this to the VFS won't be useful to you
(because the VFS lock will be a local lock, not your DLM lock).

Do you have trylock() and a defined lock ordering you can check for
your glocks (inode number)? Here's the new and expanded commit message, in
case my approach is of interest to you:

commit 32858d0fb90b503c8c39e899ad5155e44639d3b5
Author: Kent Overstreet 
Date:   Wed Oct 16 15:03:50 2019 -0400

sched: Add task_struct->faults_disabled_mapping

There has been a long-standing page cache coherence bug with direct IO.
This provides part of a mechanism to fix it, currently just used by
bcachefs but potentially worth promoting to the VFS.

Direct IO evicts the range of the pagecache being read or written to.

For reads, we need dirty pages to be written to disk, so that the read
doesn't return stale data. For writes, we need to evict that range of
the pagecache so that it's not stale after the write completes.

However, without a locking mechanism to prevent those pages from being
re-added to the pagecache - by a buffered read or page fault - page
cache inconsistency is still possible.

This isn't necessarily just an issue for userspace when they're playing
games; filesystems may hang arbitrary state off the pagecache, and so
page cache inconsistency may cause real filesystem bugs, depending on
the filesystem. This is less of an issue for iomap based filesystems,
but e.g. buffer heads cache disk block mappings (!) and attach them
to the pagecache, and bcachefs attaches disk reservations to pagecache
pages.

This issue has been hard to fix, because
 - we need to add a lock (henceforth called pagecache_add_lock), which
   would be held for the duration of the direct IO
 - page faults add pages to the page cache, thus need to take the same
   lock
 - dio -> gup -> page fault thus can deadlock

And we cannot enforce a lock ordering with this lock, since userspace
will be controlling the lock ordering (via the fd and buffer arguments
to direct IOs), so we need a different method of deadlock avoidance.

We need to tell the page fault handler that we're already holding a
pagecache_add_lock, and since plumbing it through the entire gup() path
would be highly impractical, this adds a field to task_struct.

Then the full method is:
 - in the dio path, when we take first pagecache_add_lock, note the
   mapping in task_struct
 - in the page fault handler, if faults_disabled_mapping is set, we
   check if it's the same mapping as the one we're taking a page fault
   for, and if so return an error.

   Then we check lock ordering: if there's a lock ordering violation and
   trylock fails, we'll have to cycle the locks and return an error that
   tells the DIO path to retry: faults_disabled_mapping is also used for
   signalling "locks were dropped, please retry".

Also relevant to this patch: mapping->invalidate_lock.
mapping->invalidate_lock provides most of the required semantics - it's
used by truncate/fallocate to block pages being added to the pagecache.
However, since it's a rwsem, direct IOs would need to take the write
side in order to block page cache adds, and would then be exclusive with
each other - we'll need a new type of lock to pair with this approach.

Signed-off-by: Kent Overstreet 
Cc: Jan Kara 
Cc: Darrick J. Wong 
Cc: linux-fsde...@vger.kernel.org
Cc: Andreas Grünbacher 



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Fri, May 26, 2023 at 1:20 AM Kent Overstreet wrote:
> On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> > On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig wrote:
> > > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a 
> > > > different
> > > > way (by prefaulting pages from the iter before grabbing the problematic
> > > > lock and then disabling page faults for the iomap_dio_rw() call). I 
> > > > guess
> > > > we should somehow unify these schemes so that we don't have two 
> > > > mechanisms
> > > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > > >
> > > > Also good that you've written a fstest for this, that is definitely a 
> > > > useful
> > > > addition, although I suspect GFS2 guys added a test for this not so long
> > > > ago when testing their stuff. Maybe they have a pointer handy?
> > >
> > > generic/708 is the btrfs version of this.
> > >
> > > But I think all of the file systems that have this deadlock are actually
> > > fundamentally broken because they have a messed-up locking hierarchy
> > > where page faults take the same lock that is held over the direct I/O
> > > operation.  And the right thing is to fix this.  I have work in progress
> > > for btrfs, and something similar should apply to gfs2, with the added
> > > complication that it probably means a revision to their network
> > > protocol.
> >
> > We do disable page faults, and there can be deadlocks in page fault
> > handlers while no page faults are allowed.
> >
> > I'm roughly aware of the locking hierarchy that other filesystems use,
> > and that's something we want to avoid because of two reasons: (1) it
> > would be an incompatible change, and (2) we want to avoid cluster-wide
> > locking operations as much as possible because they are very slow.
> >
> > These kinds of locking conflicts are so rare in practice that the
> > theoretical inefficiency of having to retry the operation doesn't
> > matter.
>
> Would you be willing to expand on that? I'm wondering if this would
> simplify things for gfs2, but you mention locking hierarchy being an
> incompatible change - how does that work?

Oh, it's just that gfs2 uses one dlm lock per inode to control access
to that inode. In the code, this is called the "inode glock" ---
glocks being an abstraction above dlm locks --- but it boils down to
dlm locks in the end. An additional layer of locking will only work
correctly if all cluster nodes use the new locks consistently, so old
cluster nodes will become incompatible. Those kinds of changes are
hard.

But the additional lock taking would also hurt performance, forever,
and I'd really like to avoid taking that hit.

It may not be obvious to everyone, but allowing page faults during
reads and writes (i.e., while holding dlm locks) can lead to
distributed deadlocks. We cannot just pretend to be a local
filesystem.

Thanks,
Andreas

> > > I'm absolutely not in favour of adding workarounds for these kinds of
> > > locking problems to the core kernel.  I already feel bad for allowing
> > > the small workaround in iomap for btrfs, as just fixing the locking
> > > back then would have avoided massive ratholing.
> >
> > Please let me know when those btrfs changes are in a presentable shape ...
>
> I would also be curious to know what btrfs needs and what the approach
> is there.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Kent Overstreet
On Fri, May 26, 2023 at 12:25:31AM +0200, Andreas Grünbacher wrote:
> On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig wrote:
> > On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > >
> > > Also good that you've written a fstest for this, that is definitely a 
> > > useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > generic/708 is the btrfs version of this.
> >
> > But I think all of the file systems that have this deadlock are actually
> > fundamentally broken because they have a messed-up locking hierarchy
> > where page faults take the same lock that is held over the direct I/O
> > operation.  And the right thing is to fix this.  I have work in progress
> > for btrfs, and something similar should apply to gfs2, with the added
> > complication that it probably means a revision to their network
> > protocol.
> 
> We do disable page faults, and there can be deadlocks in page fault
> handlers while no page faults are allowed.
> 
> I'm roughly aware of the locking hierarchy that other filesystems use,
> and that's something we want to avoid because of two reasons: (1) it
> would be an incompatible change, and (2) we want to avoid cluster-wide
> locking operations as much as possible because they are very slow.
> 
> These kinds of locking conflicts are so rare in practice that the
> theoretical inefficiency of having to retry the operation doesn't
> matter.

Would you be willing to expand on that? I'm wondering if this would
simplify things for gfs2, but you mention locking hierarchy being an
incompatible change - how does that work?

> 
> > I'm absolutely not in favour of adding workarounds for these kinds of
> > locking problems to the core kernel.  I already feel bad for allowing
> > the small workaround in iomap for btrfs, as just fixing the locking
> > back then would have avoided massive ratholing.
> 
> Please let me know when those btrfs changes are in a presentable shape ...

I would also be curious to know what btrfs needs and what the approach
is there.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Thu, May 25, 2023 at 10:56 AM Jan Kara wrote:
> On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > > No, that's definitely handled (and you can see it in the code I linked),
> > > > and I wrote a torture test for fstests as well.
> > >
> > > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > > way (by prefaulting pages from the iter before grabbing the problematic
> > > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > > we should somehow unify these schemes so that we don't have two mechanisms
> > > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Oof, that sounds a bit sketchy. What happens if the dio call passes in
> > an address from the same address space?
>
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?

Yes, I guess. Thanks for the heads-up.

Andreas

> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.
>
> > What happens if we race with the pages we faulted in being evicted?
>
> We fault them in again and retry.
>
> > > Also good that you've written a fstest for this, that is definitely a 
> > > useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> >
> > More tests more good.
> >
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code; it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
>
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.
>
> Honza
> --
> Jan Kara 
> SUSE Labs, CR



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Tue, May 23, 2023 at 6:28 PM Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> >
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
>
> generic/708 is the btrfs version of this.
>
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

We do disable page faults, and there can be deadlocks in page fault
handlers while no page faults are allowed.

I'm roughly aware of the locking hierarchy that other filesystems use,
and that's something we want to avoid because of two reasons: (1) it
would be an incompatible change, and (2) we want to avoid cluster-wide
locking operations as much as possible because they are very slow.

These kinds of locking conflicts are so rare in practice that the
theoretical inefficiency of having to retry the operation doesn't
matter.

> I'm absolutely not in favour of adding workarounds for these kinds of
> locking problems to the core kernel.  I already feel bad for allowing
> the small workaround in iomap for btrfs, as just fixing the locking
> back then would have avoided massive ratholing.

Please let me know when those btrfs changes are in a presentable shape ...

Thanks,
Andreas



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Andreas Grünbacher
On Tue, May 23, 2023 at 3:37 PM Jan Kara wrote:
> On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > > From: Kent Overstreet 
> > > >
> > > > This is used by bcachefs to fix a page cache coherency issue with
> > > > O_DIRECT writes.
> > > >
> > > > Also relevant: mapping->invalidate_lock, see below.
> > > >
> > > > O_DIRECT writes (and other filesystem operations that modify file data
> > > > while bypassing the page cache) need to shoot down ranges of the page
> > > > cache - and additionally, need locking to prevent those pages from
> > > > being pulled back in.
> > > >
> > > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > > and the page fault handler will need to take that same lock - this is a
> > > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > > writing to and uses those pages for the buffer to write from, and it's a
> > > > lock ordering deadlock in general.
> > > >
> > > > Thus we need a way to signal from the dio code to the page fault handler
> > > > when we already are holding the pagecache add lock on an address space -
> > > > this patch just adds a member to task_struct for this purpose. For now
> > > > only bcachefs is implementing this locking, though it may be moved out
> > > > of bcachefs and made available to other filesystems in the future.
> > >
> > > It would be nice to have at least a link to the code that's actually using
> > > the field you are adding.
> >
> > Bit of a trick to link to a _later_ patch in the series from a commit
> > message, but...
> >
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> > https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454
>
> Thanks and I'm sorry for the delay.
>
> > > Also I think we were already through this discussion [1] and we ended up
> > > agreeing that your scheme actually solves only the AA deadlock but a
> > > malicious userspace can easily create AB BA deadlock by running direct IO
> > > to file A using mapped file B as a buffer *and* direct IO to file B using
> > > mapped file A as a buffer.
> >
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
>
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
>
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

Ah yes, that's xfstests commit d3cbdabf ("generic: Test page faults
during read and write").

Thanks,
Andreas

> Honza
> --
> Jan Kara 
> SUSE Labs, CR



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Kent Overstreet
On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O.  That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> > 
> > Because I like code that is correct.
> 
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel.  Thanks for your cooperation!

BTW, if you wanted a more serious answer, just asking for the commit
message to be expanded would be a better way to ask...



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Kent Overstreet
On Thu, May 25, 2023 at 10:47:31AM +0200, Jan Kara wrote:
> If we submit direct IO that uses mapped file F at offset O as a buffer for
> direct IO from file F, offset O, it will currently livelock in an
> indefinite retry loop. It should rather return error or fall back to
> buffered IO. But that should be fixable. Andreas?
> 
> But if the buffer and direct IO range does not overlap, it will just
> happily work - iomap_dio_rw() invalidates only the range direct IO is done
> to.

*nod*

Readahead triggered from the page fault path is another consideration.
No idea how that interacts with the gfs2 method; IIRC there's a hack in
the page fault path that says somewhere "we may be getting called via
gup(), don't invoke readahead".

We could potentially kill that hack if we lifted this to the VFS layer.

> 
> > What happens if we race with the pages we faulted in being evicted?
> 
> We fault them in again and retry.
> 
> > > Also good that you've written a fstest for this, that is definitely a 
> > > useful
> > > addition, although I suspect GFS2 guys added a test for this not so long
> > > ago when testing their stuff. Maybe they have a pointer handy?
> > 
> > More tests more good.
> > 
> > So if we want to lift this scheme to the VFS layer, we'd start by
> > replacing the lock you added (grepping for it, the name escapes me) with
> > a different type of lock - two_state_shared_lock in my code; it's like a
> > rw lock except writers don't exclude other writers. That way the DIO
> > path can use it without singlethreading writes to a single file.
> 
> Yes, I've noticed that you are introducing in bcachefs a lock with very
> similar semantics to mapping->invalidate_lock, just with this special lock
> type. What I'm kind of worried about with two_state_shared_lock as
> implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
> heavily faulting pages on a file, direct IO to that file can be starved
> indefinitely. That is IMHO not a good thing and I would not like to use
> this type of lock in VFS until this problem is resolved. But it should be
> fixable e.g. by introducing some kind of deadline for a waiter after which
> it will block acquisitions of the other lock state.

Yeah, my two_state_shared_lock is definitely at the quick-and-dirty
prototype level; the implementation would need work. Lockdep support
would be another hard requirement.

The deadline might be a good idea; OTOH it'd want tuning. Maybe
something like what rwsem does, where we block new read acquirers if
there's a writer waiting, would work.
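
For reference, the core of it is something like this (a minimal,
deliberately unfair sketch; the tssl_* names are made up, this is not
the actual bcachefs implementation):

  #include <linux/spinlock.h>
  #include <linux/wait.h>

  /* init: spin_lock_init(), count = 0, init_waitqueue_head() */
  struct two_state_shared_lock {
          spinlock_t              lock;
          long                    count;  /* > 0: one state, < 0: the other */
          wait_queue_head_t       wait;
  };

  /* dir: +1 for one state (e.g. pagecache adds), -1 for the other (dio) */
  static bool tssl_trylock(struct two_state_shared_lock *l, long dir)
  {
          bool ret = false;

          spin_lock(&l->lock);
          /* unlocked, or already held in our state: share it */
          if (!l->count || (l->count > 0) == (dir > 0)) {
                  l->count += dir;
                  ret = true;
          }
          spin_unlock(&l->lock);
          return ret;
  }

  static void tssl_lock(struct two_state_shared_lock *l, long dir)
  {
          wait_event(l->wait, tssl_trylock(l, dir));
  }

  static void tssl_unlock(struct two_state_shared_lock *l, long dir)
  {
          bool wake;

          spin_lock(&l->lock);
          l->count -= dir;
          wake = !l->count;
          spin_unlock(&l->lock);

          if (wake)
                  wake_up_all(&l->wait);
  }

As written this has exactly the starvation problem you describe: a
steady stream of acquirers in one state keeps count nonzero forever, so
the other state never gets in. A waiter deadline, or refusing new
acquirers once the other state is waiting, would slot into
tssl_trylock().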



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Kent Overstreet
On Thu, May 25, 2023 at 01:58:13AM -0700, Christoph Hellwig wrote:
> On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > > Well, it seems like you are talking about something else than the
> > > existing cases in gfs2 and btrfs, that is you want full consistency
> > > between direct I/O and buffered I/O.  That's something nothing in the
> > > kernel has ever provided, so I'd be curious why you think you need it
> > > and want different semantics from everyone else?
> > 
> > Because I like code that is correct.
> 
> Well, start with explaining your definition of correctness, why everyone
> else is "not correct", an how you can help fixing this correctness
> problem in the existing kernel.  Thanks for your cooperation!

A cache that isn't actually consistent is a _bug_. You're being
obsequious. And any time this has come up in previous discussions
(including at LSF), that was never up for debate, the only question has
been whether it was even possible to practically fix it.

The DIO code recognizes cache incoherency as something to be avoided by
shooting down the page cache both at the beginning of the IO _and again
at the end_.  That's the kind of obvious hackery for a race condition
that we would like to avoid.
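
Schematically, the dio write path today looks something like this (a
sketch, not the exact generic/iomap code; submit_dio_and_wait() is a
made-up stand-in):

  static int dio_write(struct address_space *mapping, loff_t pos,
                       size_t len)
  {
          pgoff_t first = pos >> PAGE_SHIFT;
          pgoff_t last = (pos + len - 1) >> PAGE_SHIFT;
          int ret;

          /* flush dirty pages in the range... */
          ret = filemap_write_and_wait_range(mapping, pos, pos + len - 1);
          if (ret)
                  return ret;

          /* ...and shoot the range down so the cache isn't stale */
          ret = invalidate_inode_pages2_range(mapping, first, last);
          if (ret)
                  return ret;

          ret = submit_dio_and_wait(mapping, pos, len);

          /*
           * Nothing stopped a page fault or buffered read from pulling
           * the range back in while the IO was in flight, so invalidate
           * again and hope: this second shootdown is the hack
           */
          invalidate_inode_pages2_range(mapping, first, last);
          return ret;
  }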

The consequences of this kind of bug: stale data exposed to userspace,
possibly stale data overwriting a write we acked, and, worse, any
filesystem state that hangs off the page cache becoming inconsistent
with the data on disk.

And look, we've been over all this before, so I don't see what this adds
to the discussion.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Christoph Hellwig
On Wed, May 24, 2023 at 04:09:02AM -0400, Kent Overstreet wrote:
> > Well, it seems like you are talking about something else than the
> > existing cases in gfs2 and btrfs, that is you want full consistency
> > between direct I/O and buffered I/O.  That's something nothing in the
> > kernel has ever provided, so I'd be curious why you think you need it
> > and want different semantics from everyone else?
> 
> Because I like code that is correct.

Well, start with explaining your definition of correctness, why everyone
else is "not correct", an how you can help fixing this correctness
problem in the existing kernel.  Thanks for your cooperation!



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-25 Thread Jan Kara
On Tue 23-05-23 12:49:06, Kent Overstreet wrote:
> > > No, that's definitely handled (and you can see it in the code I linked),
> > > and I wrote a torture test for fstests as well.
> > 
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> 
> Oof, that sounds a bit sketchy. What happens if the dio call passes in
> an address from the same address space?

If we submit direct IO that uses mapped file F at offset O as a buffer for
direct IO from file F, offset O, it will currently livelock in an
indefinite retry loop. It should rather return error or fall back to
buffered IO. But that should be fixable. Andreas?

But if the buffer and direct IO range does not overlap, it will just
happily work - iomap_dio_rw() invalidates only the range direct IO is done
to.

> What happens if we race with the pages we faulted in being evicted?

We fault them in again and retry.

> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
> 
> More tests more good.
> 
> So if we want to lift this scheme to the VFS layer, we'd start by
> replacing the lock you added (grepping for it, the name escapes me) with
> a different type of lock - two_state_shared_lock in my code; it's like a
> rw lock except writers don't exclude other writers. That way the DIO
> path can use it without singlethreading writes to a single file.

Yes, I've noticed that you are introducing in bcachefs a lock with very
similar semantics to mapping->invalidate_lock, just with this special lock
type. What I'm kind of worried about with two_state_shared_lock as
implemented in bcachefs is the fairness. AFAICS so far if someone is e.g.
heavily faulting pages on a file, direct IO to that file can be starved
indefinitely. That is IMHO not a good thing and I would not like to use
this type of lock in VFS until this problem is resolved. But it should be
fixable e.g. by introducing some kind of deadline for a waiter after which
it will block acquisitions of the other lock state.

Honza
-- 
Jan Kara 
SUSE Labs, CR



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-24 Thread Kent Overstreet
On Tue, May 23, 2023 at 11:43:32PM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> > No, this is fundamentally because userspace controls the ordering of
> > locking because the buffer passed to dio can point into any address
> > space. You can't solve this by changing the locking hierarchy.
> > 
> > If you want to be able to have locking around adding things to the
> > pagecache so that things that bypass the pagecache can prevent
> > inconsistencies (and we do, the big one is fcollapse), and if you want
> > dio to be able to use that same locking (because otherwise dio will also
> > cause page cache inconsistency), this is the way to do it.
> 
> Well, it seems like you are talking about something else than the
> existing cases in gfs2 and btrfs, that is you want full consistency
> between direct I/O and buffered I/O.  That's something nothing in the
> kernel has ever provided, so I'd be curious why you think you need it
> and want different semantics from everyone else?

Because I like code that is correct.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-24 Thread Christoph Hellwig
On Tue, May 23, 2023 at 12:35:35PM -0400, Kent Overstreet wrote:
> No, this is fundamentally because userspace controls the ordering of
> locking because the buffer passed to dio can point into any address
> space. You can't solve this by changing the locking hierarchy.
> 
> If you want to be able to have locking around adding things to the
> pagecache so that things that bypass the pagecache can prevent
> inconsistencies (and we do, the big one is fcollapse), and if you want
> dio to be able to use that same locking (because otherwise dio will also
> cause page cache inconsistency), this is the way to do it.

Well, it seems like you are talking about something else than the
existing cases in gfs2 and btrfs, that is you want full consistency
between direct I/O and buffered I/O.  That's something nothing in the
kernel has ever provided, so I'd be curious why you think you need it
and want different semantics from everyone else?



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Christoph Hellwig
On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> 
> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

generic/708 is the btrfs version of this.

But I think all of the file systems that have this deadlock are actually
fundamentally broken because they have a messed-up locking hierarchy
where page faults take the same lock that is held over the direct I/O
operation.  And the right thing is to fix this.  I have work in progress
for btrfs, and something similar should apply to gfs2, with the added
complication that it probably means a revision to their network
protocol.

I'm absolutely not in favour of adding workarounds for these kinds of
locking problems to the core kernel.  I already feel bad for allowing
the small workaround in iomap for btrfs, as just fixing the locking
back then would have avoided massive ratholing.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Kent Overstreet
> > No, that's definitely handled (and you can see it in the code I linked),
> > and I wrote a torture test for fstests as well.
> 
> I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> way (by prefaulting pages from the iter before grabbing the problematic
> lock and then disabling page faults for the iomap_dio_rw() call). I guess
> we should somehow unify these schemes so that we don't have two mechanisms
> for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Oof, that sounds a bit sketchy. What happens if the dio call passes in
an address from the same address space? What happens if we race with the
pages we faulted in being evicted?

> Also good that you've written a fstest for this, that is definitely a useful
> addition, although I suspect GFS2 guys added a test for this not so long
> ago when testing their stuff. Maybe they have a pointer handy?

More tests more good.

So if we want to lift this scheme to the VFS layer, we'd start by
replacing the lock you added (grepping for it, the name escapes me) with
a different type of lock - two_state_shared_lock in my code; it's like a
rw lock except writers don't exclude other writers. That way the DIO
path can use it without singlethreading writes to a single file.



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Kent Overstreet
On Tue, May 23, 2023 at 09:21:56AM -0700, Christoph Hellwig wrote:
> On Tue, May 23, 2023 at 03:34:31PM +0200, Jan Kara wrote:
> > I've checked the code and AFAICT it is all indeed handled. BTW, I've now
> > remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
> > ("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
> > way (by prefaulting pages from the iter before grabbing the problematic
> > lock and then disabling page faults for the iomap_dio_rw() call). I guess
> > we should somehow unify these schemes so that we don't have two mechanisms
> > for avoiding exactly the same deadlock. Adding GFS2 guys to CC.
> > 
> > Also good that you've written a fstest for this, that is definitely a useful
> > addition, although I suspect GFS2 guys added a test for this not so long
> > ago when testing their stuff. Maybe they have a pointer handy?
> 
> generic/708 is the btrfs version of this.
> 
> But I think all of the file systems that have this deadlock are actually
> fundamentally broken because they have a messed-up locking hierarchy
> where page faults take the same lock that is held over the direct I/O
> operation.  And the right thing is to fix this.  I have work in progress
> for btrfs, and something similar should apply to gfs2, with the added
> complication that it probably means a revision to their network
> protocol.

No, this is fundamentally because userspace controls the ordering of
locking because the buffer passed to dio can point into any address
space. You can't solve this by changing the locking hierarchy.

If you want to be able to have locking around adding things to the
pagecache so that things that bypass the pagecache can prevent
inconsistencies (and we do, the big one is fcollapse), and if you want
dio to be able to use that same locking (because otherwise dio will also
cause page cache inconsistency), this is the way to do it.
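
Concretely, userspace can construct the AB-BA case like this (a sketch,
error handling omitted):

  #define _GNU_SOURCE             /* for O_DIRECT */
  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  #define LEN 4096

  /* O_DIRECT write to dst using an mmap of src as the buffer */
  static void dio_write_from_mapping(const char *dst, const char *src)
  {
          int dfd = open(dst, O_RDWR | O_DIRECT);
          int sfd = open(src, O_RDWR);
          char *buf = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                           MAP_SHARED, sfd, 0);

          /* gup faults src's pages in while dst's dio locks are held */
          pwrite(dfd, buf, LEN, 0);

          munmap(buf, LEN);
          close(sfd);
          close(dfd);
  }

  /*
   * thread 1: dio_write_from_mapping("A", "B");
   * thread 2: dio_write_from_mapping("B", "A");
   *
   * Each thread holds its target file's lock while faulting on the
   * other file's mapping - AB-BA, with the order chosen by userspace.
   */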



Re: [Cluster-devel] [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping

2023-05-23 Thread Jan Kara
On Wed 10-05-23 02:18:45, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> > On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > > From: Kent Overstreet 
> > > 
> > > This is used by bcachefs to fix a page cache coherency issue with
> > > O_DIRECT writes.
> > > 
> > > Also relevant: mapping->invalidate_lock, see below.
> > > 
> > > O_DIRECT writes (and other filesystem operations that modify file data
> > > while bypassing the page cache) need to shoot down ranges of the page
> > > cache - and additionally, need locking to prevent those pages from
> > > being pulled back in.
> > > 
> > > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > > and the page fault handler will need to take that same lock - this is a
> > > classic recursive deadlock if userspace has mmaped the file they're DIO
> > > writing to and uses those pages for the buffer to write from, and it's a
> > > lock ordering deadlock in general.
> > > 
> > > Thus we need a way to signal from the dio code to the page fault handler
> > > when we already are holding the pagecache add lock on an address space -
> > > this patch just adds a member to task_struct for this purpose. For now
> > > only bcachefs is implementing this locking, though it may be moved out
> > > of bcachefs and made available to other filesystems in the future.
> > 
> > It would be nice to have at least a link to the code that's actually using
> > the field you are adding.
> 
> Bit of a trick to link to a _later_ patch in the series from a commit
> message, but...
> 
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

Thanks and I'm sorry for the delay.

> > Also I think we were already through this discussion [1] and we ended up
> > agreeing that your scheme actually solves only the AA deadlock but a
> > malicious userspace can easily create AB BA deadlock by running direct IO
> > to file A using mapped file B as a buffer *and* direct IO to file B using
> > mapped file A as a buffer.
> 
> No, that's definitely handled (and you can see it in the code I linked),
> and I wrote a torture test for fstests as well.

I've checked the code and AFAICT it is all indeed handled. BTW, I've now
remembered that GFS2 has dealt with the same deadlocks - b01b2d72da25
("gfs2: Fix mmap + page fault deadlocks for direct I/O") - in a different
way (by prefaulting pages from the iter before grabbing the problematic
lock and then disabling page faults for the iomap_dio_rw() call). I guess
we should somehow unify these schemes so that we don't have two mechanisms
for avoiding exactly the same deadlock. Adding GFS2 guys to CC.

Also good that you've written a fstest for this, that is definitely a useful
addition, although I suspect GFS2 guys added a test for this not so long
ago when testing their stuff. Maybe they have a pointer handy?

Honza
-- 
Jan Kara 
SUSE Labs, CR