Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
Hi,

On Sat, 2021-06-12 at 21:35 +, Al Viro wrote:
> On Sat, Jun 12, 2021 at 09:05:40PM +, Al Viro wrote:
>
> > Is the above an accurate description of the mainline situation
> > there? In particular, normal read doesn't seem to bother with locks
> > at all. What exactly are those cluster locks for in O_DIRECT read?
>
> BTW, assuming the lack of contention, how costly is
> dropping/regaining such cluster lock?
>
The answer is that it depends...

The locking modes for glocks for inodes look like this:

==========  ==========  ==============  ==========  ==============
Glock mode  Cache data  Cache Metadata  Dirty Data  Dirty Metadata
==========  ==========  ==============  ==========  ==============
    UN          No            No            No            No
    SH          Yes           Yes           No            No
    DF          No            Yes           No            No
    EX          Yes           Yes           Yes           Yes
==========  ==========  ==============  ==========  ==============

The above is a copy & paste from Documentation/filesystems/gfs2-glocks.rst.
If you think of these locks as cache control, then it makes a lot more
sense.

The DF (deferred) mode is there only for DIO. It is a shared lock mode
that is incompatible with the normal SH mode. That is because it is ok
to cache data pages under SH but not under DF. That is the only other
difference between the two shared modes.

DF is used for both read and write under DIO, meaning that it is
possible for multiple nodes to read & write the same file at the same
time with DIO, leaving any synchronisation to the application layer. As
soon as one performs an operation which alters the metadata tree
(truncate, extend, hole filling) then we drop back to the normal EX
mode, so DF is only used for preallocated files.

Your original question though was about the cost of locking, and there
is a wide variation according to circumstances. The glock layer caches
the results of the DLM requests and will continue to hold glocks gained
from remote nodes until either memory pressure or a request to drop the
lock from another node is received. When no other nodes are interested
in a lock, all such cluster lock activity is local.
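The mode/cache relationship above can be captured in a small table-driven sketch. This is a toy model for illustration only, not the kernel's implementation: the enum values, `mode_caps`, and `modes_compatible()` are made-up names, and the real LM_ST_* constants and DLM compatibility logic differ.

```c
#include <stdbool.h>

/* Toy glock modes, mirroring the table from gfs2-glocks.rst
 * (not the kernel's LM_ST_* values). */
enum glock_mode { GL_UN, GL_SH, GL_DF, GL_EX };

/* What a node holding the lock in each mode may cache or dirty. */
struct mode_caps {
	bool cache_data, cache_metadata, dirty_data, dirty_metadata;
};

static const struct mode_caps caps[] = {
	[GL_UN] = { false, false, false, false },
	[GL_SH] = { true,  true,  false, false },
	[GL_DF] = { false, true,  false, false },
	[GL_EX] = { true,  true,  true,  true  },
};

/* Two nodes may hold the lock at once only if neither mode permits
 * cached state the other could invalidate: SH+SH and DF+DF are fine,
 * but SH+DF is not, because SH caches data pages that a DF (direct
 * I/O) holder bypasses.  EX is exclusive; UN conflicts with nothing. */
static bool modes_compatible(enum glock_mode a, enum glock_mode b)
{
	if (a == GL_UN || b == GL_UN)
		return true;
	if (a == GL_EX || b == GL_EX)
		return false;
	return a == b;	/* the two shared modes exclude each other */
}
```

The `a == b` line is the interesting one: it encodes why DIO on one node forces page-cache invalidation on nodes holding SH.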
There is a cost to it though, and if (for example) you tried to take and
drop the cluster lock on every page, that would definitely be
noticeable. There are probably optimisations that could be done on what
is quite a complex code path, but in general that's what we've
discovered from testing. The introduction of ->readpages() vs the old
->readpage() made a measurable difference, and likewise on the write
side; iomap has also shown performance increases due to the reduction in
locking on multi-page writes.

If there is another node that has an interest in a lock, then it can get
very expensive in terms of latency to regain a lock. Dropping the lock
to a lower mode may involve I/O (from EX mode) and journal flush(es),
and getting the lock back again involves I/O to other nodes and then a
wait while they finish what they are doing.

To avoid starvation there is a "minimum hold time", so that when a node
gains a glock, it is allowed to retain it, in the absence of local
requests, for a short period. The idea is that if a large number of
glock requests are being made on a node, each for a short time, we allow
several of those to complete before we do the expensive glock release to
another node.

See Documentation/filesystems/gfs2-glocks.rst for a longer explanation
and the locking order/rules between different lock types,

Steve.
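The minimum-hold-time idea can be sketched as a simple predicate. Again a toy model under stated assumptions: `GL_MIN_HOLD_MS`, `may_demote()`, and the millisecond timestamps are invented for illustration; the real glock state machine tracks this with jiffies and a per-glock hold-time field.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimum hold time; the real value is tunable. */
#define GL_MIN_HOLD_MS 50

/* Toy model: should this node honour a remote demote request now?
 * acquired_ms is when we gained the glock, now_ms is the current time,
 * and local_waiters stands in for queued local holders.  Holding on to
 * the glock for a short period lets a burst of local requests complete
 * before we pay for an expensive release to another node. */
static bool may_demote(uint64_t acquired_ms, uint64_t now_ms,
		       int local_waiters)
{
	if (local_waiters > 0)
		return false;	/* local work still queued */
	return now_ms - acquired_ms >= GL_MIN_HOLD_MS;
}
```

A caller that gets `false` back would simply requeue the demote work until the hold time has expired, which is roughly what the glock workqueue does.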
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Sat, Jun 12, 2021 at 09:05:40PM +, Al Viro wrote:
> Is the above an accurate description of the mainline situation there?
> In particular, normal read doesn't seem to bother with locks at all.
> What exactly are those cluster locks for in O_DIRECT read?

BTW, assuming the lack of contention, how costly is dropping/regaining
such cluster lock?
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Fri, Jun 11, 2021 at 04:25:10PM +, Al Viro wrote:
> On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:
>
> > Well, iomap_file_buffered_write() does that by using
> > iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> > in iomap_write_actor(), but the read and direct I/O side doesn't seem
> > to have equivalents. I suspect we can't just wrap
> > generic_file_read_iter() and iomap_dio_rw() calls in
> > pagefault_disable().
>
> And it will have zero effect on O_DIRECT case, so you get the same
> deadlocks right back. Because there you hit
> 	iomap_dio_bio_actor()
> 		bio_iov_iter_get_pages()
> 			get_user_pages_fast()
> 				faultin_page()
> 					handle_mm_fault()
> and at no point had CPU hit an exception, so disable_pagefault() will
> have no effect whatsoever. You can bloody well hit gfs2 readpage/mkwrite
> if the destination is in mmapped area of some GFS2 file. Do that
> while holding GFS2 locks and you are fucked.
>
> No amount of prefaulting will protect you, BTW - it might make the
> deadlock harder to reproduce, but that's it.

AFAICS, what we have is

* handle_mm_fault() can hit gfs2_fault(), which grabs per-inode lock
  shared
* handle_mm_fault() for write can hit gfs2_page_mkwrite(), which grabs
  per-inode lock exclusive
* pagefault_disable() prevents that for real page faults, but not for
  get_user_pages_fast()

* normal write:
	with inode_lock(inode)
	in a loop
		with per-inode lock exclusive
			__gfs2_iomap_get
			possibly gfs2_iomap_begin_write
			in a loop
				fault-in [read faults]
				iomap_write_begin
				copy_page_from_iter_atomic() [pf disabled]
				iomap_write_end
			gfs2_iomap_end

* O_DIRECT write:
	with inode_lock(inode) and per-inode lock deferred (?)
	in a loop
		__gfs2_iomap_get
		possibly gfs2_iomap_begin_write
		bio_iov_iter_get_pages(), map and submit [gup]
		gfs2_iomap_end

* normal read:
	in a loop
		filemap_get_pages (grab pages and readpage them if needed)
		copy_page_to_iter() for each [write faults]

* O_DIRECT read:
	with per-inode lock deferred
	in a loop
		__gfs2_iomap_get
		either iov_iter_zero() (on hole) [write faults]
		or bio_iov_iter_get_pages(), map and submit [gup]
		gfs2_iomap_end

... with some amount of waiting on buffered IO in case of O_DIRECT
writes.

Is the above an accurate description of the mainline situation there?
In particular, normal read doesn't seem to bother with locks at all.
What exactly are those cluster locks for in O_DIRECT read?
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:
> Well, iomap_file_buffered_write() does that by using
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> in iomap_write_actor(), but the read and direct I/O side doesn't seem
> to have equivalents. I suspect we can't just wrap
> generic_file_read_iter() and iomap_dio_rw() calls in
> pagefault_disable().

And it will have zero effect on O_DIRECT case, so you get the same
deadlocks right back. Because there you hit
	iomap_dio_bio_actor()
		bio_iov_iter_get_pages()
			get_user_pages_fast()
				faultin_page()
					handle_mm_fault()
and at no point had CPU hit an exception, so disable_pagefault() will
have no effect whatsoever. You can bloody well hit gfs2 readpage/mkwrite
if the destination is in mmapped area of some GFS2 file. Do that while
holding GFS2 locks and you are fucked.

No amount of prefaulting will protect you, BTW - it might make the
deadlock harder to reproduce, but that's it.
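The problematic pattern the thread keeps returning to can be written as a short userspace sketch: read() from a file into a buffer that is an mmap() of that same file. On most local filesystems this simply works (so the test passes on ext4/tmpfs); on gfs2 the copy-out faults on the mapping while the read path holds the inode glock, which is the deadlock under discussion. The function name and scratch path are made up for this demonstration.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Read from a file into that file's own shared mapping.  Returns 0 if
 * the read completed and the mapping holds the expected bytes. */
static int read_into_own_mapping(const char *path)
{
	static const char data[16] = "0123456789abcdef";
	char *map = MAP_FAILED;
	int ret = -1;

	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (write(fd, data, sizeof(data)) == (ssize_t)sizeof(data))
		map = mmap(NULL, sizeof(data), PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
	if (map != MAP_FAILED) {
		/* The destination of the read is the file's own mapping:
		 * the first store into map[] faults (and on gfs2 would
		 * hit ->fault/->page_mkwrite) while read() internals are
		 * working on the same inode. */
		if (pread(fd, map, sizeof(data), 0) == (ssize_t)sizeof(data) &&
		    memcmp(map, data, sizeof(data)) == 0)
			ret = 0;
		munmap(map, sizeof(data));
	}
	close(fd);
	unlink(path);
	return ret;
}
```

Note that prefaulting the mapping before the read only narrows the window, as Al points out above: the pages can be reclaimed between the prefault and the copy.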
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Tue, Jun 1, 2021 at 8:00 AM Linus Torvalds wrote:
> On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher wrote:
> >
> > Fix that by recognizing the self-recursion case.
>
> Hmm. I get the feeling that the self-recursion case should never have
> been allowed to happen in the first place.
>
> IOW, is there some reason why you can't make the user accesses always
> be done with page faults disabled (ie using the "atomic" user space
> access model), and then if you get a partial read (or write) to user
> space, at that point you drop the locks in read/write, do the "try to
> make readable/writable" and try again.
>
> IOW, none of this "detect recursion" thing. Just "no recursion in the
> first place".
>
> That way you'd not have these odd rules at fault time at all, because
> a fault while holding a lock would never get to the filesystem at all,
> it would be aborted early. And you'd not have any odd "inner/outer"
> locks, or lock compatibility rules or anything like that. You'd
> literally have just "oh, I didn't get everything at RW time while I
> held locks, so let's drop the locks, try to access user space, and
> retry".

Well, iomap_file_buffered_write() does that by using
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as in
iomap_write_actor(), but the read and direct I/O side doesn't seem to
have equivalents. I suspect we can't just wrap
generic_file_read_iter() and iomap_dio_rw() calls in
pagefault_disable().

> Wouldn't that be a lot simpler and more robust?

Sure, with vfs primitives that support atomic user-space access and with
an iov_iter_fault_in_writeable() like operation, we could do that.

> Because what if the mmap is something a bit more complex, like
> overlayfs or userfaultfd, and completing the fault isn't about gfs2
> handling it as a "fault", but about some *other* entity calling back
> to gfs2 and doing a read/write instead? Now all your "inner/outer"
> lock logic ends up being entirely pointless, as far as I can tell, and
> you end up deadlocking on the lock you are holding over the user space
> access _anyway_.

Yes, those kinds of deadlocks would still be possible. Until we have a
better solution, wouldn't it make sense to at least prevent those
self-recursion deadlocks? I'll send a separate pull request in case you
find that acceptable.

Thanks,
Andreas
Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher wrote:
>
> Fix that by recognizing the self-recursion case.

Hmm. I get the feeling that the self-recursion case should never have
been allowed to happen in the first place.

IOW, is there some reason why you can't make the user accesses always be
done with page faults disabled (ie using the "atomic" user space access
model), and then if you get a partial read (or write) to user space, at
that point you drop the locks in read/write, do the "try to make
readable/writable" and try again.

IOW, none of this "detect recursion" thing. Just "no recursion in the
first place".

That way you'd not have these odd rules at fault time at all, because a
fault while holding a lock would never get to the filesystem at all, it
would be aborted early. And you'd not have any odd "inner/outer" locks,
or lock compatibility rules or anything like that. You'd literally have
just "oh, I didn't get everything at RW time while I held locks, so
let's drop the locks, try to access user space, and retry".

Wouldn't that be a lot simpler and more robust?

Because what if the mmap is something a bit more complex, like overlayfs
or userfaultfd, and completing the fault isn't about gfs2 handling it as
a "fault", but about some *other* entity calling back to gfs2 and doing
a read/write instead? Now all your "inner/outer" lock logic ends up
being entirely pointless, as far as I can tell, and you end up
deadlocking on the lock you are holding over the user space access
_anyway_.

So I literally think that your approach is

 (a) too complicated

 (b) doesn't actually fix the issue in the more general case

But maybe I'm missing something.

                Linus
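The retry scheme Linus describes can be modelled as a small toy in userspace. Everything here is a made-up stand-in for illustration: `copy_atomic()` plays the role of a pagefault-disabled copy that stops short at un-faulted memory, `fault_in()` plays the role of fault-in performed with no locks held, and `lock_held` stands in for the inode glock.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* A source buffer where only the first faulted_in bytes are "mapped";
 * copying past that point returns a short count, like a copy done
 * under pagefault_disable(). */
struct buf {
	char data[64];
	size_t faulted_in;
};

static size_t copy_atomic(char *dst, const struct buf *src,
			  size_t off, size_t len)
{
	size_t end = off + len;
	size_t lim = src->faulted_in < end ? src->faulted_in : end;
	size_t n = lim > off ? lim - off : 0;

	memcpy(dst, src->data + off, n);
	return n;		/* short copy past the faulted-in region */
}

static void fault_in(struct buf *b, size_t len)
{
	if (b->faulted_in < len)
		b->faulted_in = len;	/* in the toy, faulting always works */
}

static bool lock_held;	/* stand-in for the inode glock */

/* Linus's loop: copy with faults disabled under the lock; on a short
 * copy, drop the lock, fault the buffer in, and retry. */
static size_t do_write(char *dst, struct buf *src, size_t len)
{
	size_t done = 0;

	while (done < len) {
		lock_held = true;	/* take the "glock" */
		done += copy_atomic(dst + done, src, done, len - done);
		lock_held = false;	/* drop it before faulting */
		if (done < len)
			fault_in(src, len);	/* no locks held here */
	}
	return done;
}

/* Scenario: only 4 bytes initially faulted in, so the first pass is
 * short and the loop must fault in and retry. */
static int demo(void)
{
	struct buf src = { .data = "hello, world", .faulted_in = 4 };
	char dst[64] = { 0 };

	return do_write(dst, &src, 12) == 12 &&
	       memcmp(dst, "hello, world", 12) == 0;
}
```

The key property, mirrored from the mail above, is that `fault_in()` only ever runs while `lock_held` is false, so no fault can re-enter the filesystem under its own lock.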
[Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
When the buffer passed to a read or write system call is memory mapped
to the same file, a page fault can occur in filemap_fault. In that
case, the task will already be holding the inode glock, and trying to
take the same lock again will result in a BUG in add_to_queue().

Fix that by recognizing the self-recursion case. Either skip the lock
taking (when the glock is held in a compatible way), or fail the
operation.

Likewise, a request to un-share a copy-on-write page can *probably*
happen in similar situations, so treat the locking in gfs2_page_mkwrite
in the same way.

A future patch will handle these cases more gracefully by retrying
operations instead of failing them, along with addressing more complex
deadlock scenarios.

Reported-by: Jan Kara
Signed-off-by: Andreas Gruenbacher
---
 fs/gfs2/file.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6d77743f11a4..7d88abb4629b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -423,6 +423,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_alloc_parms ap = { .aflags = 0, };
 	u64 offset = page_offset(page);
@@ -436,10 +437,18 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 
 	/* Check page index against inode size */
@@ -540,7 +549,8 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 out_quota_unlock:
 	gfs2_quota_unlock(ip);
 out_unlock:
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	if (ret == VM_FAULT_LOCKED) {
@@ -555,6 +565,7 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
 	u16 state;
@@ -562,13 +573,22 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
 	gfs2_holder_init(ip->i_gl, state, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, state)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 	ret = filemap_fault(vmf);
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return ret;
-- 
2.26.3