On Mon, May 18, 2026 at 4:26 AM Barry Song <[email protected]> wrote: > > On Mon, May 18, 2026 at 5:47 PM Lorenzo Stoakes <[email protected]> wrote: > > > > On Sun, May 17, 2026 at 04:45:15PM +0800, Barry Song wrote: > > > On Sat, May 2, 2026 at 1:58 AM Matthew Wilcox <[email protected]> wrote: > > > > > > > > On Sat, May 02, 2026 at 01:44:34AM +0800, Barry Song wrote: > > > > > On Fri, May 1, 2026 at 10:57 PM Matthew Wilcox <[email protected]> > > > > > wrote: > > > > > > > > > > > > On Fri, May 01, 2026 at 06:49:58AM +0800, Barry Song wrote: > > > > > > > 1. There is no deterministic latency for I/O completion. It > > > > > > > depends on > > > > > > > both the hardware and the software stack (bio/request queues and > > > > > > > the > > > > > > > block scheduler). Sometimes the latency is short; at other times > > > > > > > it can > > > > > > > be quite long. In such cases, a high-priority thread performing > > > > > > > operations > > > > > > > such as mprotect, unmap, prctl_set_vma, or madvise may be forced > > > > > > > to wait > > > > > > > for an unpredictable amount of time. > > > > > > > > > > > > But does that actually happen? I find it hard to believe that > > > > > > thread A > > > > > > unmaps a VMA while thread B is in the middle of taking a page fault > > > > > > in > > > > > > that same VMA. mprotect() and madvise() are more likely to happen, > > > > > > but > > > > > > it still seems really unlikely to me. > > > > > > > > > > It doesn’t have to involve unmapping or applying mprotect to > > > > > the entire VMA—just a portion of it is sufficient. > > > > > > > > Yes, but that still fails to answer "does this actually happen". How > > > > much > > > > performance is all this complexity in the page fault handler buying us? > > > > If you don't answer this question, I'm just going to go in and rip it > > > > all out. > > > > > > > > > > Hi Matthew (and Lorenzo, Jan, and anyone else who may be > > > waiting for answers), > > > > > > As promised during LSF/MM/BPF, we conducted thorough > > > testing on Android phones to determine whether performing > > > I/O in `filemap_fault()` can block `vma_start_write()`. > > > I wanted to give a quick update on this question. > > > > > > Nanzhe at Xiaomi created tracing scripts and ran various > > > applications on Android devices with I/O performed under > > > the VMA lock in `filemap_fault()`. We found that: > > > > > > 1. There are very few cases where unmap() is blocked by > > > page faults. I assume this is due to buggy user code > > > or poor synchronization between reads and unmap(). > > > So I assume it is not a problem. > > > > > > 2. We observed many cases where `vma_start_write()` > > > is blocked by page-fault I/O in some applications. > > > The blocking occurs in the `dup_mmap()` path during > > > fork(). > > > > > > With Suren's commit fb49c455323ff ("fork: lock VMAs of > > > the parent process when forking"), we now always hold > > > `vma_write_lock()` for each VMA. Note that the > > > `mmap_lock` write lock is also held, which could lead to > > > chained waiting if page-fault I/O is performed without > > > releasing the VMA lock. > > > > Hm but did you observe this 'chained waiting'? And what were the latencies? > > We have clearly observed that the `fork()` operations of many > popular Android apps, such as iQiyi, Baidu Tieba, and 10086, > end up waiting on page-fault (PF) I/O when the VMA lock is > held during I/O operations. This has already become a > practical issue. I also believe this can lead to chained > waiting, since the global `mmap_lock` blocks all threads that > need to acquire it. > > > > > > > > > > My gut feeling is that Suren's commit may be overshooting, > > > so my rough idea is that we might want to do something like > > > the following (we haven't tested it yet and it might be > > > wrong): > > > > Yeah I'm really not sure about that. > > > > Prior to the VMA locks, the mmap write lock would have guaranteed no > > concurrent > > page faults, which is really what fb49c455323ff is about. > > > > So Suren's patch was essentially restoring the _existing_ forking > > behaviour, and > > now you're saying 'let's change the forking behaviour that's been like that > > for > > forever'. > > > I am afraid not. Before we introduced the per-VMA lock, we > were not performing I/O while holding `mmap_lock`. A page fault > that needed I/O would drop the `mmap_lock` read lock and allow > `fork()` to proceed. > > Now, you are suggesting performing I/O while holding the VMA > lock, which changes the requirements and introduces this > problem. > > > > > I think you would _really_ have to be sure that's safe. And forking is a > > very > > dangerous time in terms of complexity and sensitivity and 'weird stuff' > > happening so I'd tread _very_ carefully here. > > Yep. I think my original proposal did not require any changes > to `fork()`, since it simply preserved the current behavior of > dropping the VMA lock before performing I/O. In that model, > `fork()` would not end up waiting on I/O at all. > > What you are suggesting now appears to be performing I/O while > holding the VMA lock, which in turn introduces the need to > change `fork()`. > > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 2311ae7c2ff4..5ddaf297f31a 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -1762,7 +1762,13 @@ __latent_entropy int dup_mmap(struct mm_struct > > > *mm, struct mm_struct *oldmm) > > > for_each_vma(vmi, mpnt) { > > > struct file *file; > > > > > > - retval = vma_start_write_killable(mpnt); > > > + /* > > > + * For anonymous or writable private VMAs, prevent > > > + * concurrent CoW faults. > > > + */ > > > > To nit pick I think the comment's confusing but also tells you you don't > > need to > > specific anon check - writable private is sufficient. And it's not really > > just > > CoW that's the issue, it's anon_vma population _at all_ as well as CoW. > > > > > + if (!mpnt->vm_file || (!(mpnt->vm_flags & VM_SHARED) && > > > + (mpnt->vm_flags & VM_WRITE))) > > > + retval = vma_start_write_killable(mpnt); > > > > I think this has to be VM_MAYWRITE, because somebody could otherwise > > mprotect() > > it R/W. > > > > I also don't understand why !mpnt->vm_file for a read-only anon mapping > > (more > > likely PROT_NONE) is here, just do the second check? > > > > (Also please use the new interface, so !vma_test(mpnt, VMA_SHARED_BIT) && > > vma_test(mpnt, VMA_MAYWRITE_BIT)) > > Yep, I can definitely refine the check further. But before > doing that, I'd first like to confirm that we are aligned on > the direction. > > If you still intend to hold the VMA lock while performing I/O, > then I think we should fix `fork()` to avoid taking > `vma_start_write()`. > > > > > > if (retval < 0) > > > goto loop_out; > > > if (mpnt->vm_flags & VM_DONTCOPY) { > > > > > > Based on the above, we may want to re-check whether fork() > > > can be blocked by page faults. At the same time, if Suren, > > > you, or anyone else has any comments, please feel free to > > > share them. > > > > > > Best Regards > > > Barry > > > > Technical commentary above is sort of 'just cos' :) because I really > > question > > doing this honestly. > > I think we either need to fix `fork()`, or keep the current > behavior of dropping the VMA lock before performing I/O.
I see. So, this problem arises from the fact that we are changing the pagefaults requiring I/O operation to hold VMA lock... And you want to lock VMA on fork only if vma_is_anonymous(vma) || is_cow_mapping(vma->vm_flags). So, we will be blocking page faults for anonymous and COW VMAs only while holding mmap_write_lock, preventing any VMA modification. On the surface, that looks ok to me but I might be missing some corner cases. If nobody sees any obvious issues, I think it's worth a try. > > > > > I'd also like to get Suren's input, however. > > Yes. of course. > > > > > Thanks, Lorenzo > > Best Regards > Barry
