On Fri, Feb 25, 2022 at 01:33:33PM -0800, John Hubbard wrote:
> On 2/25/22 13:23, Theodore Ts'o wrote:
> > [un]pin_user_pages_remote is dirtying pages without properly warning
> > the file system in advance.  This was noted by Jan Kara in 2018[1] and
>
> In 2018, [un]pin_user_pages_remote did not exist. And so what Jan reported
> was actually that dio_bio_complete() was calling set_page_dirty_lock()
> on pages that were not (any longer) set up for that.
Fair enough, there are two problems that are getting conflated here,
and that's my bad.

The problem which Jan pointed out is one where the Direct I/O read
path triggered a page fault, so page_mkwrite() was actually called.
So in this case, the file system was notified, and the page was
marked dirty only after that notification.  But then the DIO read was
racing with the page cleaner, which would call writepage(), clean the
page, and then remove the buffer_heads.  Then dio_bio_complete()
would call set_page_dirty() a second time, and that's what would
trigger the BUG.

But in the syzbot reproducer, it's a different problem.  In this
case, process_vm_writev() is calling [un]pin_user_pages_remote(), and
page_mkwrite() is never getting called.  So there is no need to race
with the page cleaner, and so the BUG triggers much more reliably.

> > more recently has resulted in bug reports by Syzbot in various Android
> > kernels[2].
> >
> > This is technically a bug in mm/gup.c, but arguably ext4 is fragile in
>
> Is it, really? unpin_user_pages_dirty_lock() moved the set_page_dirty_lock()
> call into mm/gup.c, but that merely refactored things. The callers are
> all over the kernel, and those callers are what need changing in order
> to fix this.

From my perspective, the bug is calling set_page_dirty() without
first calling the file system's page_mkwrite().  That notification is
necessary since the file system needs to allocate data blocks in
preparation for a future writeback.

Now, calling page_mkwrite() by itself is not enough, since the moment
you make the page dirty, the page cleaner could go ahead and call
writepage() behind your back and clean it.  In actual practice, with
a Direct I/O read request racing with writeback, this race was quite
hard to hit, because that would imply that the background writepage()
call would have to complete ahead of the synchronous read request,
and the block layer generally prioritizes synchronous reads ahead of
background write requests.  So in practice, this race was ***very***
hard to hit.  Jan may have reported it in 2018, but I don't think
I've ever seen it happen myself.

For process_vm_writev(), the user pages are pinned and then released
in short order, so I suspect that the race with the page cleaner
would also be very hard to hit.

But we could completely remove the potential for the race, and also
make things kinder for f2fs's and btrfs's compressed file write
support, by making things work much like the write(2) system call.
Imagine if we had a "pin_user_pages_local()" which calls
write_begin(), and an "unpin_user_pages_local()" which calls
write_end().  The presumption with the "[un]pin_user_pages_local" API
would be that you don't hold the pinned pages for very long --- say,
not across a system call boundary.  Then it would work much the same
way the write(2) system call does, except that in the case of
process_vm_writev(2) the pages are identified by another process's
address space where they happen to be mapped.  (See the rough sketch
appended below.)

This obviously doesn't work when pinning pages for remote DMA,
because in that case the time between pin_user_pages_remote() and
unpin_user_pages_remote() could be a long, long time, so that means
we can't use write_begin/write_end; we'd need to call page_mkwrite()
when the pages are first pinned and then somehow prevent the page
cleaner from touching a dirty page which is pinned for use by the
remote DMA.

Does that make sense?

						- Ted
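P.S.  Here is a very rough sketch of the sort of API shape I have in
mind.  None of this exists today --- pin_user_pages_local(),
unpin_user_pages_local(), and the fsdata hand-off are all made up for
illustration, and the real write_begin/write_end details (the struct
file argument, flags, error handling, copies that cross a page
boundary, the fs handing back a different page) are glossed over:

#include <linux/highmem.h>	/* kmap_local_page() / kunmap_local() */
#include <linux/mm.h>

/*
 * Illustration only: pin one file-backed page of another process's
 * address space, copy into it, and release it, with the file system
 * notified the same way the buffered write(2) path notifies it.
 */
static ssize_t copy_to_remote_page(struct mm_struct *mm, unsigned long addr,
				   const void *src, size_t len)
{
	struct page *page;
	void *fsdata;		/* handed from write_begin to write_end */
	void *kaddr;
	long ret;

	/*
	 * Hypothetical: like pin_user_pages_remote(mm, addr, 1,
	 * FOLL_WRITE, ...), except that for a file-backed page it
	 * would also call mapping->a_ops->write_begin(), so the file
	 * system can allocate blocks and attach buffer heads *before*
	 * the page gets dirtied.
	 */
	ret = pin_user_pages_local(mm, addr, 1, FOLL_WRITE, &page, &fsdata);
	if (ret < 0)
		return ret;

	kaddr = kmap_local_page(page);
	memcpy(kaddr + offset_in_page(addr), src, len);
	kunmap_local(kaddr);

	/*
	 * Hypothetical: calls mapping->a_ops->write_end(), which marks
	 * the page dirty with the file system fully prepared for a
	 * future writeback, and then drops the pin.  The whole
	 * pin/copy/unpin sequence stays short --- certainly not held
	 * across a system call boundary.
	 */
	unpin_user_pages_local(&page, 1, len, fsdata);
	return len;
}

Again, just a sketch --- the point is only that with a short-lived
pin, the file system would see the same write_begin/write_end
sequence it already handles for a buffered write(2), so there would
be nothing new for it to deal with.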