Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Tue, Jul 27, 2021 at 4:14 AM Andreas Gruenbacher wrote: > > On Tue, Jul 27, 2021 at 11:30 AM David Laight wrote: > > > > Is it actually worth doing any more than ensuring the first byte > > of the buffer is paged in before entering the block that has > > to disable page faults? > > We definitely do want to process as many pages as we can, especially > if allocations are involved during a write. Yeah, from an efficiency standpoint, once you start walking page tables, it's probably best to just handle as much as you can. But once you get an error, I don't think it should be "everything is bad". This is a bit annoying, because while *most* users really just want that "everything is good", *some* users might just want to handle the partial success case. It's why "copy_to/from_user()" returns the number of bytes *not* written, rather than -EFAULT like get/put_user(). 99% of all users just want to know "did I write all bytes" (and then checking for a zero return is a simple and cheap verification of "everything was ok"). But then very occasionally, you hit a case where you actually want to know how much of a copy worked. It's rare, but it happens, and the read/write system calls tend to be the main user of it. And yes, the fact that "copy_to/from_user()" doesn't return an error (like get/put_user() does) has confused people many times over the years. It's annoying, but it's required by those (few) users that really do want to handle that partial case. I think this iov_iter_fault_in_readable/writeable() case should do the same. And no, it's not new to Andreas' patch. iov_iter_fault_in_readable() is doing the "everything has to be good" thing already. Which maybe implies that nobody cares about partial reads/writes. Or it's very very rare - I've seen code that handles page faults in user space, but it's admittedly been some very special CPU simulator/emulator checkpointing stuff. Linus
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Tue, Jul 27, 2021 at 11:30 AM David Laight wrote: > From: Linus Torvalds > > Sent: 24 July 2021 20:53 > > > > On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher > > wrote: > > > > > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes) > > > +{ > > ... > > > + if (fault_in_user_pages(start, len, true) != len) > > > + return -EFAULT; > > > > Looking at this once more, I think this is likely wrong. > > > > Why? > > > > Because any user can/should only care about at least *part* of the > > area being writable. > > > > Imagine that you're doing a large read. If the *first* page is > > writable, you should still return the partial read, not -EFAULT. > > My 2c... > > Is it actually worth doing any more than ensuring the first byte > of the buffer is paged in before entering the block that has > to disable page faults? We definitely do want to process as many pages as we can, especially if allocations are involved during a write. > Most of the all the pages are present so the IO completes. That's not guaranteed. There are cases in which none of the pages are present, and then there are cases in which only the first page is present (for example, because of a previous access that wasn't page aligned). > The pages can always get unmapped (due to page pressure or > another application thread unmapping them) so there needs > to be a retry loop. > Given the cost of actually faulting in a page going around > the outer loop may not matter. > Indeed, if an application has just mmap()ed in a very large > file and is then doing a write() from it then it is quite > likely that the pages got unmapped! > > Clearly there needs to be extra code to ensure progress is made. > This might actually require the use of 'bounce buffers' > for really problematic user requests. I'm not sure if repeated unmapping of the pages that we've just faulted in is going to be a problem (in terms of preventing progress). But a suitable heuristic might be to shrink the fault-in "window" on each retry until it's only one page. > I also wonder what actually happens for pipes and fifos. > IIRC reads and write of up to PIPE_MAX (typically 4096) > are expected to be atomic. > This should be true even if there are page faults part way > through the copy_to/from_user(). > > It has to be said I can't see any reference to PIPE_MAX > in the linux man pages, but I'm sure it is in the POSIX/TOG > spec. > > David Thanks, Andreas
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
From: Linus Torvalds > Sent: 24 July 2021 20:53 > > On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher > wrote: > > > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes) > > +{ > ... > > + if (fault_in_user_pages(start, len, true) != len) > > + return -EFAULT; > > Looking at this once more, I think this is likely wrong. > > Why? > > Because any user can/should only care about at least *part* of the > area being writable. > > Imagine that you're doing a large read. If the *first* page is > writable, you should still return the partial read, not -EFAULT. My 2c... Is it actually worth doing any more than ensuring the first byte of the buffer is paged in before entering the block that has to disable page faults? Most of the all the pages are present so the IO completes. The pages can always get unmapped (due to page pressure or another application thread unmapping them) so there needs to be a retry loop. Given the cost of actually faulting in a page going around the outer loop may not matter. Indeed, if an application has just mmap()ed in a very large file and is then doing a write() from it then it is quite likely that the pages got unmapped! Clearly there needs to be extra code to ensure progress is made. This might actually require the use of 'bounce buffers' for really problematic user requests. I also wonder what actually happens for pipes and fifos. IIRC reads and write of up to PIPE_MAX (typically 4096) are expected to be atomic. This should be true even if there are page faults part way through the copy_to/from_user(). It has to be said I can't see any reference to PIPE_MAX in the linux man pages, but I'm sure it is in the POSIX/TOG spec. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sun, Jul 25, 2021 at 12:06:41AM +0200, Andreas Gruenbacher wrote: > On Sat, Jul 24, 2021 at 11:57 PM Al Viro wrote: > > On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote: > > > > > Hmm, how could we have sub-page failure areas when this is about if > > > and how pages are mapped? If we return the number of bytes that are > > > accessible, then users will know if they got nothing, something, or > > > everything, and they can act accordingly. > > > > What I'm saying is that in situation when you have cacheline-sized > > poisoned areas, there's no way to get an accurate count of readable > > area other than try and copy it out. > > > > What's more, "something" is essentially useless information - the > > pages might get unmapped right as your function returns; the caller > > still needs to deal with partial copies. And that's a slow path > > by definition, so informing them of a partial fault-in is not > > going to be useful. > > > > As far as callers are concerned, it's "nothing suitable in the > > beginning of the area" vs. "something might be accessible". > > Yes, and the third case would be "something might be accessible, but > not all of it". There probably are callers that give up when they > don't have it all. Who cares? Again, 1) those callers *still* have to cope with copyin/copyout failures halfway through. Fully successful fault-in does not guarantee anything whatsoever. IOW, you won't get rid of any complexity that way. 2) earlier bailout in rare error case is not worth bothering with. If you'd been given an iov_iter spanning an unmapped/unreadable/unwritable area of user memory, it's either a fucking rare race with truncate() of an mmapped file or a pilot error. Neither case is worth optimizing for. The difference between partially accessible and completely accessible at the fault-in time is useless for callers. Really.
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 11:57 PM Al Viro wrote: > On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote: > > > Hmm, how could we have sub-page failure areas when this is about if > > and how pages are mapped? If we return the number of bytes that are > > accessible, then users will know if they got nothing, something, or > > everything, and they can act accordingly. > > What I'm saying is that in situation when you have cacheline-sized > poisoned areas, there's no way to get an accurate count of readable > area other than try and copy it out. > > What's more, "something" is essentially useless information - the > pages might get unmapped right as your function returns; the caller > still needs to deal with partial copies. And that's a slow path > by definition, so informing them of a partial fault-in is not > going to be useful. > > As far as callers are concerned, it's "nothing suitable in the > beginning of the area" vs. "something might be accessible". Yes, and the third case would be "something might be accessible, but not all of it". There probably are callers that give up when they don't have it all. Andreas
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote: > Hmm, how could we have sub-page failure areas when this is about if > and how pages are mapped? If we return the number of bytes that are > accessible, then users will know if they got nothing, something, or > everything, and they can act accordingly. What I'm saying is that in situation when you have cacheline-sized poisoned areas, there's no way to get an accurate count of readable area other than try and copy it out. What's more, "something" is essentially useless information - the pages might get unmapped right as your function returns; the caller still needs to deal with partial copies. And that's a slow path by definition, so informing them of a partial fault-in is not going to be useful. As far as callers are concerned, it's "nothing suitable in the beginning of the area" vs. "something might be accessible".
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 10:24 PM Al Viro wrote: > On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote: > > ... > > > + if (fault_in_user_pages(start, len, true) != len) > > > + return -EFAULT; > > > > Looking at this once more, I think this is likely wrong. > > > > Why? > > > > Because any user can/should only care about at least *part* of the > > area being writable. > > > > Imagine that you're doing a large read. If the *first* page is > > writable, you should still return the partial read, not -EFAULT. > > Agreed. > > > So I think the code needs to return 0 if _any_ fault was successful. > > s/any/the first/... > > The same goes for fault-in for read, of course; I've a half-baked conversion > to such semantics (-EFAULT vs. 0; precise length is unreliable anyway, > especially if you have sub-page failure areas), need to finish and post > it... Hmm, how could we have sub-page failure areas when this is about if and how pages are mapped? If we return the number of bytes that are accessible, then users will know if they got nothing, something, or everything, and they can act accordingly. Thanks, Andreas
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 1:26 PM Al Viro wrote: > > On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote: > > > So I think the code needs to return 0 if _any_ fault was successful. > > s/any/the first/... Yes, but as long as we do them in order, and stop when it fails, "any" and "first" end up being the same thing ;) > The same goes for fault-in for read Yeah. That said, a partial write() (ie "read from user space") might be something that a filesystem is not willing to touch for other reasons, so I think returning -EFAULT in that case is, I think, slightly more reasonable. Things like "I have to prepare buffers to be filled" etc. The partial read() case is at least something that a filesystem should be able to handle fairly easily. And I don't think returning -EFAULT early is necessarily wrong - we obviously do it anyway if you give really invalid addresses. But we have generally strived to allow partial IO for missing pages, because people sometimes play games with unmapping things dynamically or using mprotect() to catch modifications (ie doing things like catch SIGSEGV/SIGBUS, and remap it). But those kinds of uses tend to have to catch -EFAULT anyway, so I guess it's not a big deal either way. Linus
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote: > ... > > + if (fault_in_user_pages(start, len, true) != len) > > + return -EFAULT; > > Looking at this once more, I think this is likely wrong. > > Why? > > Because any user can/should only care about at least *part* of the > area being writable. > > Imagine that you're doing a large read. If the *first* page is > writable, you should still return the partial read, not -EFAULT. Agreed. > So I think the code needs to return 0 if _any_ fault was successful. s/any/the first/... The same goes for fault-in for read, of course; I've a half-baked conversion to such semantics (-EFAULT vs. 0; precise length is unreliable anyway, especially if you have sub-page failure areas), need to finish and post it...
Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher wrote: > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes) > +{ ... > + if (fault_in_user_pages(start, len, true) != len) > + return -EFAULT; Looking at this once more, I think this is likely wrong. Why? Because any user can/should only care about at least *part* of the area being writable. Imagine that you're doing a large read. If the *first* page is writable, you should still return the partial read, not -EFAULT. So I think the code needs to return 0 if _any_ fault was successful. Or perhaps return how much it was able to fault in. Because returning -EFAULT if any of it failed seems wrong, and doesn't allow for partial success being reported. The other reaction I have is that you now only do the iov_iter_fault_in_writeable, but then you make fault_in_user_pages() still have that "bool write" argument. We already have 'fault_in_pages_readable()', and that one is more efficient (well, at least if the fault isn't needed it is). So it would make more sense to just implement fault_in_pages_writable() instead of that "fault_in_user_pages(, bool write)". Linus
[Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper
Introduce a new iov_iter_fault_in_writeable helper for faulting in an iterator for writing. The pages are faulted in manually, without writing to them, so this function is non-destructive. We'll use iov_iter_fault_in_writeable in gfs2 once we've determined that part of the iterator isn't in memory. Signed-off-by: Andreas Gruenbacher --- include/linux/mm.h | 3 +++ include/linux/uio.h | 1 + lib/iov_iter.c | 40 +++ mm/gup.c| 57 + 4 files changed, 101 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 7ca22e6e694a..14b1353995e2 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -1840,6 +1840,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int pin_user_pages_fast(unsigned long start, int nr_pages, unsigned int gup_flags, struct page **pages); +unsigned long fault_in_user_pages(unsigned long start, unsigned long len, + bool write); + int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc); int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc, struct task_struct *task, bool bypass_rlim); diff --git a/include/linux/uio.h b/include/linux/uio.h index 82c3c3e819e0..8e469b8b862f 100644 --- a/include/linux/uio.h +++ b/include/linux/uio.h @@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, void iov_iter_advance(struct iov_iter *i, size_t bytes); void iov_iter_revert(struct iov_iter *i, size_t bytes); int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes); +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes); size_t iov_iter_single_seg_count(const struct iov_iter *i); size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes, struct iov_iter *i); diff --git a/lib/iov_iter.c b/lib/iov_iter.c index 20dc3d800573..ccf1ee8d4edf 100644 --- a/lib/iov_iter.c +++ b/lib/iov_iter.c @@ -460,6 +460,46 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes) } EXPORT_SYMBOL(iov_iter_fault_in_readable); +/** + * iov_iter_fault_in_writeable - fault in an iov iterator for writing + * @i: iterator + * @bytes: maximum length + * + * Faults in part or all of the iterator. This is primarily useful when we + * already know that some or all of the pages in @i aren't in memory. + * + * This function uses fault_in_user_pages() to fault in the pages, which + * internally uses get_user_pages(), so it doesn't trigger hardware page + * faults. Unlike fault_in_pages_writeable() which writes to the memory to + * fault it in, this function is non-destructive. + * + * Returns 0 on success, and a non-zero error code if the memory could not be + * accessed (i.e. because it is an invalid address). + */ +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes) +{ + if (iter_is_iovec(i)) { + const struct iovec *p; + size_t skip; + + if (bytes > i->count) + bytes = i->count; + for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) { + unsigned long len = min(bytes, p->iov_len - skip); + unsigned long start; + + if (unlikely(!len)) + continue; + start = (unsigned long)p->iov_base + skip; + if (fault_in_user_pages(start, len, true) != len) + return -EFAULT; + bytes -= len; + } + } + return 0; +} +EXPORT_SYMBOL(iov_iter_fault_in_writeable); + void iov_iter_init(struct iov_iter *i, unsigned int direction, const struct iovec *iov, unsigned long nr_segs, size_t count) diff --git a/mm/gup.c b/mm/gup.c index 42b8b1fa6521..784809c232f1 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -1669,6 +1669,63 @@ static long __get_user_pages_locked(struct mm_struct *mm, unsigned long start, } #endif /* !CONFIG_MMU */ +/** + * fault_in_user_pages - fault in an address range for reading / writing + * @start: start of address range + * @len: length of address range + * @write: fault in for writing + * + * Note that we don't pin or otherwise hold the pages referenced that we fault + * in. There's no guarantee that they'll stay in memory for any duration of + * time. + * + * Returns the number of bytes faulted in from @start. + */ +unsigned long fault_in_user_pages(unsigned long start, unsigned long len, + bool write) +{ + struct mm_struct *mm = current->mm; + struct vm_area_struct *vma = NULL; + unsigned long end, nstart, nend; + int locked = 0; + int gup_flags; + + gup_flags = FOLL_TOUCH | FOLL_POPULATE; + if (write) +