Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-27 Thread Linus Torvalds
On Tue, Jul 27, 2021 at 4:14 AM Andreas Gruenbacher  wrote:
>
> On Tue, Jul 27, 2021 at 11:30 AM David Laight  wrote:
> >
> > Is it actually worth doing any more than ensuring the first byte
> > of the buffer is paged in before entering the block that has
> > to disable page faults?
>
> We definitely do want to process as many pages as we can, especially
> if allocations are involved during a write.

Yeah, from an efficiency standpoint, once you start walking page
tables, it's probably best to just handle as much as you can.

But once you get an error, I don't think it should be "everything is bad".

This is a bit annoying, because while *most* users really just want
that "everything is good", *some* users might just want to handle the
partial success case.

It's why "copy_to/from_user()" returns the number of bytes *not*
written, rather than -EFAULT like get/put_user(). 99% of all users
just want to know "did I write all bytes" (and then checking for a
zero return is a simple and cheap verification of "everything was
ok").

But then very occasionally, you hit a case where you actually want to
know how much of a copy worked. It's rare, but it happens, and the
read/write system calls tend to be the main user of it.

And yes, the fact that "copy_to/from_user()" doesn't return an error
(like get/put_user() does) has confused people many times over the
years. It's annoying, but it's required by those (few) users that
really do want to handle that partial case.

I think this iov_iter_fault_in_readable/writeable() case should do the same.

And no, it's not new to Andreas' patch. iov_iter_fault_in_readable()
is doing the "everything has to be good" thing already.

Which maybe implies that nobody cares about partial reads/writes. Or
it's very very rare - I've seen code that handles page faults in user
space, but it's admittedly been some very special CPU
simulator/emulator checkpointing stuff.

   Linus



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-27 Thread Andreas Gruenbacher
On Tue, Jul 27, 2021 at 11:30 AM David Laight  wrote:
> From: Linus Torvalds
> > Sent: 24 July 2021 20:53
> >
> > On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
> >  wrote:
> > >
> > > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> > > +{
> > ...
> > > +   if (fault_in_user_pages(start, len, true) != len)
> > > +   return -EFAULT;
> >
> > Looking at this once more, I think this is likely wrong.
> >
> > Why?
> >
> > Because any user can/should only care about at least *part* of the
> > area being writable.
> >
> > Imagine that you're doing a large read. If the *first* page is
> > writable, you should still return the partial read, not -EFAULT.
>
> My 2c...
>
> Is it actually worth doing any more than ensuring the first byte
> of the buffer is paged in before entering the block that has
> to disable page faults?

We definitely do want to process as many pages as we can, especially
if allocations are involved during a write.

> Most of the all the pages are present so the IO completes.

That's not guaranteed. There are cases in which none of the pages are
present, and then there are cases in which only the first page is
present (for example, because of a previous access that wasn't page
aligned).

> The pages can always get unmapped (due to page pressure or
> another application thread unmapping them) so there needs
> to be a retry loop.
> Given the cost of actually faulting in a page going around
> the outer loop may not matter.
> Indeed, if an application has just mmap()ed in a very large
> file and is then doing a write() from it then it is quite
> likely that the pages got unmapped!
>
> Clearly there needs to be extra code to ensure progress is made.
> This might actually require the use of 'bounce buffers'
> for really problematic user requests.

I'm not sure if repeated unmapping of the pages that we've just
faulted in is going to be a problem (in terms of preventing progress).
But a suitable heuristic might be to shrink the fault-in "window" on
each retry until it's only one page.

> I also wonder what actually happens for pipes and fifos.
> IIRC reads and write of up to PIPE_MAX (typically 4096)
> are expected to be atomic.
> This should be true even if there are page faults part way
> through the copy_to/from_user().
>
> It has to be said I can't see any reference to PIPE_MAX
> in the linux man pages, but I'm sure it is in the POSIX/TOG
> spec.
>
> David

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-27 Thread David Laight
From: Linus Torvalds
> Sent: 24 July 2021 20:53
> 
> On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
>  wrote:
> >
> > +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> > +{
> ...
> > +   if (fault_in_user_pages(start, len, true) != len)
> > +   return -EFAULT;
> 
> Looking at this once more, I think this is likely wrong.
> 
> Why?
> 
> Because any user can/should only care about at least *part* of the
> area being writable.
> 
> Imagine that you're doing a large read. If the *first* page is
> writable, you should still return the partial read, not -EFAULT.

My 2c...

Is it actually worth doing any more than ensuring the first byte
of the buffer is paged in before entering the block that has
to disable page faults?

Most of the all the pages are present so the IO completes.

The pages can always get unmapped (due to page pressure or
another application thread unmapping them) so there needs
to be a retry loop.
Given the cost of actually faulting in a page going around
the outer loop may not matter.
Indeed, if an application has just mmap()ed in a very large
file and is then doing a write() from it then it is quite
likely that the pages got unmapped!

Clearly there needs to be extra code to ensure progress is made.
This might actually require the use of 'bounce buffers'
for really problematic user requests.

I also wonder what actually happens for pipes and fifos.
IIRC reads and write of up to PIPE_MAX (typically 4096)
are expected to be atomic.
This should be true even if there are page faults part way
through the copy_to/from_user().

It has to be said I can't see any reference to PIPE_MAX
in the linux man pages, but I'm sure it is in the POSIX/TOG
spec.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Al Viro
On Sun, Jul 25, 2021 at 12:06:41AM +0200, Andreas Gruenbacher wrote:
> On Sat, Jul 24, 2021 at 11:57 PM Al Viro  wrote:
> > On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:
> >
> > > Hmm, how could we have sub-page failure areas when this is about if
> > > and how pages are mapped? If we return the number of bytes that are
> > > accessible, then users will know if they got nothing, something, or
> > > everything, and they can act accordingly.
> >
> > What I'm saying is that in situation when you have cacheline-sized
> > poisoned areas, there's no way to get an accurate count of readable
> > area other than try and copy it out.
> >
> > What's more, "something" is essentially useless information - the
> > pages might get unmapped right as your function returns; the caller
> > still needs to deal with partial copies.  And that's a slow path
> > by definition, so informing them of a partial fault-in is not
> > going to be useful.
> >
> > As far as callers are concerned, it's "nothing suitable in the
> > beginning of the area" vs. "something might be accessible".
> 
> Yes, and the third case would be "something might be accessible, but
> not all of it". There probably are callers that give up when they
> don't have it all.

Who cares?  Again,
1) those callers *still* have to cope with copyin/copyout failures
halfway through.  Fully successful fault-in does not guarantee anything
whatsoever.  IOW, you won't get rid of any complexity that way.
2) earlier bailout in rare error case is not worth bothering with.
If you'd been given an iov_iter spanning an unmapped/unreadable/unwritable
area of user memory, it's either a fucking rare race with truncate() of
an mmapped file or a pilot error.  Neither case is worth optimizing for.

The difference between partially accessible and completely accessible
at the fault-in time is useless for callers.  Really.



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Andreas Gruenbacher
On Sat, Jul 24, 2021 at 11:57 PM Al Viro  wrote:
> On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:
>
> > Hmm, how could we have sub-page failure areas when this is about if
> > and how pages are mapped? If we return the number of bytes that are
> > accessible, then users will know if they got nothing, something, or
> > everything, and they can act accordingly.
>
> What I'm saying is that in situation when you have cacheline-sized
> poisoned areas, there's no way to get an accurate count of readable
> area other than try and copy it out.
>
> What's more, "something" is essentially useless information - the
> pages might get unmapped right as your function returns; the caller
> still needs to deal with partial copies.  And that's a slow path
> by definition, so informing them of a partial fault-in is not
> going to be useful.
>
> As far as callers are concerned, it's "nothing suitable in the
> beginning of the area" vs. "something might be accessible".

Yes, and the third case would be "something might be accessible, but
not all of it". There probably are callers that give up when they
don't have it all.

Andreas



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Al Viro
On Sat, Jul 24, 2021 at 11:38:20PM +0200, Andreas Gruenbacher wrote:

> Hmm, how could we have sub-page failure areas when this is about if
> and how pages are mapped? If we return the number of bytes that are
> accessible, then users will know if they got nothing, something, or
> everything, and they can act accordingly.

What I'm saying is that in situation when you have cacheline-sized
poisoned areas, there's no way to get an accurate count of readable
area other than try and copy it out.

What's more, "something" is essentially useless information - the
pages might get unmapped right as your function returns; the caller
still needs to deal with partial copies.  And that's a slow path
by definition, so informing them of a partial fault-in is not
going to be useful.

As far as callers are concerned, it's "nothing suitable in the
beginning of the area" vs. "something might be accessible".



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Andreas Gruenbacher
On Sat, Jul 24, 2021 at 10:24 PM Al Viro  wrote:
> On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
> > ...
> > > +   if (fault_in_user_pages(start, len, true) != len)
> > > +   return -EFAULT;
> >
> > Looking at this once more, I think this is likely wrong.
> >
> > Why?
> >
> > Because any user can/should only care about at least *part* of the
> > area being writable.
> >
> > Imagine that you're doing a large read. If the *first* page is
> > writable, you should still return the partial read, not -EFAULT.
>
> Agreed.
>
> > So I think the code needs to return 0 if _any_ fault was successful.
>
> s/any/the first/...
>
> The same goes for fault-in for read, of course; I've a half-baked conversion
> to such semantics (-EFAULT vs. 0; precise length is unreliable anyway,
> especially if you have sub-page failure areas), need to finish and post
> it...

Hmm, how could we have sub-page failure areas when this is about if
and how pages are mapped? If we return the number of bytes that are
accessible, then users will know if they got nothing, something, or
everything, and they can act accordingly.

Thanks,
Andreas



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Linus Torvalds
On Sat, Jul 24, 2021 at 1:26 PM Al Viro  wrote:
>
> On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
>
> > So I think the code needs to return 0 if _any_ fault was successful.
>
> s/any/the first/...

Yes, but as long as we do them in order, and stop when it fails, "any"
and "first" end up being the same thing ;)

> The same goes for fault-in for read

Yeah. That said, a partial write() (ie "read from user space") might
be something that a filesystem is not willing to touch for other
reasons, so I think returning -EFAULT in that case is, I think,
slightly more reasonable.

Things like "I have to prepare buffers to be filled" etc.

The partial read() case is at least something that a filesystem should
be able to handle fairly easily.

And I don't think returning -EFAULT early is necessarily wrong - we
obviously do it anyway if you give really invalid addresses.

But we have generally strived to allow partial IO for missing pages,
because people sometimes play games with unmapping things dynamically
or using mprotect() to catch modifications (ie doing things like catch
SIGSEGV/SIGBUS, and remap it).

But those kinds of uses tend to have to catch -EFAULT anyway, so I
guess it's not a big deal either way.

   Linus



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Al Viro
On Sat, Jul 24, 2021 at 12:52:34PM -0700, Linus Torvalds wrote:
> ...
> > +   if (fault_in_user_pages(start, len, true) != len)
> > +   return -EFAULT;
> 
> Looking at this once more, I think this is likely wrong.
> 
> Why?
> 
> Because any user can/should only care about at least *part* of the
> area being writable.
> 
> Imagine that you're doing a large read. If the *first* page is
> writable, you should still return the partial read, not -EFAULT.

Agreed.

> So I think the code needs to return 0 if _any_ fault was successful.

s/any/the first/...

The same goes for fault-in for read, of course; I've a half-baked conversion
to such semantics (-EFAULT vs. 0; precise length is unreliable anyway,
especially if you have sub-page failure areas), need to finish and post
it...



Re: [Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Linus Torvalds
On Sat, Jul 24, 2021 at 12:35 PM Andreas Gruenbacher
 wrote:
>
> +int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
> +{
...
> +   if (fault_in_user_pages(start, len, true) != len)
> +   return -EFAULT;

Looking at this once more, I think this is likely wrong.

Why?

Because any user can/should only care about at least *part* of the
area being writable.

Imagine that you're doing a large read. If the *first* page is
writable, you should still return the partial read, not -EFAULT.

So I think the code needs to return 0 if _any_ fault was successful.
Or perhaps return how much it was able to fault in. Because returning
-EFAULT if any of it failed seems wrong, and doesn't allow for partial
success being reported.

The other reaction I have is that you now only do the
iov_iter_fault_in_writeable, but then you make fault_in_user_pages()
still have that "bool write" argument.

We already have 'fault_in_pages_readable()', and that one is more
efficient (well, at least if the fault isn't needed it is). So it
would make more sense to just implement fault_in_pages_writable()
instead of that "fault_in_user_pages(, bool write)".

 Linus



[Cluster-devel] [PATCH v4 1/8] iov_iter: Introduce iov_iter_fault_in_writeable helper

2021-07-24 Thread Andreas Gruenbacher
Introduce a new iov_iter_fault_in_writeable helper for faulting in an iterator
for writing.  The pages are faulted in manually, without writing to them, so
this function is non-destructive.

We'll use iov_iter_fault_in_writeable in gfs2 once we've determined that part
of the iterator isn't in memory.

Signed-off-by: Andreas Gruenbacher 
---
 include/linux/mm.h  |  3 +++
 include/linux/uio.h |  1 +
 lib/iov_iter.c  | 40 +++
 mm/gup.c| 57 +
 4 files changed, 101 insertions(+)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7ca22e6e694a..14b1353995e2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1840,6 +1840,9 @@ int get_user_pages_fast(unsigned long start, int nr_pages,
 int pin_user_pages_fast(unsigned long start, int nr_pages,
unsigned int gup_flags, struct page **pages);
 
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+ bool write);
+
 int account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc);
 int __account_locked_vm(struct mm_struct *mm, unsigned long pages, bool inc,
struct task_struct *task, bool bypass_rlim);
diff --git a/include/linux/uio.h b/include/linux/uio.h
index 82c3c3e819e0..8e469b8b862f 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -120,6 +120,7 @@ size_t copy_page_from_iter_atomic(struct page *page, 
unsigned offset,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(const struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 20dc3d800573..ccf1ee8d4edf 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -460,6 +460,46 @@ int iov_iter_fault_in_readable(const struct iov_iter *i, 
size_t bytes)
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+/**
+ * iov_iter_fault_in_writeable - fault in an iov iterator for writing
+ * @i: iterator
+ * @bytes: maximum length
+ *
+ * Faults in part or all of the iterator.  This is primarily useful when we
+ * already know that some or all of the pages in @i aren't in memory.
+ *
+ * This function uses fault_in_user_pages() to fault in the pages, which
+ * internally uses get_user_pages(), so it doesn't trigger hardware page
+ * faults.  Unlike fault_in_pages_writeable() which writes to the memory to
+ * fault it in, this function is non-destructive.
+ *
+ * Returns 0 on success, and a non-zero error code if the memory could not be
+ * accessed (i.e. because it is an invalid address).
+ */
+int iov_iter_fault_in_writeable(const struct iov_iter *i, size_t bytes)
+{
+   if (iter_is_iovec(i)) {
+   const struct iovec *p;
+   size_t skip;
+
+   if (bytes > i->count)
+   bytes = i->count;
+   for (p = i->iov, skip = i->iov_offset; bytes; p++, skip = 0) {
+   unsigned long len = min(bytes, p->iov_len - skip);
+   unsigned long start;
+
+   if (unlikely(!len))
+   continue;
+   start = (unsigned long)p->iov_base + skip;
+   if (fault_in_user_pages(start, len, true) != len)
+   return -EFAULT;
+   bytes -= len;
+   }
+   }
+   return 0;
+}
+EXPORT_SYMBOL(iov_iter_fault_in_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
const struct iovec *iov, unsigned long nr_segs,
size_t count)
diff --git a/mm/gup.c b/mm/gup.c
index 42b8b1fa6521..784809c232f1 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -1669,6 +1669,63 @@ static long __get_user_pages_locked(struct mm_struct 
*mm, unsigned long start,
 }
 #endif /* !CONFIG_MMU */
 
+/**
+ * fault_in_user_pages - fault in an address range for reading / writing
+ * @start: start of address range
+ * @len: length of address range
+ * @write: fault in for writing
+ *
+ * Note that we don't pin or otherwise hold the pages referenced that we fault
+ * in.  There's no guarantee that they'll stay in memory for any duration of
+ * time.
+ *
+ * Returns the number of bytes faulted in from @start.
+ */
+unsigned long fault_in_user_pages(unsigned long start, unsigned long len,
+ bool write)
+{
+   struct mm_struct *mm = current->mm;
+   struct vm_area_struct *vma = NULL;
+   unsigned long end, nstart, nend;
+   int locked = 0;
+   int gup_flags;
+
+   gup_flags = FOLL_TOUCH | FOLL_POPULATE;
+   if (write)
+