[PATCH v2 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-04 Thread Al Viro
[now without a descriptor leak; it really needs testing, though]

Using drm_gem_prime_handle_to_fd() to set dmabuf up and insert it into
descriptor table, only to have it looked up by file descriptor and
remove it from descriptor table is not just too convoluted - it's
racy; another thread might have modified the descriptor table while
we'd been going through that song and dance.

It's not hard to fix - turn drm_gem_prime_handle_to_fd()
into a wrapper for a new helper that would simply return the
dmabuf, without messing with descriptor table.

Then kfd_mem_export_dmabuf() would simply use that new helper
and leave the descriptor table alone.

Signed-off-by: Al Viro 
---
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c 
b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
index 8975cf41a91a..793780bb819c 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd_gpuvm.c
@@ -25,7 +25,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 
 #include 
@@ -812,18 +811,13 @@ static int kfd_mem_export_dmabuf(struct kgd_mem *mem)
if (!mem->dmabuf) {
struct amdgpu_device *bo_adev;
struct dma_buf *dmabuf;
-   int r, fd;
 
bo_adev = amdgpu_ttm_adev(mem->bo->tbo.bdev);
-   r = drm_gem_prime_handle_to_fd(_adev->ddev, 
bo_adev->kfd.client.file,
+   dmabuf = drm_gem_prime_handle_to_dmabuf(_adev->ddev, 
bo_adev->kfd.client.file,
   mem->gem_handle,
mem->alloc_flags & KFD_IOC_ALLOC_MEM_FLAGS_WRITABLE ?
-  DRM_RDWR : 0, );
-   if (r)
-   return r;
-   dmabuf = dma_buf_get(fd);
-   close_fd(fd);
-   if (WARN_ON_ONCE(IS_ERR(dmabuf)))
+  DRM_RDWR : 0);
+   if (IS_ERR(dmabuf))
return PTR_ERR(dmabuf);
mem->dmabuf = dmabuf;
}
diff --git a/drivers/gpu/drm/drm_prime.c b/drivers/gpu/drm/drm_prime.c
index 03bd3c7bd0dc..467c7a278ad3 100644
--- a/drivers/gpu/drm/drm_prime.c
+++ b/drivers/gpu/drm/drm_prime.c
@@ -409,23 +409,9 @@ static struct dma_buf *export_and_register_object(struct 
drm_device *dev,
return dmabuf;
 }
 
-/**
- * drm_gem_prime_handle_to_fd - PRIME export function for GEM drivers
- * @dev: dev to export the buffer from
- * @file_priv: drm file-private structure
- * @handle: buffer handle to export
- * @flags: flags like DRM_CLOEXEC
- * @prime_fd: pointer to storage for the fd id of the create dma-buf
- *
- * This is the PRIME export function which must be used mandatorily by GEM
- * drivers to ensure correct lifetime management of the underlying GEM object.
- * The actual exporting from GEM object to a dma-buf is done through the
- * _gem_object_funcs.export callback.
- */
-int drm_gem_prime_handle_to_fd(struct drm_device *dev,
+struct dma_buf *drm_gem_prime_handle_to_dmabuf(struct drm_device *dev,
   struct drm_file *file_priv, uint32_t handle,
-  uint32_t flags,
-  int *prime_fd)
+  uint32_t flags)
 {
struct drm_gem_object *obj;
int ret = 0;
@@ -434,14 +420,14 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
mutex_lock(_priv->prime.lock);
obj = drm_gem_object_lookup(file_priv, handle);
if (!obj)  {
-   ret = -ENOENT;
+   dmabuf = ERR_PTR(-ENOENT);
goto out_unlock;
}
 
dmabuf = drm_prime_lookup_buf_by_handle(_priv->prime, handle);
if (dmabuf) {
get_dma_buf(dmabuf);
-   goto out_have_handle;
+   goto out;
}
 
mutex_lock(>object_name_lock);
@@ -463,7 +449,6 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
/* normally the created dma-buf takes ownership of the ref,
 * but if that fails then drop the ref
 */
-   ret = PTR_ERR(dmabuf);
mutex_unlock(>object_name_lock);
goto out;
}
@@ -478,34 +463,51 @@ int drm_gem_prime_handle_to_fd(struct drm_device *dev,
ret = drm_prime_add_buf_handle(_priv->prime,
   dmabuf, handle);
mutex_unlock(>object_name_lock);
-   if (ret)
-   goto fail_put_dmabuf;
-
-out_have_handle:
-   ret = dma_buf_fd(dmabuf, flags);
-   /*
-* We must _not_ remove the buffer from the handle cache since the newly
-* created dma buf is already linked in the global obj->dma_buf pointer,
-* and that is invariant as long as a userspace gem handle exists.
-* Closing the handle will clean out the cache anyway, so we don't leak.

Re: [PATCH 1/2][RFC] amdgpu: fix a race in kfd_mem_export_dmabuf()

2024-06-04 Thread Al Viro
On Tue, Jun 04, 2024 at 02:08:30PM -0400, Felix Kuehling wrote:
> > +int drm_gem_prime_handle_to_fd(struct drm_device *dev,
> > +  struct drm_file *file_priv, uint32_t handle,
> > +  uint32_t flags,
> > +  int *prime_fd)
> > +{
> > +   struct dma_buf *dmabuf;
> > +   int fd = get_unused_fd_flags(flags);
> > +
> > +   if (fd < 0)
> > +   return fd;
> > +
> > +   dmabuf = drm_gem_prime_handle_to_dmabuf(dev, file_priv, handle, flags);
> > +   if (IS_ERR(dmabuf))
> > +   return PTR_ERR(dmabuf);

That ought to be
if (IS_ERR(dmabuf)) {
put_unused_fd(fd);
return PTR_ERR(dmabuf);
}
or we'll leak an allocated descriptor.  Will fix and repost...


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-05 Thread Al Viro
On Sun, May 05, 2024 at 01:03:07PM -0700, Linus Torvalds wrote:
> On Sun, 5 May 2024 at 12:46, Al Viro  wrote:
> >
> > I've no problem with having epoll grab a reference, but if we make that
> > a universal requirement ->poll() instances can rely upon,
> 
> Al, we're note "making that a requirement".
> 
> It always has been.

Argh.   We keep talking past each other.

0.  special-cased ->f_count rule for ->poll() is a wart and it's
better to get rid of it.

1.  fs/eventpoll.c is a steaming pile of shit and I'd be glad to see
git rm taken to it.  Short of that, by all means, let's grab reference
in there around the call of vfs_poll() (see (0)).

2.  having ->poll() instances grab extra references to file passed
to them is not something that should be encouraged; there's a plenty
of potential problems, and "caller has it pinned, so we are fine with
grabbing extra refs" is nowhere near enough to eliminate those.

3.  dma-buf uses of get_file() are probably safe (epoll shite aside),
but they do look fishy.  That has nothing to do with epoll.


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-05 Thread Al Viro
On Sun, May 05, 2024 at 09:46:05AM -0700, Linus Torvalds wrote:

> WHY?
> 
> Why cannot you and Al just admit that the problem is in epoll. Always
> has been, always will be.

Nobody (well, nobody who'd ever read epoll) argues that epoll is not
a problem.

> The fact is, it's not dma-buf that is violating any rules.

Now, that is something I've a trouble with.  Use of get_file() in there
actually looks rather fishy, regardless of epoll.

At the very least it needs a comment discouraging other instances from
blindly copying this.  A reference to struct file pins down more than
driver-internal objects; if nothing else, it pins down a mount and
if you don't have SB_NOUSER on file_inode(file)->i_sb->s_flags, it's
really not a good idea.

What's more, the reason for that get_file() is, AFAICS, that nodes
we put into callback queue for fence(s) in question[*] are embedded
into dmabuf and we don't want them gone before the callbacks have
happened.  Which looks fishy - it would make more sense to cancel
these callbacks and drop the fence(s) in question from ->release().

I've no problem whatsoever with fs/eventpoll.c grabbing/dropping
file reference around vfs_poll() calls.  And I don't believe that
"try to grab" has any place in dma_buf_poll(); it's just that I'm not
happy about get_file() call being there in the first place.

Sure, the call of ->poll() can bloody well lead to references being
grabbed - by the pollwait callback, which the caller of ->poll()
is aware of.  It's ->poll() instance *itself* grabbing such references
with vfs_poll() caller having no idea what's going on that has
potential for being unpleasant.  And we can't constify 'file' argument
of ->poll() because of poll_wait(), so it's hard to catch those who
do that kind of thing; I've explicitly said so upthread, I believe.

But similar calls of get_file() in ->poll() instances (again, not
the ones that are made by pollwait callback) are something to
watch out for and having the caller pin struct file does not solve
the problem.

[*] at most one per direction, and I've no idea whether there can be more
than one signalling fence for given dmabuf) 


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-05 Thread Al Viro
On Sat, May 04, 2024 at 08:53:47AM -0700, Linus Torvalds wrote:

>   poll_wait
> -> __pollwait
>  -> get_file (*boom*)
> 
> but the boom is very small because the poll_wait() will be undone by
> poll_freewait(), and normally poll/select has held the file count
> elevated.

Not quite.  It's not that poll_wait() calls __pollwait(); it calls
whatever callback that caller of ->poll() has set for it.

__pollwait users (select(2) and poll(2), currently) must (and do) make
sure that refcount is elevated; others (and epoll is not the only one)
need to do whatever's right for their callbacks.

I've no problem with having epoll grab a reference, but if we make that
a universal requirement ->poll() instances can rely upon, we'd better
verify that *all* vfs_poll() are OK.  And that ought to go into
Documentation/filesystems/porting.rst ("callers of vfs_poll() must
make sure that file is pinned; ->poll() instances may rely upon that,
but they'd better be very careful about grabbing extra references themselves -
it's acceptable for files on internal mounts, but *NOT* for anything on
mountable filesystems.  Any instance that does it needs an explicit
comment telling not to reuse that blindly." or something along those
lines).

Excluding epoll, select/poll and callers that have just done fdget() and will
do fdput() after vfs_poll(), we have this:

drivers/vhost/vhost.c:213:  mask = vfs_poll(file, >table);
vhost_poll_start().  Might get interesting...  Calls working
with vq->kick as file seem to rely upon vq->mutex, but I'll need to
refresh my memories of that code to check if that's all we need - and
then there's vhost_net_enable_vq(), which also needs an audit.

fs/aio.c:1738:  mask = vfs_poll(req->file, ) & req->events;
fs/aio.c:1932:  mask = vfs_poll(req->file, ) & req->events;
aio_poll() and aio_poll_wake() resp.  req->file here is actually 
->ki_filp
of iocb that contains work as iocb->req.work; it get dropped only in
iocb_destroy(), which also frees iocb.  Any call that might've run into
req->file not pinned is already in UAF land.

io_uring/poll.c:303:req->cqe.res = vfs_poll(req->file, ) 
& req->apoll_events;
io_uring/poll.c:622:mask = vfs_poll(req->file, >pt) & poll->events;
Should have req->file pinned, but I'll need to RTFS a bit for
details.  That, or ask Jens to confirm...

net/9p/trans_fd.c:236:  ret = vfs_poll(ts->rd, pt);
net/9p/trans_fd.c:238:  ret = (ret & ~EPOLLOUT) | (vfs_poll(ts->wr, pt) 
& ~EPOLLIN);
p9_fd_poll(); ->rd and ->wr are pinned and won't get dropped until
p9_fd_close(), which frees ts immediately afterwards.  IOW, if we risk
being called with ->rd or ->wr not pinned, we are in UAF land already.
Incidentally, what the hell is this in p9_fd_open()?
 * It's technically possible for userspace or concurrent mounts to
 * modify this flag concurrently, which will likely result in a
 * broken filesystem. However, just having bad flags here should
 * not crash the kernel or cause any other sort of bug, so mark this
 * particular data race as intentional so that tooling (like KCSAN)
 * can allow it and detect further problems.
 */
Why not simply fix the race instead?  As in
spin_lock(>rd->f_lock);
ts->rd->f_flags |= O_NONBLOCK;
spin_unlock(>rd->f_lock);
and similar for ts->wr?  Sigh...


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 04:16:15PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 15:07, Al Viro  wrote:
> >
> > Suppose your program calls select() on a pipe and dmabuf, sees data to be 
> > read
> > from pipe, reads it, closes both pipe and dmabuf and exits.
> >
> > Would you expect that dmabuf file would stick around for hell knows how long
> > after that?  I would certainly be very surprised by running into that...
> 
> Why?
> 
> That's the _point_ of refcounts. They make the thing they refcount
> stay around until it's no longer referenced.
> 
> Now, I agree that dmabuf's are a bit odd in how they use a 'struct
> file' *as* their refcount, but hey, it's a specialty use. Unusual
> perhaps, but not exactly wrong.
> 
> I suspect that if you saw a dmabuf just have its own 'refcount_t' and
> stay around until it was done, you wouldn't bat an eye at it, and it's
> really just the "it uses a struct file for counting" that you are
> reacting to.

*IF* those files are on purely internal filesystem, that's probably
OK; do that with something on something mountable (char device,
sysfs file, etc.) and you have a problem with filesystem staying
busy.

I'm really unfamiliar with the subsystem; it might be OK with all
objects that use that for ->poll(), but that's definitely not a good
thing to see in ->poll() instance in general.  And code gets copied,
so there really should be a big fat comment about the reasons why
it's OK in this particular case.

Said that, it seems that a better approach might be to have
their ->release() cancel callbacks and drop fence references.
Note that they *do* have refcounts - on fences.  The file
(well, dmabuf, really) is pinned only to protect against the
situation when pending callback is still around.  And Kees'
observation about multiple fences is also interesting - we don't
get extra fput(), but only because we get events only from one
fence, which does look fishy...


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 03:46:25PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> > That means that the file will be released - and it means that you have
> > violated all the refcounting rules for poll().
> 
> I feel like I've been looking at this too long. I think I see another
> problem here, but with dmabuf even when epoll is fixed:
> 
> dma_buf_poll()
>   get_file(dmabuf->file)  /* f_count + 1 */
>   dma_buf_poll_add_cb()
>   dma_resv_for_each_fence ...
>   dma_fence_add_callback(fence, ..., dma_buf_poll_cb)
> 
> dma_buf_poll_cb()
>   ...
> fput(dmabuf->file);   /* f_count - 1 ... for each fence */
> 
> Isn't it possible to call dma_buf_poll_cb() (and therefore fput())
> multiple times if there is more than 1 fence? Perhaps I've missed a
> place where a single struct dma_resv will only ever signal 1 fence? But
> looking through dma_fence_signal_timestamp_locked(), I don't see
> anything about resv nor somehow looking into other fence cb_list
> contents...

At a guess,
r = dma_fence_add_callback(fence, >cb, dma_buf_poll_cb);
if (!r)
return true;

prevents that - it returns 0 on success and -E... on error;
insertion into the list happens only when it's returning 0,
so...


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 11:01:45PM +0100, Al Viro wrote:

> Having ->poll() instance itself grab reference is really asking for problem,
> even on the boxen that have CONFIG_EPOLL turned off.  select(2) is enough;
> it will take care of references grabbed by __pollwait(), but it doesn't
> know anything about the references driver has stashed hell knows where for
> hell knows how long.

Suppose your program calls select() on a pipe and dmabuf, sees data to be read
from pipe, reads it, closes both pipe and dmabuf and exits.

Would you expect that dmabuf file would stick around for hell knows how long
after that?  I would certainly be very surprised by running into that...


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 02:52:38PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:45, Al Viro  wrote:
> >
> > How do you get through eventpoll_release_file() while someone
> > has entered ep_item_poll()?
> 
> Doesn't matter.
> 
> Look, it's enough that the file count has gone down to zero. You may
> not even have gotten to eventpoll_release_file() yet - the important
> part is that you're on your *way* to it.
> 
> That means that the file will be released - and it means that you have
> violated all the refcounting rules for poll().
> 
> So I think you're barking up the wrong tree.

IMO there are several things in that mess (aside of epoll being what it is).

Trying to grab refcount as you do is fine; the comment is seriously
misleading, though - we *are* guaranteed that struct file hasn't hit 
->release(),
let alone getting freed and reused.

Having pollwait callback grab references is fine - and the callback belongs
to whoever's calling ->poll().

Having ->poll() instance itself grab reference is really asking for problem,
even on the boxen that have CONFIG_EPOLL turned off.  select(2) is enough;
it will take care of references grabbed by __pollwait(), but it doesn't
know anything about the references driver has stashed hell knows where for
hell knows how long.


Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 02:42:22PM -0700, Linus Torvalds wrote:
> On Fri, 3 May 2024 at 14:36, Al Viro  wrote:
> >
> > ... the last part is no-go - poll_wait() must be able to grab a reference
> > (well, the callback in it must)
> 
> Yeah. I really think that *poll* itself is doing everything right. It
> knows that it's called with a file pointer with a reference, and it
> adds its own references as needed.

Not really.  Note that select's __pollwait() does *NOT* leave a reference
at the mercy of driver - it's stuck into poll_table_entry->filp and
the poll_freewait() knows how to take those out.


dmabuf does something very different - it grabs the damn thing into
its private data structures and for all we know it could keep it for
a few hours, until some even materializes.


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 02:33:37PM -0700, Linus Torvalds wrote:

> Look at the hack in __ep_remove(): if it is concurrent with
> eventpoll_release_file(), it will hit this code
> 
> spin_lock(>f_lock);
> if (epi->dying && !force) {
> spin_unlock(>f_lock);
> return false;
> }
> 
> and not free the epi.

What does that have to do with ep_item_poll()?

eventpoll_release_file() itself calls __ep_remove().  Have that
happen while ep_item_poll() is running in another thread and
you've got a problem.

AFAICS, exclusion is on ep->mtx.  Callers of ep_item_poll() are
* __ep_eventpoll_poll() - grabs ->mtx
* ep_insert() - called under ->mtx
* ep_modify() - calls are under ->mtx
* ep_send_events() - grabs ->mtx

and eventpoll_release_file() grabs ->mtx around __ep_remove().

How do you get through eventpoll_release_file() while someone
has entered ep_item_poll()?


Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 10:11:09PM +0100, Al Viro wrote:
> On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> > 
> > Is this the right approach? It still feels to me like get_file() needs
> > to happen much earlier...
> 
> I don't believe it needs to happen at all.  The problem is not that
> ->release() can be called during ->poll() - it can't and it doesn't.
> It's that this instance of ->poll() is trying to extend the lifetime
> of that struct file, when it might very well be past the point of no
> return.
> 
> What we need is
>   * promise that ep_item_poll() won't happen after 
> eventpoll_release_file().
> AFAICS, we do have that.
>   * ->poll() not playing silly buggers.
> 
> As it is, dma_buf ->poll() is very suspicious regardless of that
> mess - it can grab reference to file for unspecified interval.
> Have that happen shortly before reboot and you are asking for failing
> umount.
> 
> ->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
> tempted to make ->poll() take const struct file * and see if there's
> anything else that would fall out.

... the last part is no-go - poll_wait() must be able to grab a reference
(well, the callback in it must)


Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 02:24:45PM -0700, Linus Torvalds wrote:
> Because even with perfectly normal "->poll()", and even with the
> ep_item_poll() happening *before* eventpoll_release_file(), you have
> this trivial race:
> 
>   ep_item_poll()
>  ->poll()
> 
> and *between* those two operations, another CPU does "close()", and
> that causes eventpoll_release_file() to be called, and now f_count
> goes down to zero while ->poll() is running.
> 
> So you do need to increment the file count around the ->poll() call, I feel.
> 
> Or, alternatively, you'd need to serialize with
> eventpoll_release_file(), but that would need to be some sleeping lock
> held over the ->poll() call.
> 
> > As it is, dma_buf ->poll() is very suspicious regardless of that
> > mess - it can grab reference to file for unspecified interval.
> 
> I think that's actually much preferable to what epoll does, which is
> to keep using files without having reference counts to them (and then
> relying on magically not racing with eventpoll_release_file().

eventpoll_release_file() calling __ep_remove() while ep_item_poll()
is something we need to avoid anyway - having epi freed under
ep_item_poll() would be a problem regardless of struct file
lifetime issues.


Re: [PATCH] epoll: try to be a _bit_ better about file lifetimes

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 02:11:30PM -0700, Linus Torvalds wrote:
> epoll is a mess, and does various invalid things in the name of
> performance.
> 
> Let's try to rein it in a bit. Something like this, perhaps?

> +/*
> + * The ffd.file pointer may be in the process of
> + * being torn down due to being closed, but we
> + * may not have finished eventpoll_release() yet.
> + *
> + * Technically, even with the atomic_long_inc_not_zero,
> + * the file may have been free'd and then gotten
> + * re-allocated to something else (since files are
> + * not RCU-delayed, they are SLAB_TYPESAFE_BY_RCU).

Can we get to ep_item_poll(epi, ...) after eventpoll_release_file()
got past __ep_remove()?  Because if we can, we have a worse problem -
epi freed under us.

If not, we couldn't possibly have reached ->release() yet, let
alone freeing anything.


Re: get_file() unsafe under epoll (was Re: [syzbot] [fs?] [io-uring?] general protection fault in __ep_remove)

2024-05-03 Thread Al Viro
On Fri, May 03, 2024 at 01:28:37PM -0700, Kees Cook wrote:
> 
> Is this the right approach? It still feels to me like get_file() needs
> to happen much earlier...

I don't believe it needs to happen at all.  The problem is not that
->release() can be called during ->poll() - it can't and it doesn't.
It's that this instance of ->poll() is trying to extend the lifetime
of that struct file, when it might very well be past the point of no
return.

What we need is
* promise that ep_item_poll() won't happen after 
eventpoll_release_file().
AFAICS, we do have that.
* ->poll() not playing silly buggers.

As it is, dma_buf ->poll() is very suspicious regardless of that
mess - it can grab reference to file for unspecified interval.
Have that happen shortly before reboot and you are asking for failing
umount.

->poll() must be refcount-neutral wrt file passed to it.  I'm seriously
tempted to make ->poll() take const struct file * and see if there's
anything else that would fall out.


Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-02 Thread Al Viro
On Thu, May 02, 2024 at 05:10:18PM -0700, Kees Cook wrote:

> But anyway, there needs to be a general "oops I hit 0"-aware form of
> get_file(), and it seems like it should just be get_file() itself...

... which brings back the question of what's the sane damage mitigation
for that.  Adding arseloads of never-exercised failure exits is generally
a bad idea - it's asking for bitrot and making the thing harder to review
in future.



Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-02 Thread Al Viro
On Thu, May 02, 2024 at 04:21:13PM -0700, Kees Cook wrote:
> On Fri, May 03, 2024 at 12:12:28AM +0100, Al Viro wrote:
> > On Thu, May 02, 2024 at 03:52:21PM -0700, Kees Cook wrote:
> > 
> > > As for semantics, what do you mean? Detecting dec-below-zero means we
> > > catch underflow, and detected inc-from-zero means we catch resurrection
> > > attempts. In both cases we avoid double-free, but we have already lost
> > > to a potential dangling reference to a freed struct file. But just
> > > letting f_count go bad seems dangerous.
> > 
> > Detected inc-from-zero can also mean an RCU lookup detecting a descriptor
> > in the middle of getting closed.  And it's more subtle than that, actually,
> > thanks to SLAB_TYPESAFE_BY_RCU for struct file.
> 
> But isn't that already handled by __get_file_rcu()? i.e. shouldn't it be
> impossible for a simple get_file() to ever see a 0 f_count under normal
> conditions?

For get_file() it is impossible.  The comment about semantics had been
about the sane ways to recover if such crap gets detected.

__get_file_rcu() is a separate story - consider the comment in there: 
* atomic_long_inc_not_zero() above provided a full memory
* barrier when we acquired a reference.
 *
 * This is paired with the write barrier from assigning to the
 * __rcu protected file pointer so that if that pointer still
 * matches the current file, we know we have successfully
 * acquired a reference to the right file.

and IIRC, refcount_t is weaker wrt barriers.



Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-02 Thread Al Viro
On Thu, May 02, 2024 at 03:52:21PM -0700, Kees Cook wrote:

> As for semantics, what do you mean? Detecting dec-below-zero means we
> catch underflow, and detected inc-from-zero means we catch resurrection
> attempts. In both cases we avoid double-free, but we have already lost
> to a potential dangling reference to a freed struct file. But just
> letting f_count go bad seems dangerous.

Detected inc-from-zero can also mean an RCU lookup detecting a descriptor
in the middle of getting closed.  And it's more subtle than that, actually,
thanks to SLAB_TYPESAFE_BY_RCU for struct file.



Re: [PATCH 5/5] fs: Convert struct file::f_count to refcount_long_t

2024-05-02 Thread Al Viro
On Thu, May 02, 2024 at 03:33:40PM -0700, Kees Cook wrote:
> Underflow of f_count needs to be more carefully detected than it
> currently is. The results of get_file() should be checked, but the
> first step is detection. Redefine f_count from atomic_long_t to
> refcount_long_t.

It is used on fairly hot paths.  What's more, it's not
at all obvious what the hell would right semantics be.

NAKed-by: Al Viro 



Re: [PATCH v2] fs: clean up usage of noop_dirty_folio

2023-08-28 Thread Al Viro
On Mon, Aug 28, 2023 at 03:54:49PM +0800, Xueshi Hu wrote:
> In folio_mark_dirty(), it can automatically fallback to
> noop_dirty_folio() if a_ops->dirty_folio is not registered.
> 
> As anon_aops, dev_dax_aops and fb_deferred_io_aops becames empty, remove
> them too.

I'd put the last sentence as 'In dev_dax_aops and fb_deferred_io_aops replacing
.dirty_folio with NULL makes them identical to default (empty_aops) and since
we never compare ->a_ops pointer with either of those, we can remove them
completely'.

There could've been places like
#define is_fb_deferred(mapping) (mapping)->a_ops == fb_deferred_io_aops
and those would've been broken by that.  The fact that there's nothing
of that sort in the tree ought to be mentioned in commit message.

Note that we *do* have places where method table comparisons are used
in predicates like that, so it's not a pure theory; sure, missing that
would've probably ended up with broken build, but that can easily be
dependent upon the config (and that, alas, is also not a pure theory -
BTDT).  In this case the change is correct, fortunately...

Other than that part of commit message -

Acked-by: Al Viro 



Re: [PATCH] dma-buf/sync_file: Use fdget()

2023-05-11 Thread Al Viro
On Fri, May 05, 2023 at 10:22:09AM +0200, Christian König wrote:
> Am 05.05.23 um 05:03 schrieb ye.xingc...@zte.com.cn:
> > From: Ye Xingchen 
> > 
> > convert the fget() use to fdget().
> 
> Well the rational is missing. Why should we do that?

We very definitely should not.  The series appears to be
pure cargo-culting and it's completely wrong.

There is such thing as unwarranted use of fget().  Under some
conditions converting to fdget() is legitimate *and* is an
improvement.  HOWEVER, those conditions are not met in this case.

Background: references in descriptor table do contribute to
struct file refcount.  fget() finds the reference by descriptor
and returns it, having bumped the refcount.  In case when
descriptor table is shared, we must do that - otherwise e.g.
close() or dup2() from another thread could very well have
destroyed the struct file we'd just found.  However, if
descriptor table is *NOT* shared, there's no need to mess
with refcount at all.  Provided that
* we are not grabbing the reference to keep it (stash
into some data structure, etc.); as soon as we return from
syscall, the reference in descriptor table is fair game for
e.g. close(2).  Or exit(2), for that matter.
* we remember whether it was shared or not - we can't
just recheck that when we are done with the file; after all,
descriptor table might have been shared when we looked the file up,
but another thread might've died since then and left it not
shared anymore.
* we do not rip the same reference out of our descriptor
table ourselves - not without seriously convoluted precautions.
Very few places in the kernel can lead to closing descriptors,
so in practice it only becomes a problem when a particularly
ugly ioctl decides that it would be neat to close some descriptor(s).
Example of such convolutions: binder_deferred_fd_close().

fdget() returns a pair that consists of struct file reference
*AND* indication whether we have grabbed a reference.  fdput()
takes such pair.

Both are inlined, and compiler is smart enough to split the
pair into two separate local variables.  The underlying
primitive actually stashes the "have grabbed the refcount"
into the LSB of returned word; see __to_fd() in include/linux/file.h
for details.  It really generates a decent code and a plenty of
places where we want a file by descriptor are just fine with it.

This patch is flat-out broken, since it loses the "have we bumped
the refcount" information - the callers do not get it.

It might be possible to massage the calling conventions to enable
the conversion to fdget(), but it's not obvious that result would
be cleaner and in any case, the patch in question doesn't even
try that.


Re: [PATCH] dma-buf: Use fdget()

2023-05-11 Thread Al Viro
On Fri, May 05, 2023 at 10:18:47AM +0800, ye.xingc...@zte.com.cn wrote:
> From: Ye Xingchen 
> 
> convert the fget() use to fdget().

NAK.


Re: [PATCH] dma-buf/sync_file: Use fdget()

2023-05-11 Thread Al Viro
On Fri, May 05, 2023 at 11:03:39AM +0800, ye.xingc...@zte.com.cn wrote:
> From: Ye Xingchen 
> 
> convert the fget() use to fdget().

NAK.


[PATCH 6/8] dma_buf: no need to bother with file_inode()->i_mapping

2022-08-20 Thread Al Viro
->f_mapping will do just fine

Signed-off-by: Al Viro 
---
 drivers/dma-buf/udmabuf.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/dma-buf/udmabuf.c b/drivers/dma-buf/udmabuf.c
index 38e8767ec371..210473d927d8 100644
--- a/drivers/dma-buf/udmabuf.c
+++ b/drivers/dma-buf/udmabuf.c
@@ -210,7 +210,7 @@ static long udmabuf_create(struct miscdevice *device,
memfd = fget(list[i].memfd);
if (!memfd)
goto err;
-   mapping = file_inode(memfd)->i_mapping;
+   mapping = memfd->f_mapping;
if (!shmem_mapping(mapping) && !is_file_hugepages(memfd))
goto err;
seals = memfd_fcntl(memfd, F_GET_SEALS, 0);
-- 
2.30.2



Re: [PATCH 5/6] dma-buf: remove useless FMODE_LSEEK flag

2022-06-24 Thread Al Viro
On Fri, Jun 24, 2022 at 06:56:30PM +0200, Jason A. Donenfeld wrote:
> This is already on by default.

ITYM "anon_inode_getfile() has already set it, since dma_buf_fops has
non-NULL ->llseek".


Re: [PATCH 4/9] drm: remove the drm file system

2021-03-10 Thread Al Viro
On Tue, Mar 09, 2021 at 04:53:43PM +0100, Christoph Hellwig wrote:
> Just use the generic anon_inode file system.

Are you changing the lifetime rules for that module?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 3/9] powerpc/pseries: remove the ppc-cmm file system

2021-03-10 Thread Al Viro
On Tue, Mar 09, 2021 at 04:53:42PM +0100, Christoph Hellwig wrote:
> Just use the generic anon_inode file system.

Umm...  The only problem I see here is the lifetime rules for
that module, and that's not something introduced in this patchset.
Said that, looks like the logics around that place is duplicated in
cmm.c, vmw_balloon.c and virtion_balloon.c and I wonder if it would
be better off with a helper in mm/balloon.c to be used for that setup...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RFC PKS/PMEM 33/58] fs/cramfs: Utilize new kmap_thread()

2020-10-14 Thread Al Viro
On Tue, Oct 13, 2020 at 08:36:43PM +0100, Matthew Wilcox wrote:

> static inline void copy_to_highpage(struct page *to, void *vfrom, unsigned 
> int size)
> {
>   char *vto = kmap_atomic(to);
> 
>   memcpy(vto, vfrom, size);
>   kunmap_atomic(vto);
> }
> 
> in linux/highmem.h ?

You mean, like
static void memcpy_from_page(char *to, struct page *page, size_t offset, size_t 
len)
{
char *from = kmap_atomic(page);
memcpy(to, from + offset, len);
kunmap_atomic(from);
}

static void memcpy_to_page(struct page *page, size_t offset, const char *from, 
size_t len)
{
char *to = kmap_atomic(page);
memcpy(to + offset, from, len);
kunmap_atomic(to);
}

static void memzero_page(struct page *page, size_t offset, size_t len)
{
char *addr = kmap_atomic(page);
memset(addr + offset, 0, len);
kunmap_atomic(addr);
}

in lib/iov_iter.c?  FWIW, I don't like that "highpage" in the name and
highmem.h as location - these make perfect sense regardless of highmem;
they are normal memory operations with page + offset used instead of
a pointer...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/2] dmabuf/tracing: Add dma-buf trace events

2020-08-05 Thread Al Viro
On Tue, Aug 04, 2020 at 03:44:51PM +, Kalesh Singh wrote:

> Hi Al. Thank you for the comments. Ultimately what we need is to identify 
> processes
> that hold a file reference to the dma-buf. Unfortunately we can't use only
> explicit dma_buf_get/dma_buf_put to track them because when an FD is being 
> shared
> between processes the file references are taken implicitly.
> 
> For example, on the sender side:
>unix_dgram_sendmsg -> send_scm -> __send_scm -> scm_fp_copy -> fget_raw
> and on the receiver side:
>unix_dgram_recvmsg -> scm_recv -> scm_detach_fds -> __scm_install_fd -> 
> get_file
> 
> I understand now that fd_install is not an appropriate abstraction level to 
> track these.
> Is there a more appropriate alternative where we could use to track these 
> implicit file
> references?

There is no single lock that would stabilize the descriptor tables of all
processes.  And there's not going to be one, ever - it would be a contention
point from hell, since that would've been a system-wide lock that would have
to be taken by *ALL* syscalls modifying any descriptor table.  Not going to
happen, for obvious reasons.  Moreover, you would have to have fork(2) take
the same lock, since it does copy descriptor table.  And clone(2) either does
the same, or has the child share the descriptor table of parent.

What's more, a reference to struct file can bloody well survive without
a single descriptor refering to that file.  In the example you've mentioned
above, sender has ever right to close all descriptors it has sent.   Files
will stay opened as long as the references are held in the datagram; when
that datagram is received, the references will be inserted into recepient's
descriptor table.  At that point you again have descriptors refering to
that file, can do any IO on it, etc.

So "the set of processes that hold a file reference to the dma-buf" is
* inherently unstable, unless you are willing to freeze every
process in the system except for the one trying to find that set.
* can remain empty for any amount of time (hours, weeks, whatever),
only to get non-empty later, with syscalls affecting the object in question
done afterwards.

So... what were you going to do with that set if you could calculate it?
If it's really "how do we debug a leak?", it's one thing; in that case
I would suggest keeping track of creation/destruction of objects (not
gaining/dropping references - actual constructors and destructors) to
see what gets stuck around for too long and use fuser(1) to try and locate
the culprits if you see that something *was* living for too long.  "Try"
since the only reference might indeed have been stashed into an SCM_RIGHTS
datagram sitting in a queue of some AF_UNIX socket.  Note that "fuser
needs elevated priveleges" is not a strong argument - the ability to
do that sort of tracking does imply elevated priveleges anyway, and
having a root process taking requests along the lines of "gimme the
list of PIDs that have such-and-such dma_buf in their descriptor table"
is not much of an attack surface.

If you want to use it for something else, you'll need to describe that
intended use; there might be sane ways to do that, but it's hard to
come up with one without knowing what's being attempted...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 1/2] fs: Add fd_install file operation

2020-08-04 Thread Al Viro
On Mon, Aug 03, 2020 at 02:47:18PM +, Kalesh Singh wrote:
> Provides a per process hook for the acquisition of file descriptors,
> despite the method used to obtain the descriptor.

No, with the side of Fuck, No.

Driver has no possible reason to watch know the descriptors involved.
Moreover, it has no possible way to track that information _and_
no locking that could make that viable.

NAKed with extreme prejudice - never bring that idea back.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/2] dmabuf/tracing: Add dma-buf trace events

2020-08-04 Thread Al Viro
On Mon, Aug 03, 2020 at 11:28:31PM +0100, Al Viro wrote:

> IOW, what the hell is that horror for?  You do realize, for example, that 
> there's
> such thing as dup(), right?  And dup2() as well.  And while we are at it, how
> do you keep track of removals, considering the fact that you can stick a file
> reference into SCM_RIGHTS datagram sent to yourself, close descriptors and an 
> hour
> later pick that datagram, suddenly getting descriptor back?
> 
> Besides, "I have no descriptors left" != "I can't be currently sitting in the 
> middle
> of syscall on that sucker"; close() does *NOT* terminate ongoing operations.
> 
> You are looking at the drastically wrong abstraction level.  Please, describe 
> what
> it is that you are trying to achieve.

_IF_ it's "who keeps a particularly long-lived sucker pinned", I would suggest
fuser(1) run when you detect that kind of long-lived dmabuf.  With events 
generated
by their constructors and destructors, and detection of longevity done based on
that.

But that's only a semi-blind guess at the things you are trying to achieve; 
please,
describe what it really is.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH 2/2] dmabuf/tracing: Add dma-buf trace events

2020-08-04 Thread Al Viro
On Mon, Aug 03, 2020 at 09:22:53AM -0700, Suren Baghdasaryan wrote:
> On Mon, Aug 3, 2020 at 9:12 AM Matthew Wilcox  wrote:
> >
> > On Mon, Aug 03, 2020 at 09:00:00AM -0700, Suren Baghdasaryan wrote:
> > > On Mon, Aug 3, 2020 at 8:41 AM Matthew Wilcox  wrote:
> > > >
> > > > On Mon, Aug 03, 2020 at 02:47:19PM +, Kalesh Singh wrote:
> > > > > +static void dma_buf_fd_install(int fd, struct file *filp)
> > > > > +{
> > > > > + trace_dma_buf_fd_ref_inc(current, filp);
> > > > > +}
> > > >
> > > > You're adding a new file_operation in order to just add a new 
> > > > tracepoint?
> > > > NACK.
> > >
> > > Hi Matthew,
> > > The plan is to attach a BPF to this tracepoint in order to track
> > > dma-buf users. If you feel this is an overkill, what would you suggest
> > > as an alternative?
> >
> > I'm sure BPF can attach to fd_install and filter on file->f_ops belonging
> > to dma_buf, for example.
> 
> Sounds like a workable solution. Will explore that direction. Thanks Matthew!

No, it is not a solution at all.

What kind of locking would you use?  With _any_ of those approaches.

How would you use the information that is hopelessly out of 
date/incoherent/whatnot
at the very moment you obtain it?

IOW, what the hell is that horror for?  You do realize, for example, that 
there's
such thing as dup(), right?  And dup2() as well.  And while we are at it, how
do you keep track of removals, considering the fact that you can stick a file
reference into SCM_RIGHTS datagram sent to yourself, close descriptors and an 
hour
later pick that datagram, suddenly getting descriptor back?

Besides, "I have no descriptors left" != "I can't be currently sitting in the 
middle
of syscall on that sucker"; close() does *NOT* terminate ongoing operations.

You are looking at the drastically wrong abstraction level.  Please, describe 
what
it is that you are trying to achieve.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH v2 1/2] video: fbdev: amifb: add FIXME about dead APUS support

2020-06-03 Thread Al Viro
On Tue, Jun 02, 2020 at 02:03:12PM +0200, Geert Uytterhoeven wrote:
> On Tue, Jun 2, 2020 at 1:50 PM Bartlomiej Zolnierkiewicz
>  wrote:
> > On 5/14/20 10:21 PM, Geert Uytterhoeven wrote:
> > > These #ifdefs are relics from APUS (Amiga Power-Up System), which
> > > added a PPC board.  APUS support was killed off a long time ago,
> > > when arch/ppc/ was still king, but these #ifdefs were missed, because
> > > they didn't test for CONFIG_APUS.
> >
> > Add FIXME about using the C code variants (APUS ones) in the future.
> >
> > Reported-by: Al Viro 
> > Reported-by: Geert Uytterhoeven 
> > Signed-off-by: Bartlomiej Zolnierkiewicz 
> 
> Reviewed-by: Geert Uytterhoeven 

FWIW, has anyone managed to boot m68k linux kernel on e.g. FS-UAE?
I have done that on aranym (which is how I'd been doing all
testing for e.g. signal-related m68k patches) and I've seen
references to some out-of-tree qemu variant doing quadra, but
nothing for amiga emulators...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-22 Thread Al Viro
On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > From: Ira Weiny 
> > 
> > The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > enables when vaddr is not in the fixmap.
> > 
> > Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > Signed-off-by: Ira Weiny 
> 
> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> but sparc32 boot tests with SMP enabled still fail with lots of messages
> such as:

BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
shrink the damn thing enough to have the loader cope with it?  I hadn't
been able to do that for the current mainline ;-/
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-22 Thread Al Viro
On Thu, May 21, 2020 at 11:46:12PM +0100, Al Viro wrote:
> On Thu, May 21, 2020 at 03:20:46PM -0700, Guenter Roeck wrote:
> > On 5/21/20 10:27 AM, Al Viro wrote:
> > > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> > >> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> > >>> From: Ira Weiny 
> > >>>
> > >>> The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> > >>> enables when vaddr is not in the fixmap.
> > >>>
> > >>> Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> > >>> Signed-off-by: Ira Weiny 
> > >>
> > >> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> > >> but sparc32 boot tests with SMP enabled still fail with lots of messages
> > >> such as:
> > > 
> > > BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
> > > shrink the damn thing enough to have the loader cope with it?  I hadn't
> > > been able to do that for the current mainline ;-/
> > > 
> > 
> > defconfig seems to work just fine, even after enabling various debug
> > and file system options.
> 
> The hell?  How do you manage to get the kernel in?  sparc32_defconfig
> ends up with 5316876 bytes unpacked...

Incidentally, trying to load it via -kernel/-initrd leads to
Configuration device id QEMU version 1 machine id 64
Probing SBus slot 0 offset 0
Probing SBus slot 1 offset 0
Probing SBus slot 2 offset 0
Probing SBus slot 3 offset 0
Probing SBus slot 15 offset 0
Invalid FCode start byte
CPUs: 1 x TI,TMS390Z55
UUID: ----
Welcome to OpenBIOS v1.1 built on Dec 27 2018 19:17
  Type 'help' for detailed information
[sparc] Kernel already loaded
switching to new context:
PROMLIB: obio_ranges 1
PROMLIB: Sun Boot Prom Version 3 Revision 2
Linux version 5.7.0-rc1-2-gcf51e129b968 (al@duke) (gcc version 6.3.0 
20170516 (Debian 6.3.0-18), GNU ld (GNU Binutils for Debian) 2.28) #32 Thu May 
21 18:36:07 EDT 2020
printk: bootconsole [earlyprom0] enabled
ARCH: SUN4M
TYPE: Sun4m SparcStation10/20
Ethernet address: 52:54:00:12:34:56
303MB HIGHMEM available.
OF stdout device is: /obio/zs@0,10:a
PROM: Built device tree with 30051 bytes of memory.
Booting Linux...
Power off control detected.
Kernel panic - not syncing: Failed to allocate memory for percpu areas.
CPU: 0 PID: 0 Comm: swapper Not tainted 5.7.0-rc1-2-gcf51e129b968 #32
[f04f92a8 : 
setup_per_cpu_areas+0x58/0x90 ] 
[f04edbf4 : 
start_kernel+0xc0/0x4a0 ] 
[f04ed43c : 
continue_boot+0x324/0x334 ] 
[ : 
0x0 ] 

Press Stop-A (L1-A) from sun keyboard or send break
twice on console to return to the boot prom
---[ end Kernel panic - not syncing: Failed to allocate memory for percpu 
areas. ]---

Giving guest more RAM doesn't change the outcome (well, the number HIGHMEM line 
is
obviously higher, but that's it).

So which sparc32 kernel have you booted with defconfig and how have you done
that?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-22 Thread Al Viro
On Fri, May 22, 2020 at 02:29:50AM +0100, Al Viro wrote:
> On Thu, May 21, 2020 at 06:11:08PM -0700, Guenter Roeck wrote:
> 
> > Mainline, with:
> > 
> > qemu-system-sparc -M SS-4 -kernel arch/sparc/boot/zImage -no-reboot \
> > -snapshot -drive file=rootfs.ext2,format=raw,if=scsi \
> > -append "panic=-1 slub_debug=FZPUA root=/dev/sda console=ttyS0"
> > -nographic -monitor none
> > 
> > The machine doesn't really matter, though.
> 
> It does, unfortunately - try that with SS-10 and watch what happens ;-/

Ugh...  It's actually something in -m handling: -m 256 passes, -m 512
leads to that panic.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-22 Thread Al Viro
On Thu, May 21, 2020 at 06:11:08PM -0700, Guenter Roeck wrote:

> Mainline, with:
> 
> qemu-system-sparc -M SS-4 -kernel arch/sparc/boot/zImage -no-reboot \
>   -snapshot -drive file=rootfs.ext2,format=raw,if=scsi \
>   -append "panic=-1 slub_debug=FZPUA root=/dev/sda console=ttyS0"
>   -nographic -monitor none
> 
> The machine doesn't really matter, though.

It does, unfortunately - try that with SS-10 and watch what happens ;-/
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] arch/{mips,sparc,microblaze,powerpc}: Don't enable pagefault/preempt twice

2020-05-22 Thread Al Viro
On Thu, May 21, 2020 at 03:20:46PM -0700, Guenter Roeck wrote:
> On 5/21/20 10:27 AM, Al Viro wrote:
> > On Tue, May 19, 2020 at 09:54:22AM -0700, Guenter Roeck wrote:
> >> On Mon, May 18, 2020 at 11:48:43AM -0700, ira.we...@intel.com wrote:
> >>> From: Ira Weiny 
> >>>
> >>> The kunmap_atomic clean up failed to remove one set of pagefault/preempt
> >>> enables when vaddr is not in the fixmap.
> >>>
> >>> Fixes: bee2128a09e6 ("arch/kunmap_atomic: consolidate duplicate code")
> >>> Signed-off-by: Ira Weiny 
> >>
> >> microblazeel works with this patch, as do the nosmp sparc32 boot tests,
> >> but sparc32 boot tests with SMP enabled still fail with lots of messages
> >> such as:
> > 
> > BTW, what's your setup for sparc32 boot tests?  IOW, how do you manage to
> > shrink the damn thing enough to have the loader cope with it?  I hadn't
> > been able to do that for the current mainline ;-/
> > 
> 
> defconfig seems to work just fine, even after enabling various debug
> and file system options.

The hell?  How do you manage to get the kernel in?  sparc32_defconfig
ends up with 5316876 bytes unpacked...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-05 Thread Al Viro
On Mon, May 04, 2020 at 01:17:41PM -0700, Ira Weiny wrote:

> > || * arm: much, much worse.  We have several files that pull 
> > linux/highmem.h:
> > || arch/arm/mm/cache-feroceon-l2.c, arch/arm/mm/cache-xsc3l2.c,
> > || arch/arm/mm/copypage-*.c, arch/arm/mm/dma-mapping.c, arch/arm/mm/flush.c,
> > || arch/arm/mm/highmem.c, arch/arm/probes/uprobes/core.c,
> > || arch/arm/include/asm/kvm_mmu.h (kmap_atomic_pfn()).
> > || Those are fine, but we also have this:
> > || arch/arm/include/asm/pgtable.h:200:#define __pte_map(pmd)   
> > (pte_t *)kmap_atomic(pmd_page(*(pmd)))
> > || arch/arm/include/asm/pgtable.h:208:#define pte_offset_map(pmd,addr) 
> > (__pte_map(pmd) + pte_index(addr))
> > || and sure as hell, asm/pgtable.h does *NOT* pull linux/highmem.h.
> 
> It does not pull asm/highmem.h either...

No, but the users of those macros need to be considered.

> > || #define pte_offset_map(dir, addr)   \
> > || ((pte_t *) kmap_atomic(pmd_page(*(dir))) + pte_index(addr))
> > || One pte_offset_map user in arch/microblaze:
> > || arch/microblaze/kernel/signal.c:207:ptep = pte_offset_map(pmdp, 
> > address);
> > || Messy, but doesn't require any changes (we have asm/pgalloc.h included
> > || there, and that pull linux/highmem.h).
> 
> AFAICS asm/pgtable.h does not include asm/highmem.h here...
> 
> So looks like arch/microblaze/kernel/signal.c will need linux/highmem.h

See above - line 39 in there is
#include 
and line 14 in arch/microblaze/include/asm/pgalloc.h is
#include 
It's conditional upon CONFIG_MMU in there, but so's the use of
pte_offset_map() in arch/microblaze/kernel/signal.c 

So it shouldn't be a problem.

> > || * xtensa: users in arch/xtensa/kernel/pci-dma.c, 
> > arch/xtensa/mm/highmem.c,
> > || arch/xtensa/mm/cache.c and arch/xtensa/platforms/iss/simdisk.c (all pull
> > || linux/highmem.h).
> 
> Actually
> 
> arch/xtensa/mm/cache.c gets linux/highmem.h from linux/pagemap.h
> 
> arch/xtensa/platforms/iss/simdisk.c may have an issue?
>   linux/blkdev.h -> CONFIG_BLOCK -> linux/pagemap.h -> linux/highmem.h
>   But simdisk.c requires BLK_DEV_SIMDISK -> CONFIG_BLOCK...
>   

Yep - see above re major chain of indirect includes conditional upon 
CONFIG_BLOCK
and its uses in places that only build with such configs.  There's a plenty of
similar considerations outside of arch/*, unfortunately...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-04 Thread Al Viro
On Sun, May 03, 2020 at 06:09:01PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> The kmap infrastructure has been copied almost verbatim to every architecture.
> This series consolidates obvious duplicated code by defining core functions
> which call into the architectures only when needed.
> 
> Some of the k[un]map_atomic() implementations have some similarities but the
> similarities were not sufficient to warrant further changes.
> 
> In addition we remove a duplicate implementation of kmap() in DRM.
> 
> Testing was done by 0day to cover all the architectures I can't readily
> build/test.

OK...  Looking through my old notes on kmap unification (this winter, never
went anywhere),

* arch/mips/mm/cache.c ought to use linux/highmem.h, not asm/highmem.h
I suspect that your series doesn't build on some configs there.  Hadn't
verified that, though.

* kmap_atomic_to_page() is dead, but not quite gone - csky and nds32 brought
the damn thing back (nds32 - only an extern).  It needs killin'...

* parisc is (arguably) abusing kunmap()/kunmap_atomic() for cache flushing.
Replace the bulk of its highmem.h with
#define ARCH_HAS_FLUSH_ON_KUNMAP
#define arch_before_kunmap flush_kernel_dcache_page_addr
and have default kunmap()/kunmap_atomic() do
#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
arch_before_kunmap(page_address(page));
#endif
and
#ifdef ARCH_HAS_FLUSH_ON_KUNMAP
arch_before_kunmap(addr);
#endif
resp.  Kills ARCH_HAS_KMAP along with ifdefs on it, makes parisc use somewhat
less hacky.

I'd suggest checking various configs on mips - that's likely to cause headache.
Said that, my analysis of include chains back then is pretty much worthless
by now - I really hate the amount of indirect include chains leading to that
sucker on some, but not all configs ;-/  IIRC, the proof that everything
using kmap*/kunmap* would pull linux/highmem.h regardless of config took several
hours of digging, ran for several pages and had been hopelessly brittle.
arch/mips/mm/cache.c was the only exception caught by it, but these days
there might be more.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V2 00/11] Subject: Remove duplicated kmap code

2020-05-04 Thread Al Viro
On Sun, May 03, 2020 at 10:04:47PM -0700, Ira Weiny wrote:

> Grepping for 'asm/highmem.h' and investigations don't reveal any issues...  
> But
> you do have me worried.  That said 0-day has been crunching on multiple
> versions of this series without issues such as this (save the mips issue
> above).
> 
> I have to say it would be nice if the relation between linux/highmem.h and
> asm/highmem.h was more straightforward.

IIRC, the headache was due to asm/pgtable.h on several architectures and
asm/cacheflush.h on parisc.



|| IOW, there's one in linux/highmem.h (conditional on !CONFIG_HIGHMEM,
|| !ARCH_HAS_KMAP) and several per-architecture variants, usually declared in
|| their asm/highmem.h.  In three of those (microblaze, parisc and powerpc) 
these
|| are inlines (parisc one identical to linux/highmem.h, lives in 
asm/cacheflush.h,
|| powerpc and microblaze ones calling kmap_atomic_prot() which is defined in
|| arch/$ARCH/mm/highmem.c).
|| 
|| parisc case is weird - essentially, they want to call 
|| flush_kernel_dcache_page_addr() at the places where kunmap/kunmap_atomic
|| is done.  And they do so despite not selecting HIGHMEM, with definitions
|| in usual place.  They do have ARCH_HAS_KMAP defined, which prevents
|| getting buggered in linux/highmem.h.  ARCH_HAS_KMAP is parisc-unique,
|| BTW, and checked only in linux/highmem.h.
|| 
|| All genuine arch-specific variants are defined in (or call functions
|| defined in) translation units that are only included CONFIG_HIGHMEM builds.
|| 
|| It would be tempting to consolidate those, e.g. by adding 
__kmap_atomic()
|| and __kmap_atomic_prot() without that boilerplate, with universal 
kmap_atomic()
|| and kmap_atomic_prot() in linux/highmem.h.  Potential problem with that would
|| be something that pulls ash/highmem.h (or asm/cacheflush.h in case of parisc)
|| directly and uses kmap_atomic/kmap_atomic_prot.  There's not a lot places
|| pulling asm/highmem.h, and many of them are not even in includes:
|| 
|| arch/arm/include/asm/efi.h:13:#include 
|| arch/arm/mm/dma-mapping.c:31:#include 
|| arch/arm/mm/flush.c:14:#include 
|| arch/arm/mm/mmu.c:27:#include 
|| arch/mips/include/asm/pgtable-32.h:22:#include 
|| arch/mips/mm/cache.c:19:#include 
|| arch/mips/mm/fault.c:28:#include /* For 
VMALLOC_END */
|| arch/nds32/include/asm/pgtable.h:60:#include 
|| arch/x86/kernel/setup_percpu.c:20:#include 
|| include/linux/highmem.h:35:#include 
|| 
|| Users of asm/cacheflush.h are rather more numerous; however, anything
|| outside of parisc-specific code has to pull linux/highmem.h, or it won't see
|| the definitions of kmap_atomic/kmap_atomic_prot at all.  arch/parisc itself
|| has no callers of those.
|| 
|| Outside of arch/* there is a plenty of callers.  However, the following is
|| true: all instances of kmap_atomic or kmap_atomic_prot outside of arch/*
|| are either inside the linux/highmem.h or are preceded by include of
|| linux/highmem.h on any build that sees them (there is a common include
|| chain that is conditional upon CONFIG_BLOCK, but it's only needed in
|| drivers that are BLOCK-dependent).  It was not fun to verify, to put
|| it mildly...
|| 
|| So for parisc we have no problem - leaving 
__kmap_atomic()/__kmap_atomic_prot()
|| in asm/cachefile.h and adding universal wrappers in linux/highmem.h will be
|| fine.  For other architectures the things might be trickier.
|| 
|| * arc: all users in arch/arc/ are within arch/arc/mm/{cache,highmem}.c;
|| both pull linux/highmem.h.  We are fine.
|| 
|| * arm: much, much worse.  We have several files that pull linux/highmem.h:
|| arch/arm/mm/cache-feroceon-l2.c, arch/arm/mm/cache-xsc3l2.c,
|| arch/arm/mm/copypage-*.c, arch/arm/mm/dma-mapping.c, arch/arm/mm/flush.c,
|| arch/arm/mm/highmem.c, arch/arm/probes/uprobes/core.c,
|| arch/arm/include/asm/kvm_mmu.h (kmap_atomic_pfn()).
|| Those are fine, but we also have this:
|| arch/arm/include/asm/pgtable.h:200:#define __pte_map(pmd)   
(pte_t *)kmap_atomic(pmd_page(*(pmd)))
|| arch/arm/include/asm/pgtable.h:208:#define pte_offset_map(pmd,addr) 
(__pte_map(pmd) + pte_index(addr))
|| and sure as hell, asm/pgtable.h does *NOT* pull linux/highmem.h.
|| 
|| Fortunately, the users of pte_offset_map() (__pte_map() has no other users)
|| are few, both in arch/arm and outside of arch.  All arm ones are pulling
|| linux/highmem (arch/arm/mm/{pgd,fault*}.c).  Outside of arch we have several
|| that pull highmem.h (by way of rmap.h or pagemap.h, usually):
|| fs/userfaultfd.c, mm/gup.c, mm/hmm.c, mm/huge_memory.c,
|| mm/khugepaged.c, mm/memory-failure.c, mm/memory.c, mm/migrate.c,
|| mm/mremap.c, mm/page_vma_mapped.c, mm/swap_state.c, mm/swapfile.c,
|| mm/vmalloc.c
|| and then there are these in linux/mm.h:
|| 
|| #define pte_offset_map_lock(mm, pmd, address, ptlp) \
|| ({  \
|| spinlock_t *__ptl = pte_lockptr(mm, 

Re: [PATCH V1 08/10] arch/kmap: Don't hard code kmap_prot values

2020-05-01 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:43PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> To support kmap_atomic_prot() on all architectures each arch must
> support protections passed in to them.
> 
> Change csky, mips, nds32 and xtensa to use their global kmap_prot value
> rather than a hard coded value which was equal.

Minor nitpick: it's probably worth pointing out that kmap_prot on those
is a constant...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V1 09/10] arch/kmap: Define kmap_atomic_prot() for all arch's

2020-05-01 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:44PM -0700, ira.we...@intel.com wrote:

> -static inline void *kmap_atomic(struct page *page)
> +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
>  {
>   preempt_disable();
>   pagefault_disable();
>   if (!PageHighMem(page))
>   return page_address(page);
> - return kmap_atomic_high(page);
> + return kmap_atomic_high_prot(page, prot);
>  }
> +#define kmap_atomic(page)kmap_atomic_prot(page, kmap_prot)

OK, so it *was* just a bisect hazard - you return to original semantics
wrt preempt_disable()...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V1 05/10] arch/kmap_atomic: Consolidate duplicate code

2020-05-01 Thread Al Viro
On Thu, Apr 30, 2020 at 01:38:40PM -0700, ira.we...@intel.com wrote:
> From: Ira Weiny 
> 
> Every arch has the same code to ensure atomic operations and a check for
> !HIGHMEM page.
> 
> Remove the duplicate code by defining a core kmap_atomic() which only
> calls the arch specific kmap_atomic_high() when the page is high memory.

Err  AFAICS, you've just silently changed the semantics for
kmap_atomic_prot() here.  And while most of the callers are converted,
drivers/gpu/drm/ttm/ttm_bo_util.c one is not, so at the very least it's
a bisect hazard...

And I would argue that having kmap_atomic() differ from kmap_atomic_prot()
wrt disabling preempt is asking for trouble.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH V1 09/10] arch/kmap: Define kmap_atomic_prot() for all arch's

2020-05-01 Thread Al Viro
On Fri, May 01, 2020 at 03:37:34AM +0100, Al Viro wrote:
> On Thu, Apr 30, 2020 at 01:38:44PM -0700, ira.we...@intel.com wrote:
> 
> > -static inline void *kmap_atomic(struct page *page)
> > +static inline void *kmap_atomic_prot(struct page *page, pgprot_t prot)
> >  {
> > preempt_disable();
> > pagefault_disable();
> > if (!PageHighMem(page))
> > return page_address(page);
> > -   return kmap_atomic_high(page);
> > +   return kmap_atomic_high_prot(page, prot);
> >  }
> > +#define kmap_atomic(page)  kmap_atomic_prot(page, kmap_prot)
> 
> OK, so it *was* just a bisect hazard - you return to original semantics
> wrt preempt_disable()...

FWIW, how about doing the following: just before #5/10 have a patch
that would touch only microblaze, ppc and x86 splitting their
kmap_atomic_prot() into an inline helper + kmap_atomic_high_prot().
Then your #5 would leave their kmap_atomic_prot() as-is (it would
use kmap_atomic_prot_high() instead).  The rest of the series plays
out pretty much the same way it does now, and wrappers on those
3 architectures would go away when an identical generic one is
introduced in this commit (#9/10).

AFAICS, that would avoid the bisect hazard and might even end
up with less noise in the patches...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-07 Thread Al Viro
On Wed, Aug 07, 2019 at 08:30:02AM +0200, Christoph Hellwig wrote:
> On Tue, Aug 06, 2019 at 12:50:10AM -0700, Hugh Dickins wrote:
> > Though personally I'm averse to managing "f"objects through
> > "m"interfaces, which can get ridiculous (notably, MADV_HUGEPAGE works
> > on the virtual address of a mapping, but the huge-or-not alignment of
> > that mapping must have been decided previously).  In Google we do use
> > fcntls F_HUGEPAGE and F_NOHUGEPAGE to override on a per-file basis -
> > one day I'll get to upstreaming those.
> 
> Such an interface seems very useful, although the two fcntls seem a bit
> odd.
> 
> But I think the point here is that the i915 has its own somewhat odd
> instance of tmpfs.  If we could pass the equivalent of the huge=*
> options to shmem_file_setup all that garbage (including the
> shmem_file_setup_with_mnt function) could go away.

... or follow shmem_file_super() with whatever that fcntl maps to
internally.  I would really love to get rid of that i915 kludge.


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-07 Thread Al Viro
On Tue, Aug 06, 2019 at 12:50:10AM -0700, Hugh Dickins wrote:

> that mapping must have been decided previously).  In Google we do use
> fcntls F_HUGEPAGE and F_NOHUGEPAGE to override on a per-file basis -
> one day I'll get to upstreaming those.

That'd be nice - we could kill the i915 wierd private shmem instance,
along with some kludges in mm/shmem.c.


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-05 Thread Al Viro
On Mon, Aug 05, 2019 at 07:12:55PM +0100, Al Viro wrote:
> On Tue, Aug 06, 2019 at 01:03:06AM +0900, Sergey Senozhatsky wrote:
> > tmpfs does not set ->remount_fs() anymore and its users need
> > to be converted to new mount API.
> 
> Could you explain why the devil do you bother with remount at all?
> Why not pass the right options when mounting the damn thing?

Incidentally, the only remaining modular user of get_fs_type() is the
same i915_gemfs.c.  And I wonder if we should aim for unexporting
the damn thing instead of exporting put_filesystem()...

Note that users in tomoyo and apparmor are bogus - they are in the
instances of ill-defined method that needs to be split and moved,
with the lookups (fs type included) replaced with callers passing
the values they look up and will end up using.

IOW, outside of core VFS we have very few legitimate users, and the
one in kernel/trace might be better off as vfs_submount_by_name().


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-05 Thread Al Viro
On Mon, Aug 05, 2019 at 07:12:55PM +0100, Al Viro wrote:
> On Tue, Aug 06, 2019 at 01:03:06AM +0900, Sergey Senozhatsky wrote:
> > tmpfs does not set ->remount_fs() anymore and its users need
> > to be converted to new mount API.
> 
> Could you explain why the devil do you bother with remount at all?
> Why not pass the right options when mounting the damn thing?

... and while we are at it, I really wonder what's going on with
that gemfs thing - among the other things, this is the only
user of shmem_file_setup_with_mnt().  Sure, you want your own
options, but that brings another question - is there any reason
for having the huge=... per-superblock rather than per-file?

After all, the readers of ->huge in mm/shmem.c are
mm/shmem.c:582: (shmem_huge == SHMEM_HUGE_FORCE || sbinfo->huge) &&
is_huge_enabled(), sbinfo is an explicit argument

mm/shmem.c:1799:switch (sbinfo->huge) {
shmem_getpage_gfp(), sbinfo comes from inode

mm/shmem.c:2113:if (SHMEM_SB(sb)->huge == SHMEM_HUGE_NEVER)
shmem_get_unmapped_area(), sb comes from file

mm/shmem.c:3531:if (sbinfo->huge)
mm/shmem.c:3532:seq_printf(seq, ",huge=%s", 
shmem_format_huge(sbinfo->huge));
->show_options()
mm/shmem.c:3880:switch (sbinfo->huge) {
shmem_huge_enabled(), sbinfo comes from an inode

And the only caller of is_huge_enabled() is shmem_getattr(), with sbinfo
picked from inode.

So is there any reason why the hugepage policy can't be per-file, with
the current being overridable default?


Re: [PATCHv2 2/3] i915: convert to new mount API

2019-08-05 Thread Al Viro
On Tue, Aug 06, 2019 at 01:03:06AM +0900, Sergey Senozhatsky wrote:
> tmpfs does not set ->remount_fs() anymore and its users need
> to be converted to new mount API.

Could you explain why the devil do you bother with remount at all?
Why not pass the right options when mounting the damn thing?


Re: [PATCH] drm: return -EFAULT if copy_one_buf() fails

2019-06-26 Thread Al Viro
On Tue, Jun 18, 2019 at 03:56:23PM +0300, Dan Carpenter wrote:
> The copy_to_user() function returns the number of bytes remaining to be
> copied, but we want to return -EFAULT.  This function is called from
> __drm_legacy_infobufs() which expects negative error codes.
> 
> Fixes: 5c7640ab6258 ("switch compat_drm_infobufs() to drm_ioctl_kernel()")
> Signed-off-by: Dan Carpenter 
> ---
> This goes through Al's tree and not through drm.  Presumably this patch
> will just get folded into the original.

Wha..?  The original has been in mainline since v4.13, so it's a bit too
late to fold anything into it...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2] drm: return -EFAULT if copy_to_user() fails

2019-06-25 Thread Al Viro
On Tue, Jun 18, 2019 at 01:16:29PM -0400, Sean Paul wrote:
> On Tue, Jun 18, 2019 at 04:18:43PM +0300, Dan Carpenter wrote:
> > The copy_from_user() function returns the number of bytes remaining
> > to be copied but we want to return a negative error code.  Otherwise
> > the callers treat it as a successful copy.
> > 
> > Signed-off-by: Dan Carpenter 
> 
> Thanks Dan, I've applied this to drm-misc-fixes.

FWIW, Acked-by: Al Viro 


Re: [PATCH v3 12/26] compat_ioctl: move more drivers to compat_ptr_ioctl

2019-04-26 Thread Al Viro
On Thu, Apr 25, 2019 at 12:21:53PM -0300, Mauro Carvalho Chehab wrote:

> If I understand your patch description well, using compat_ptr_ioctl
> only works if the driver is not for s390, right?

No; s390 is where "oh, just set ->compat_ioctl same as ->unlocked_ioctl
and be done with that; compat_ptr() is a no-op anyway" breaks.  IOW,
s390 is the reason for having compat_ptr_ioctl() in the first place;
that thing works on all biarch architectures, as long as all stuff
handled by ->ioctl() takes pointer to arch-independent object as
argument.  IOW,
argument ignored => OK
any arithmetical type => no go, compat_ptr() would bugger it
pointer to int => OK
pointer to string => OK
pointer to u64 => OK
pointer to struct {u64 addr; char s[11];} => OK
pointer to long => needs explicit handler
pointer to struct {void *addr; char s[11];} => needs explicit handler
pointer to struct {int x; u64 y;} => needs explicit handler on amd64
For "just use ->unlocked_ioctl for ->ioctl" we have
argument ignored => OK
any arithmetical type => OK
any pointer => instant breakage on s390, in addtion to cases that break
with compat_ptr_ioctl().

Probably some form of that ought to go into commit message for 
compat_ptr_ioctl()
introduction...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v3 12/26] compat_ioctl: move more drivers to compat_ptr_ioctl

2019-04-26 Thread Al Viro
On Thu, Apr 25, 2019 at 05:55:23PM +0200, Arnd Bergmann wrote:
> On Thu, Apr 25, 2019 at 5:35 PM Al Viro  wrote:
> >
> > On Thu, Apr 25, 2019 at 12:21:53PM -0300, Mauro Carvalho Chehab wrote:
> >
> > > If I understand your patch description well, using compat_ptr_ioctl
> > > only works if the driver is not for s390, right?
> >
> > No; s390 is where "oh, just set ->compat_ioctl same as ->unlocked_ioctl
> > and be done with that; compat_ptr() is a no-op anyway" breaks.  IOW,
> > s390 is the reason for having compat_ptr_ioctl() in the first place;
> > that thing works on all biarch architectures, as long as all stuff
> > handled by ->ioctl() takes pointer to arch-independent object as
> > argument.  IOW,
> > argument ignored => OK
> > any arithmetical type => no go, compat_ptr() would bugger it
> > pointer to int => OK
> > pointer to string => OK
> > pointer to u64 => OK
> > pointer to struct {u64 addr; char s[11];} => OK
> 
> To be extra pedantic, the 'struct {u64 addr; char s[11];} '
> case is also broken on x86, because sizeof (obj) is smaller
> on i386, even though the location of the members are
> the same. i.e. you can copy_from_user() this, but not
> copy_to_user(), which overwrites 4 bytes after the end of
> the 20-byte user structure.

D'oh!  FWIW, it might be worth putting into Documentation/ somewhere;
basically, what is and what isn't biarch-neutral.

Or arch-neutral, for that matter - it's very close.  The only real
exception, IIRC, is an extra twist on m68k, where int behaves
like x86 long long - its alignment is only half its size, so
sizeof(struct {char c; int x;}) is 6, not 8 as everywhere
else.  Irrelevant for biarch, thankfully (until somebody gets insane
enough to implement 64bit coldfire, kernel port for it *and* biarch
support for m68k binaries on that thing, that is)...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v3 14/26] compat_ioctl: move ATYFB_CLK handling to atyfb driver

2019-04-18 Thread Al Viro
On Tue, Apr 16, 2019 at 10:25:35PM +0200, Arnd Bergmann wrote:
> +static int atyfb_compat_ioctl(struct fb_info *info, u_int cmd, u_long arg)
> +{
> + return atyfb_ioctl(info, cmd, (u_long)compat_ptr(arg));
> +}
> +#endif

Huh?  Why isn't that using compat_ioctl_ptr()?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v3 14/26] compat_ioctl: move ATYFB_CLK handling to atyfb driver

2019-04-18 Thread Al Viro
On Wed, Apr 17, 2019 at 10:27:00PM +0100, Al Viro wrote:
> On Tue, Apr 16, 2019 at 10:25:35PM +0200, Arnd Bergmann wrote:
> > +static int atyfb_compat_ioctl(struct fb_info *info, u_int cmd, u_long arg)
> > +{
> > +   return atyfb_ioctl(info, cmd, (u_long)compat_ptr(arg));
> > +}
> > +#endif
> 
> Huh?  Why isn't that using compat_ioctl_ptr()?

Oh, I see...  Nevermind, then.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: INFO: rcu detected stall in sys_sendfile64 (2)

2019-03-12 Thread Al Viro
On Mon, Mar 11, 2019 at 08:59:00PM -0700, syzbot wrote:
> syzbot has bisected this bug to:
> 
> commit 34e07e42c55aeaa78e93b057a6664e2ecde3fadb
> Author: Chris Wilson 
> Date:   Thu Feb 8 10:54:48 2018 +
> 
> drm/i915: Add missing kerneldoc for 'ent' in i915_driver_init_early
> 
> bisection log:  https://syzkaller.appspot.com/x/bisect.txt?x=1322028320
> start commit:   34e07e42 drm/i915: Add missing kerneldoc for 'ent' in i915..
> git tree:   upstream
> final crash:https://syzkaller.appspot.com/x/report.txt?x=10a2028320
> console output: https://syzkaller.appspot.com/x/log.txt?x=1722028320
> kernel config:  https://syzkaller.appspot.com/x/.config?x=abc3dc9b7a900258
> dashboard link: https://syzkaller.appspot.com/bug?extid=1505c80c74256c6118a5
> userspace arch: amd64
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=12c4dc28c0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=15df4108c0
> 
> Reported-by: syzbot+1505c80c74256c611...@syzkaller.appspotmail.com
> Fixes: 34e07e42 ("drm/i915: Add missing kerneldoc for 'ent' in
> i915_driver_init_early")

Umm...  Might be a good idea to add some plausibility filters - it is,
in theory, possible that adding a line in a comment changes behaviour
(without compiler bugs, even - playing with __LINE__ is all it would
take), but the odds that it's _not_ a false positive are very low.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel

Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-17 Thread Al Viro
On Fri, Sep 14, 2018 at 01:35:06PM -0700, Darren Hart wrote:
 
> Acked-by: Darren Hart (VMware) 
> 
> As for a longer term solution, would it be possible to init fops in such
> a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg
> so we don't have to duplicate this boilerplate for every ioctl fops
> structure?

Bad idea, that...  Because several years down the road somebody will add
an ioctl that takes an unsigned int for argument.  Without so much as looking
at your magical mystery macro being used to initialize file_operations.

FWIW, I would name that helper in more blunt way - something like
compat_ioctl_only_compat_pointer_ioctls_here()...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH] drm: Add missing field copy in compat_drm_version

2017-07-14 Thread Al Viro
On Thu, Jul 13, 2017 at 02:36:55PM +0200, Daniel Vetter wrote:
> On Wed, Jul 12, 2017 at 02:18:32PM +0800, Jeffy Chen wrote:
> > DRM_IOCTL_VERSION is supposed to update the name_len/date_len/desc_len
> > fields to user.
> > 
> > Fixes: 012c6741c6aa("switch compat_drm_version() to drm_ioctl_kernel()")
> > Signed-off-by: Jeffy Chen <jeffy.c...@rock-chips.com>
> 
> Reviewed-by: Daniel Vetter <daniel.vet...@ffwll.ch>
> 
> Linus, since Dave is a bit out with flu and this bug only exists in your
> tree (not yet in drm-next), can you pls apply this directly? It's a fumble
> in Al's rework. Direct mbox link from patchwork, in case you don't have
> that in your archives anywhere:

(Belated) ACKed-by: Al Viro <v...@zeniv.linux.org.uk>
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH][drm-next] drm compat: ensure mode in drm_agp_info is being copied

2017-07-04 Thread Al Viro
On Tue, Jul 04, 2017 at 05:48:21PM +0100, Colin King wrote:
> From: Colin Ian King 
> 
> A recent compat change removed the copying of i32.mode from info.mode.
> Add it back in to fix this removal as we currently are leaking information
> from the stack.
> 
> Detected by CoverityScan, CID#1449374 ("Unitialized scalar variable")

Folded and pushed out.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: __user with scalar data types

2017-06-19 Thread Al Viro
On Mon, Jun 19, 2017 at 10:32:18PM +0200, Luc Van Oostenryck wrote:
> On Mon, Jun 19, 2017 at 10:15:09AM -0600, Jordan Crouse wrote:
> > struct uapistruct {
> > ...
> > __u64 __user myptr;
> > ---
> > };
> > 
> > And then converting it for use in the kernel as such:
> > 
> > {
> > void __user *userptr = (void __user *)(uintptr_t)args->myptr;
> > 
> > copy_from_user(local, userptr, size);
> > ...
> > }
> > 
> > The problem is that sparse doesn't like the momentary switch to
> > uintptr_t:
> > 
> > warning: dereference of noderef expression
> 
> This warning doesn't come from the cast to uintptr_t but
> simply from dereferencing the field which can't be dereferenced
> since it's marked as '__user'. In other words, doing
> 'args->myptr' rightfully trigger the warning and no cast
> will or should stop that.
> 
> Also, you can't expect the '__user' to be transmitted from
> 'myptr' to the pointer (without taking the address of 'myptr').
> It's exactly like 'const int' vs. 'const int *': the '__user' or
> the 'const' is not at the same level in the type hierarchy
> ('const object' vs. 'non-const pointer to const object').

Besides, suppose you add a special type for that.  How would it
have to behave, really?  AFAICS, you want something similar to
__bitwise, except that (assuming this type is T)
T + integer => T
T - integer => T
T & integer => integer
T | integer => T
T - T => integer (quietly decay to underlying type for both
arguments, then treat as normal -)
T & T => T (probably, but might be worth a warning)
T | T => T (ditto)
comparison - same as for __bitwise
constant conversion: 0 should convert clean, anything else - a warning
cast to pointer => warn unless the target type is __user?  But that's
not going to help with cast through uintptr_t...
?: as usual
any other arithmetics => warn and decay to underlying integer type

It might be not impossible to implement, but it sure as hell won't be __user
and it'll need careful thinking about the semantics of those annotations.
The outline above is just that - figuring out if there are any nasty corner
cases will take some work.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: __user with scalar data types

2017-06-19 Thread Al Viro
On Mon, Jun 19, 2017 at 10:15:09AM -0600, Jordan Crouse wrote:

> Which raised a bikeshed debate over whether it is appropriate to mark a scalar
> type as __user.  My opinion is that it is appropriate because __user should 
> mark
> user memory regardless of the container.

What the hell?  __user is a qualifier like const, volatile, etc.  It's a 
property
of *pointer* *type*.  Not some nebulous "marks userland memory" thing.

> I'm looking for opinions or semi-authoritative edicts to determine if we 
> should
> either start changing our uapi headers or go off and try to figure out how to
> make sparse understand this particular usage.

Stop cargo-culting, please.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [PATCH RESEND] drm/via: use get_user_pages_unlocked()

2017-02-28 Thread Al Viro
On Tue, Feb 28, 2017 at 10:01:10AM +0100, Daniel Vetter wrote:

> > +   ret = get_user_pages_unlocked((unsigned long)xfer->mem_addr,
> > +   vsg->num_pages, vsg->pages,
> > +   (vsg->direction == DMA_FROM_DEVICE) ? FOLL_WRITE : 0);

Umm...  Why not
ret = get_user_pages_fast((unsigned long)xfer->mem_addr,
vsg->num_pages,
vsg->direction == DMA_FROM_DEVICE,
vsg->pages);

IOW, do you really need a warranty that ->mmap_sem will be grabbed and
released?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/dri-devel


[RFC 0/6] Module for tracking/accounting shared memory buffers

2016-10-12 Thread Al Viro
On Tue, Oct 11, 2016 at 04:50:04PM -0700, Ruchi Kandoi wrote:

> memtrack maintains a per-process list of shared buffer references, which is
> exported to userspace as /proc/[pid]/memtrack.  Buffers can be optionally
> "tagged" with a short string: for example, Android userspace would use this
> tag to identify whether buffers were allocated on behalf of the camera stack,
> GL, etc.  memtrack also exports the VMAs associated with these buffers so
> that pages already included in the process's mm counters aren't 
> double-counted.
> 
> Shared-buffer allocators can hook into memtrack by embedding
> struct memtrack_buffer in their buffer metadata, calling
> memtrack_buffer_{init,remove} at buffer allocation and free time, and
> memtrack_buffer_{install,uninstall} when a userspace process takes or
> drops a reference to the buffer.  For fd-backed buffers like dma-bufs, hooks 
> in
> fdtable.c and fork.c automatically notify memtrack when references are added 
> or
> removed from a process's fd table.
> 
> This patchstack adds memtrack hooks into dma-buf and ion.  If there's upstream
> interest in memtrack, it can be extended to other memory allocators as well,
> such as GEM implementations.

No, with a side of Hell, No.  Not to mention anything else,
* descriptor tables do not belong to any specific task_struct and
actions done by one show up in all who share that thing.
* shared descriptor table does not imply belonging to the same
group.
* shared descriptor table can become unshared at any point, invisibly
for that Fine Piece Of Software.
* while we are at it, blocking allocation under several spinlocks
(and with interrupts disabled, for good measure) is generally considered
a bloody bad idea.

That - just from the quick look through that patchset.  Bringing task_struct
into the API is already sufficient for a NAK.


Use of copy_from_user in msm_gem_submit.c while holding a spin_lock

2016-08-17 Thread Al Viro
On Wed, Aug 17, 2016 at 03:24:38PM -0400, Rob Clark wrote:

> hmm, looks like, at least on arm (not sure about arm64),
> 
> #define __copy_from_user_inatomic __copy_from_user
> 
> ie. copy_from_user() minus the access_ok() and memset in the
> !access_ok() path.. but maybe what I want is just the
> pagefault_disable() if that disables copy_from_user() being able to
> block..

On a bunch of platforms copy_from_user() starts with might_sleep(); again,
that'll spread to all of the pretty soon.

Right now those primitives are very badly out of sync; this will change,
but let's not add more PITA sources.


Use of copy_from_user in msm_gem_submit.c while holding a spin_lock

2016-08-17 Thread Al Viro
On Wed, Aug 17, 2016 at 02:49:32PM -0400, Rob Clark wrote:

> I'm not saying that I shouldn't fix it (although not quite sure how
> yet.. taking/dropping the spinlock inside the loop is not a good
> option from a performance standpoint).  What I am saying is that this
> is not something that can happen accidentally (as it could in the case
> of swap).  But I agree that I should fix it somehow to avoid issues
> with an intentionally evil userspace.

I wouldn't count on that not happening by accident.  With zero changes
in mesa itself - it can be as simple as change of allocator in the
bowels of libc or throwing libdmalloc into the link flags, etc.  And most
of the time it would've worked just fine, but the same call in a situation
when most of the memory is occupied by dirty pagecache pages can end up
having to wait for writeback.

> If there is a copy_from_user() variant that will return an error
> instead of blocking, I think that is really what I want so I can
> implement a slow-path that drops the spin-lock temporarily.

*shrug*

pagefault_disable()/pagefault_enable() are there for purpose, so's
__copy_from_user_inatomic()...  Just remember that __copy_from_user_inatomic()
does not check if the addresses are userland ones (i.e. the caller needs
to check access_ok() itself) and it is *NOT* guaranteed to zero what it
hadn't copied over.  Currently it does zero tail on some, but not all
architectures; come next cycle it and it will not do that zeroing on any
of those.


Use of copy_from_user in msm_gem_submit.c while holding a spin_lock

2016-08-17 Thread Al Viro
On Wed, Aug 17, 2016 at 11:08:46AM -0400, Rob Clark wrote:
> On Wed, Aug 17, 2016 at 7:40 AM, Vaishali Thakkar
>  wrote:
> > Hello,
> >
> > I was wondering about the call to copy_from_user in function 
> > submit_lookup_objects for drive
> > /gpu/drm/msm/msm_gem_submit.c  It calls copy_from_user[1] in a spin_lock, 
> > which is not normally
> > allowed, due to the possibility of a deadlock.
> >
> > Is there some reason that I am overlooking why it is OK in this case? Is 
> > there some code in the
> > same file which ensures that page fault will not occur when we are calling 
> > the function holding
> > spin_lock?
> 
> hmm, probably just that it isn't typical to use a swap file on these
> devices (and that lockdep/etc doesn't warn about it)..  I guess we
> probably need some sort of slow-path where we drop the lock and try
> again in case there would be a fault..

Sigh...  Folks, you don't need swap *at* *all* for copy_from_user() to block.
/* get a zero-filled 64K buffer */
addr = mmap(NULL, 65536, PROT_READ | PROT_WRITE,
MAP_ANONYMOUS | MAP_SHARED, -1, 0);
if (addr < 0)
piss off
buffer = (void *)addr;

pass buf to a syscall
and copy_from_user() in that syscall will have to allocate pages (and possibly
page tables as well).  Which can block just fine, no swap involved.  Moreover,
if you modify some parts of the buffer first, you will get the pages containing
those modifications already present, but anything still untouched will
a) act as if it had been zeroed first and
b) possibly block on the first dereference, be it from kernel or from
userland.  Worse yet, there's nothing to stop libc from using the above for
calloc() and its ilk, with your application having no way to tell.  As far
as application is concerned, it has asked a library function to allocate and
zero a piece of memory, got one and yes, it does appear to be properly zeroed.

The bottom line is, copy_from_user() can realistically block, without
anything fishy going on in the userland setup.


[PATCH] fs: export simple_dname for drm drivers.

2014-01-20 Thread Al Viro
On Mon, Jan 20, 2014 at 10:54:44AM +1000, Dave Airlie wrote:
> David Herrmann's changes to use a pseudo filesystem for drm's shared
> inodes requires this be exported for drm to build as a module.
> 
> I'd like to merge this via the drm tree, so please ack.

Having looked through these patches...  The problem is that you'll get
your module impossible to unload.  Sane solution:

static struct vfsmount *drm_mnt;
static int count;
struct inode *drm_alloc_inode(void)
{
struct inode *res;
int err = simple_pin_fs(_fs, _mnt, );
if (err)
return ERR_PTR(err);
res = alloc_anon_inode(drm_mnt->mnt_sb);
if (IS_ERR(res))
simple_release_fs(_mnt, );
return res;
}

/* call from drm_free_dev() */
void drm_put_anon_inode(struct drm_device *dev)
{
iput(dev->anon_inode);
simple_release_fs(_mnt, );
}

and lose those kern_mount/kern_umount.


[PATCH] hfsplus: Remove hfsplus_file_lookup

2013-12-11 Thread Al Viro
On Wed, Dec 11, 2013 at 10:49:29PM +0300, Vyacheslav Dubeyko wrote:

> This feature worked earlier under Linux. So, I suppose that some changes in 
> HFS+ driver
> or in VFS broke it. And it needs to investigate and fix the reported issue. 
> Thank you for the
> report.

This "feature" is severely broken and yes, outright removal is what I'd
suggest for a fix.  HFS+ allows hardlinks to files, which means that
you allow multiple dentries for the same inode with ->lookup() in it,
which is asking for deadlocks.

This is fundamentally not supported.  Considering that forks are lousy
idea in the first place, I'd seriously suggest to remove that idiocy for
good.


Re: [PATCH v2 1/2] anon_inodes: allow external inode allocations

2013-10-10 Thread Al Viro
On Wed, Oct 02, 2013 at 01:24:25PM +0200, David Herrmann wrote:
 DRM core shares a single address_space across all inodes that belong to
 the same DRM device. This allows efficient unmapping of user-space
 mappings during buffer eviction. However, there is no easy way to get a
 shared address_space for DRM devices during initialization. Therefore, we
 currently delay this step until the first -open() and save the given
 inode for future use.
 
 This causes ugly delayed initialization throughout the DRM core. TTM
 devices end up without a dev_mapping pointer and we have to carefully
 respect any underlying filesystem implementation so we don't corrupt the
 inode-i_mapping and inode-i_data fields.
 
 We can avoid this if we were allowed to allocate an anonymous inode for
 each DRM device. We only have to set file-f_mapping during -open()
 and no longer need to adjust inode mappings. As fs/anon_inodes.c already
 provides a minimal internal FS mount, we extend it to also provide
 anonymous inodes for use in drivers like DRM.
 
 Signed-off-by: David Herrmann dh.herrm...@gmail.com
 Wanted-by: Daniel Vetter daniel.vet...@ffwll.ch
Told-to-stop-abusing-anon_inodes-by: Al Viro v...@zeniv.linux.org.uk

Note that anon_inode_getfile_private() is not going to survive - for exact
same reasons.  It's already removed in vfs.git#experimental, shortly in
#for-next as well.  Do NOT extend fs/anon_inodes.c; if you need a minimal
internal fs mount, just do what fs/aio.c does in the same branch.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC] deadlock in drm/exynos: fix wrong pointer access at vm close

2013-09-26 Thread Al Viro
On Tue, Sep 24, 2013 at 01:41:00PM +0900, Inki Dae wrote:

 I can't see to hold -mmap_sem when it calls find_vma() anywhere else.

Er...  What, in your opinion, would protect the result of find_vma(), if
not that?  E.g. if another thread does munmap() on that area...  vma isn't
refcounted; there are only two things that can prevent its freeing -
mmap_sem being held and the lack of anybody else seeing the address of
mm_struct it belongs to (basically, when we are killing mm_struct off
or when we are populating a fresh mm_struct; in the latter case the
parent is locked, but the child doesn't need to).

Look at page fault handlers - they grab mmap_sem (shared) before looking for
vma.  And anything modifying the list of vmas (vm_mmap(), etc.) grabs it
exclusive...

  to caller) and clones it manually, regardless of whether that vma allows
  to clone itself or not.  Quite a few drivers rely on that not happening...
  
 
 I think that has no any problem because exynos_gem_get_vma() takes a
 reference to vma, and also v4l2 side is using same way. I and v4l2 guys
 might be missing something what you are concerning. So Could you give us
 more comments?

vb2_get_vma()/vb2_put_userptr() is also buggy.  At the very least, it
should refuse to handle ones with VM_DONTCOPY in flags.  Again, there
are drivers that seriously rely on VM_DONTCOPY being honoured.

BTW, what do you expect exynos_gem_get_pages_from_userptr() to do if
the area has been unmapped in the meanwhile?  Or, for that matter,
if that has been followed by mmap() of something with VM_IO on the
same range of addresses...  -open() does *NOT* prevent any of that.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC] deadlock in drm/exynos: fix wrong pointer access at vm close

2013-09-26 Thread Al Viro
On Wed, Sep 25, 2013 at 01:34:30PM +0900, Inki Dae wrote:

 It seems that we can use a new anon file instead of using drm file to
 resolve the issue.

Could you describe what are you trying to achieve with that ioctl() and
what semantics do you want from normal mmap()?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[RFC] deadlock in drm/exynos: fix wrong pointer access at vm close

2013-09-23 Thread Al Viro
You have drm_dev-struct_mutex grabbed before -mmap_sem in
exynos_drm_gem_mmap_ioctl() and after - in exynos_drm_gem_fault()
(since -fault() is always called with -mmap_sem held).  Looks like
a garden-variety AB-BA deadlock...

Incidentally, what should happen if another process shares the
same opened file (e.g. inherited over fork()) and does mmap() just
as we have -f_op switched?
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [RFC] deadlock in drm/exynos: fix wrong pointer access at vm close

2013-09-23 Thread Al Viro
On Mon, Sep 23, 2013 at 04:49:30PM +0900, Inki Dae wrote:
 Hi,
 
  -Original Message-
  From: Al Viro [mailto:v...@ftp.linux.org.uk] On Behalf Of Al Viro
  Sent: Monday, September 23, 2013 6:29 AM
  To: YoungJun Cho
  Cc: dri-devel@lists.freedesktop.org; Inki Dae
  Subject: [RFC] deadlock in drm/exynos: fix wrong pointer access at vm
  close
  
  You have drm_dev-struct_mutex grabbed before -mmap_sem in
  exynos_drm_gem_mmap_ioctl() and after - in exynos_drm_gem_fault()
  (since -fault() is always called with -mmap_sem held).  Looks like
  a garden-variety AB-BA deadlock...
  
 
 Right, if mmap system call is requested by another process between -f_op
 pointer changing and restoring, the deadlock can be incurred.
 
 For this, I think we can resolve this issue like below,
 
 At exynos_drm_gem_mmap_ioctl()
   down_write(mm-mmap_sem);
   mutex_lock(dev-struct_mutex);
   ...

Umm...  If you do it that way, why bother with changing -private_data
at all?  You can as well stash obj in dev-dev_private-something after
you've grabbed the mutex and have -mmap() pick it there...

Said that, I really don't like that approach - both playing with -f_op
and the games with private vmas; exynos_gem_get_vma(), AFAICS, calls
find_vma() (without bothering to hold -mmap_sem, BTW - there's nothing
to prevent the result of find_vma() being freed just as it's returned
to caller) and clones it manually, regardless of whether that vma allows
to clone itself or not.  Quite a few drivers rely on that not happening...

IOW, you are already digging deep inside VM guts and this only makes
it deeper ;-/

And no, the deadlock doesn't depend on race between ioctl() and mmap()
from another process; all it takes is
1) thread A does clone(), creating thread B that shares address space with
it.
2) thread A does that ioctl, creating a mapping
3) thread A does that ioctl again, creating another mapping, while thread
B dereferences an address in the first mapping and triggers a page fault.

The only race is on step 3 in the above; the question about mmap() vs.
ioctl() was about mmap(2) _during_ that ioctl() hitting
exynos_drm_gem_mmap_buffer() instead of exynos_drm_gem_mmap() it would've
called before or after ioctl().  Here the interesting case is when
callers of mmap() and ioctl() do *not* share the address space, since
in that case mmap(2) won't notice -mmap_sem held by you - it's on the
different mm_struct, so mmap(2) will get to calling the -f_op-mmap()
without waiting for you to restore -f_op...

For another bug in the same area, try building that driver modular and
watch what happens to use count after a round of this switching -f_op and
restoring it back to original; fops_get() in there is wrong and AFAICS
pointless.
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


[git pull] drm fixes

2010-06-08 Thread Al Viro
On Mon, Jun 07, 2010 at 10:39:28PM +0100, David Woodhouse wrote:
> On Mon, 2010-06-07 at 14:17 -0700, Linus Torvalds wrote:
> > and that changelog doesn't really explain it either ("fix leak"? Ok, I can 
> > see the iput() fixing the leak - but you also did that jffs2_clear_inode() 
> > change, and that has no explanation what-so-ever.
> 
> jffs2_clear_inode() is the file system's ->clear_inode method, so it
> gets called from the VFS when the inode is destroyed, after iput().
> 
> I suppose that ought to have been a clue, right from the very beginning,
> that we should never have been calling it directly on our error paths.

Yep.  The other place that directly called its ->clear_inode() also had
been bogus, BTW - logfs had been playing rather sick games with special
inodes and ended up open-coding just about everything on new_inode/iput
paths for those.  They needed that stuff evicted after all normal inodes,
but before the second call of invalidate_inodes() would scream about
surviving busy inodes.  I.e. that should've been happening in ->put_super();
no need to deal with handcrafted inodes that would sit outside of inode
list...


[git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 02:17:23PM -0700, Linus Torvalds wrote:
>   jffs2_clear_inode(inode);
> 
> into
> 
>   make_bad_inode(inode);
>   iput(inode);
> 
> and that changelog doesn't really explain it either ("fix leak"? Ok, I can 
> see the iput() fixing the leak - but you also did that jffs2_clear_inode() 
> change, and that has no explanation what-so-ever.

The final iput() calls ->clear_inode() (jffs2_clear_inode in case of jffs2)
and the inode has just been created, with no other in-core references
existing.  Basically, that call was the only part of (required) iput() that
_was_ done there ;-)

FWIW, what's happening around ->clear_inode()/->delete_inode()/->drop_inode()
is a mess.  This leak got found when I'd been looking through that crap;
results of sanitizing are in #evict_inode (vfs-2.6.git).  I'm going to shift
that into for-next tomorrow, assuming it survives local beating.  For now
I've just pulled jffs2-fixes in it...


[git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 08:32:45PM +0100, David Woodhouse wrote:
> The fix is fairly trivial. There's a "big" patch to fs/jffs2/dir.c which
> accounts for the bulk of my pull request, but if you look harder you'll
> see it's mostly just a bunch of removing 'return ret;' and adding 
> 'goto fail;' so the error cleanup happens properly.
> 
> Al pointed out a second problem at the same time, fixed by commit
> e72e6497 in the tree I asked you to pull. That involved adding an
> unlock_new_inode() to the same error paths that the first patch used.
> 
> Between the two bugs, I figured it was worth pushing the fixes for
> 2.6.35.
> 
> The third jffs2 patch in that tree is a fix for ctime semantics which is
> a two-liner. Again not a regression but worth fixing, and -stable
> fodder.
> 
> Al also pointed out that I could use iget_failed(), but I figured that
> cleanup could wait for 2.6.36.

BTW, if you put jffs2 stuff into a separate queue, I can just pull it
(and add iget_failed() conversion on top of that).  Not a problem...


[git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 11:53:28AM -0700, Linus Torvalds wrote:
> 
> 
> On Mon, 7 Jun 2010, Al Viro wrote:
> > 
> > Ho-hum...  Speaking of which, what about leak fixes?  There's a 
> > long-standing
> > in-core inode leak in jffs2; basically, if you fail directory modification
> > in symlink() et.al., you get a leaked inode and whinge at umount.  Found
> > after -rc1, had been there since all the way back (similar bug in creat()
> > had been fixed in 2003, mkdir()/mknod()/symlink() were not).  Fix sits in
> > jffs2-fixes now...
> 
> I think a leak that is trivial easily falls under "security issue" as a 
> potential DoS issue.
> 
> On the other hand, if it's not trivially fixed (say it needs big 
> re-organizing of some locking or refcounting or whatever), and it's a 
> really slow leak of a pretty small data structure, and is not triggered by 
> normal users (say, you need to mount a filesystem or it needs some very 
> specific timing), I think it falls under "we haven't seen in the previous 
> five years, we might as well make sure the fix is tested in the next merge 
> window".

You need something like IO errors or device being full to trigger it.
As for the fix, it's basically a matter of "set i_nlink to 0 and iput()
instead of manual jffs2_clear_inode(); sure, you want to kill the on-disk
inode, but you want in-core one gone too".

Basically, that's what all local filesystems are doing to clean up after
such error and that's what jffs2 is doing for ->create().

As for the other stuff in that tree...  There's a fix for nfsd/create race
(rather narrow and not trivial to hit, but capable of fs corruption) and
there's mtd stuff I've no fscking clue about.

If not for the mtd part I'd simply pulled it in my tree.  As it is...  I
still can do that (done that for current semi-private branch), but I'd
prefer to avoid feeding mtd stuff through vfs tree, for all the obvious
reasons.


[git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 11:00:51AM -0700, Linus Torvalds wrote:
> So please. Just make me a tree that has regression fixes _only_. I'm not 
> AT ALL interested in "it is useful to report the gpu temp". If it was so 
> useful, and if it was ready before the merge window, it should hav gone in 
> then. That clearly wasn't the case, so it's not going in now either.

Ho-hum...  Speaking of which, what about leak fixes?  There's a long-standing
in-core inode leak in jffs2; basically, if you fail directory modification
in symlink() et.al., you get a leaked inode and whinge at umount.  Found
after -rc1, had been there since all the way back (similar bug in creat()
had been fixed in 2003, mkdir()/mknod()/symlink() were not).  Fix sits in
jffs2-fixes now...

I can simply pull jffs2-fixes into vfs for-next (I need it in there for
->evict_inode() series), but I'd obviously prefer to just rebase it after
it gets into mainline.


Re: [git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 08:32:45PM +0100, David Woodhouse wrote:
 The fix is fairly trivial. There's a big patch to fs/jffs2/dir.c which
 accounts for the bulk of my pull request, but if you look harder you'll
 see it's mostly just a bunch of removing 'return ret;' and adding 
 'goto fail;' so the error cleanup happens properly.
 
 Al pointed out a second problem at the same time, fixed by commit
 e72e6497 in the tree I asked you to pull. That involved adding an
 unlock_new_inode() to the same error paths that the first patch used.
 
 Between the two bugs, I figured it was worth pushing the fixes for
 2.6.35.
 
 The third jffs2 patch in that tree is a fix for ctime semantics which is
 a two-liner. Again not a regression but worth fixing, and -stable
 fodder.
 
 Al also pointed out that I could use iget_failed(), but I figured that
 cleanup could wait for 2.6.36.

BTW, if you put jffs2 stuff into a separate queue, I can just pull it
(and add iget_failed() conversion on top of that).  Not a problem...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel


Re: [git pull] drm fixes

2010-06-07 Thread Al Viro
On Mon, Jun 07, 2010 at 10:39:28PM +0100, David Woodhouse wrote:
 On Mon, 2010-06-07 at 14:17 -0700, Linus Torvalds wrote:
  and that changelog doesn't really explain it either (fix leak? Ok, I can 
  see the iput() fixing the leak - but you also did that jffs2_clear_inode() 
  change, and that has no explanation what-so-ever.
 
 jffs2_clear_inode() is the file system's -clear_inode method, so it
 gets called from the VFS when the inode is destroyed, after iput().
 
 I suppose that ought to have been a clue, right from the very beginning,
 that we should never have been calling it directly on our error paths.

Yep.  The other place that directly called its -clear_inode() also had
been bogus, BTW - logfs had been playing rather sick games with special
inodes and ended up open-coding just about everything on new_inode/iput
paths for those.  They needed that stuff evicted after all normal inodes,
but before the second call of invalidate_inodes() would scream about
surviving busy inodes.  I.e. that should've been happening in -put_super();
no need to deal with handcrafted inodes that would sit outside of inode
list...
___
dri-devel mailing list
dri-devel@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/dri-devel