Re: [GIT PULL] remoteproc updates for v6.9

2024-03-21 Thread Linus Torvalds
On Thu, 21 Mar 2024 at 11:03, Bjorn Andersson  wrote:
>
> I was further notified that this conflicts with your tree, Linus. Below
> is the resolution for this conflict.

Heh. This email came in after the pr-tracker-bot email notifying you
that it's already done..

I think I got it all right, it didn't seem at all controversial, but
maybe you should double-check.

  Linus



Re: [GIT PULL] virtio: features, fixes

2024-03-19 Thread Linus Torvalds
On Tue, 19 Mar 2024 at 00:41, Michael S. Tsirkin  wrote:
>
> virtio: features, fixes
>
> Per vq sizes in vdpa.
> Info query for block devices support in vdpa.
> DMA sync callbacks in vduse.
>
> Fixes, cleanups.

Grr. I thought the merge message was a bit too terse, but I let it slide.

But only after pushing it out do I notice that not only was the pull
request message overly terse, you had also rebased this all just
moments before sending the pull request and didn't even give a hit of
a reason for that.

So I missed that, and the merge is out now, but this was NOT OK.

Yes, rebasing happens. But last-minute rebasing needs to be explained,
not some kind of nasty surprise after-the-fact.

And that pull request explanation was really borderline even *without*
that issue.

Linus



Re: [PATCH 0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-08 Thread Linus Torvalds
On Fri, 8 Mar 2024 at 13:39, Linus Torvalds
 wrote:
>
> So the above "complexity" is *literally* just changing the
>
>   (new = atomic_read_acquire(>seq)) != old
>
> condition to
>
>   should_exit ||
>   (new = atomic_read_acquire(>seq)) != old

.. and obviously you'll need to add the exit condition to the actual
"deal with events" loop too.

Linus



Re: [PATCH 0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-08 Thread Linus Torvalds
On Fri, 8 Mar 2024 at 13:33, Steven Rostedt  wrote:
>
> There's two layers:
>
> 1) the ring buffer has the above simple producer / consumer.
>Where the wake ups can happen at the point of where the buffer has
>the amount filled that the consumer wants to start consuming with.
>
> 2) The tracing layer; Here on close of a file, the consumers need to be
>woken up and not wait again. And just take whatever was there to finish
>reading.
>
>There's also another case that the ioctl() just kicks the current
>readers out, but doesn't care about new readers.

But that's the beauty of just using the wait_event() model.

Just add that "exit" condition to the condition.

So the above "complexity" is *literally* just changing the

  (new = atomic_read_acquire(>seq)) != old

condition to

  should_exit ||
  (new = atomic_read_acquire(>seq)) != old

(replace "should_exit" with whatever that condition is, of course) and
the wait_event() logic will take care of the rest.

 Linus



Re: [PATCH 0/6] tracing/ring-buffer: Fix wakeup of ring buffer waiters

2024-03-08 Thread Linus Torvalds
On Fri, 8 Mar 2024 at 10:38, Steven Rostedt  wrote:
>
> A patch was sent to "fix" the wait_index variable that is used to help with
> waking of waiters on the ring buffer. The patch was rejected, but I started
> looking at associated code. Discussing it on IRC with Mathieu Desnoyers
> we discovered a design flaw.

Honestly, all of this seems excessively complicated.

And your new locking shouldn't be necessary if you just do things much
more simply.

Here's what I *think* you should do:

  struct xyz {
...
atomic_t seq;
struct wait_queue_head seq_wait;
...
  };

with the consumer doing something very simple like this:

int seq = atomic_read_acquire(>seq);
for (;;) {
.. consume outstanding events ..
seq = wait_for_seq_change(seq, my);
}

and the producer being similarly trivial, just having a
"add_seq_event()" at the end:

... add whatever event ..
add_seq_event(my);

And the helper functions for this are really darn simple:

  static inline int wait_for_seq_change(int old, struct xyz *my)
  {
int new;
wait_event(my->seq_wait,
(new = atomic_read_acquire(>seq)) != old);
return new;
  }

  static inline void add_seq_event(struct xyz *my)
  {
atomic_fetch_inc_release(>seq);
wake_up(>seq_wait);
  }

Note how you don't need any new locks, and note how "wait_event()"
will do all the required optimistic stuff for you (ie it will check
that "has seq changed" before even bothering to add itself to the wait
queue etc).

So the above is not only short and sweet, it generates fairly good
code too, and doesn't it look really simple and fairly understandable?

And - AS ALWAYS - the above isn't actually tested in any way, shape or form.

 Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-31 Thread Linus Torvalds
On Wed, 31 Jan 2024 at 07:58, Steven Rostedt  wrote:
>
> BTW, I ran my full test suite on your patches with the below updates and it
> all passed.

Those patch updates all look sane to me.

> I can break up and clean up the patches so that they are bisectable, and if
> that passes the bisectable portion of my tests, I can still send them to
> you for 6.8.

Ack. That series you posted looks fine. I didn't do any actual testing
or applying the patches, just looking at them.

The one thing I noticed is that the 'llist' removal still needs to be
done. The logical point is that "[PATCH v2 7/7]" where the
eventfs_workfn stuff is ripped out.

And the 'rcu' head should now be a union with something that is no
longer used after the last kref. The only thing that *is* used after
the last kref is the "is_freed" bit, so there's lots of choice. Using
the 'struct list_head listl' that is used for the child list would
seem to be the obvious choice, but it could be anything (including all
of the beginning of that eventfs_inode, but then you would need to
group that as another nested unnamed struct, so picking a "big enough"
entry like 'list' makes it syntactically simpler.

   Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-31 Thread Linus Torvalds
On Wed, 31 Jan 2024 at 05:14, Steven Rostedt  wrote:
>
> If you also notice, tracefs only allows mkdir/rmdir to be assigned to
> one directory. Once it is assigned, no other directories can have mkdir
> rmdir functionality.

I think that does limit the damage, but it's not clear that it is actually safe.

Because you don't hold the inode lock, somebody could come in and do a
mkdir inside the other one that is being removed, ie something like

 - thread 1 does took the inode lock, called ->rmdir

 - it then drops the inode lock (both parent and the directory that is
getting removed) and gets the event lock

 - but thread 2 can come in between that inode lock drop and event lock

Notice: thread 2 comes in where there is *no* locking. Nada. Zilch.

This is what worries me.

But it does help that it's all only in *one* directory.  At least
another mkdir cannot happen *inside* the one that is going away while
the locks are not held. So the good news is that it does mean that
there's only one point that is being protected.

But I do worry about things like this (in vfs_rmdir()):

inode_lock(dentry->d_inode);

error = -EBUSY;
if (is_local_mountpoint(dentry) ||
(dentry->d_inode->i_flags & S_KERNEL_FILE))
goto out;

error = security_inode_rmdir(dir, dentry);
if (error)
goto out;

error = dir->i_op->rmdir(dir, dentry);
if (error)
goto out;

notice how it does that "is this a mount point" test atomically wrt
the rmdir before it is allowed to proceed.

And I do think that the inode lock is what also protects it from
concurrent mounts. So now what happens when that "thread 2" above
comes in while there is *no* locking, and mounts something there?

Now, I'm not saying this is a huge problem. But it's very much an
example of a thing that *could* be a problem. Dropping locks in the
middle is just very scary.

   Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 21:57, Linus Torvalds
 wrote:
>
> Ugh.

Oh, and double-ugh on that tracefs_syscall_mkdir/rmdir(). I hate how
it does that "unlock and re-lock inode" thing.

It's a disease, and no, it's not an acceptable response to "lockdep
shows there's a locking problem".

The comment says "It is up to the individual mkdir routine to handle
races" but that's *completely* bogus and shows a fundamental
misunderstanding of locking.

Dropping the lock in the middle of a locked section does NOT affect
just the section that you dropped the lock around.

It affects the *caller*, who took the lock in the first place, and who
may well assume that the lock protects things for the whole duration
of the lock.

And so dropping the lock in the middle of an operation is a bad BAD
*BAD* thing to do.

Honestly, I had only been looking at the eventfs_inode.c lookup code.
But now that I started looking at mkdir/rmdir, I'm seeing the same
signs of horrible mistakes there too.

And yes, it might be a real problem. For example, for the rmdir case,
the actual lock was taken by 'vfs_rmdir()', and it does *not* protect
only the ->rmdir() call itself.

It also, for example, is supposed to make the ->rmdir be atomic wrt things like

dentry->d_inode->i_flags |= S_DEAD;

and

dont_mount(dentry);
detach_mounts(dentry);

but now because the inode lock was dropped in the middle, for all we
know a "mkdir()" could come in on the same name, and make a mockery of
the whole inode locking.

The inode lock on that directory that is getting removed is also what
is supposed to make it impossible to add new files to the directory
while the rmdir is going on.

Again, dropping the lock violates those kinds of guarantees.

"git blame" actually fingers Al for that "unlock and relock", but
that's because Al did a search-and-replace on
"mutex_[un]lock(>i_mutex)" and replaced it with
"inode_[un]lock(inode)" back in 2016.

The real culprit is you, and that sh*t-show goes back to commit
277ba04461c2 ("tracing: Add interface to allow multiple trace
buffers").

Christ. Now I guess I need to start looking if there is anything else
horrid lurking in that mkdir/rmdir path.

I doubt the inode locking problem ends up mattering in this situation.
Mostly since this is only tracefs, and normal users shouldn't be able
to access these things anyway. And I think (hope?) that you only
accept mkdir/rmdir in specific places.

But this really is another case of "This code smells *fundamentally* wrong".

Adding Al, in the hopes that he will take a look at this too.

Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 21:33, Steven Rostedt  wrote:
>
> With even the last patch included, without the d_invalidate() I get errors
> with simply doing:
>
>  # cd /sys/kernel/tracing
>  # mkdir instances/foo
>  # ls instances/foo/events
>  # rmdir instances/foo
>
> As the rmdir calls tracefs_remove() that calls simple_recursive_removal()
> that then walks into the "events" directory. Without that d_invalidate, it
> walks beyond just the top directory and then splats on the dentries that
> are cached.

Ugh.

This is only an issue for that "events" dir, right? The one that has a
proper refcount on the dentry in 'ei->events_dir'?

Because yes, then doing d_invalidate() looks like the right thing.

   Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 21:09, Steven Rostedt  wrote:
>
> I would think that the last "put" would always have the "is_freed" set. The
> WARN_ON is to catch things doing too many put_ei().

Yes, that looks sane to me.

> > + simple_recursive_removal(dentry, NULL);
>
> Actually, this doesn't work.

Yes, note how the next patch just removed it entirely.


> Does this work:
>
> d_invalidate(dentry);

It does, but it's basically irrelevant with the d_revalidate approach.

Basically, once you have d_revalidate(), the unhashing happens there,
and it's just extra work and pointless to do it elsewhere.

So if you look at the "clean up dentry ops and add revalidate
function" patch, you'll see that it just does

-   simple_recursive_removal(dentry, NULL);

and the thing is just history.

So really, that final patch is the one that fixes the whole eventfs
mess for good (knock wood). But you can't do it first, because it
basically depends on all the refcount fixes.

It might be possible to re-organize the patches so that the refcount
changes go first, then the d_revalidate(), and then the rest. But I
suspect they all really end up depending on each other some way,
because the basic issue was that the whole "keep unrefcounted dentry
pointers around" was just wrong.  So it doesn't end up right until
it's _all_ fixed, because every step of the way exposes some problem.

At least that was my experience. Fix one thing, and it exposes the
hack that another thing depended on.

This is actually something that Al is a master at. You sometimes see
him send one big complicated patch where he talks about all the
problems in some area and it's one huge "fix up everything patch" that
looks very scary.

And then a week later he sends a series of 19 patches that all make
sense and all look "obvious" and all make small progress.

And magically they end up matching that big cleanup patch in the end.
And you just *know* that it didn't start out as that beautiful logical
series, because you saw the big messy patch first...

Linus



Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 18:46, Al Viro  wrote:
>
> What's to stop ->d_revalidate() from being called in parallel with
> __dentry_kill()?

Oh, you're right.

For some reason I thought we did the d_release() _after_ the RCU grace
period, but we don't.

Why don't we, btw? It would be so much better if we did the
d_release() from __d_free(). We have all that smarts in fs/dcache.c to
decide if we need to RCU-delay it or not, and then we don't let
filesystems use it.

I assume the reason is that some 'd_delete' cases might want to sleep
etc. Still, for things like this that just want to release memory, it
would be *much* better to have d_release called when the dentry is
actually released.

Hmm. Not very many d_delete cases, but I did see a couple that
definitely want process context (dma_buf_release goes to things that
do vfree() etc).

So I guess the "make low-level filesystems do their own kfree_rcu() is
what we're doing.

In this case it's as simple as doing that

-   kfree(ei);
+   kfree_rcu(ei, rcu);

and we'd just make the rcu entry a union with something that isn't
that 'is_freed' field so that it doesn't take more space.

 Linus



Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 17:12, Al Viro  wrote:
>
> > + *
> > + * Note that d_revalidate is called potentially under RCU,
> > + * so it can't take the eventfs mutex etc. It's fine - if
> > + * we open a file just as it's marked dead, things will
> > + * still work just fine, and just see the old stale case.
>
> Looks like use after free, unless freeing ei is RCU-delayed...

We hold the ref to the ei in the very dentry that is doing d_revalidate().

So it should be fine. The race is with eventfs marking the ei
'is_freed' (under the mutex that we don't hold here), but when that
happens and we end up still using the dentry, the ei is still there,
all the operations are just going to fail.

 Linus



Re: [PATCH 3/6] tracefs: dentry lookup crapectomy

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 16:23, Al Viro  wrote:
>
> On Tue, Jan 30, 2024 at 11:03:52AM -0800, Linus Torvalds wrote:
> >  {
> >   struct eventfs_inode *ei;
> >
> > - mutex_lock(_mutex);
> >   do {
> >   // The parent is stable because we do not do renames
> >   dentry = dentry->d_parent;
> > @@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
> > dentry *dentry)
> >   }
> >   // Walk upwards until you find the events inode
> >   } while (!ei->is_events);
> > - mutex_unlock(_mutex);
>
> Unless I'm missing something, you've just lost exclusion with
> removals (not that the original hadn't been suspicious in that
> respect - what's to protect ei past that mutex_unlock?

No, the mutex is actually taken up the call chain, and removing it
here is fixing a deadlock.

Ie we have the mutex taken in eventfs_root_lookup(), and then that
goes either through lookup_dir_entry() or lookup_file_dentry(), before
it gets to update_inode_attr().

That series is not very clean, I'm afraid.

  Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 15:10, Steven Rostedt  wrote:
> >
> > At which point a name pointer would *just* fit in 96 bytes.
>
> Does that mean I should keep the kstrdup_const()?

You should check that my math matches reality and relevant
configurations, but yes, at 96 bytes that should fit exactly in one
slab entry on both x86-64 and arm64 (now that arm finally does sane
kmalloc sizes - for the longest time they were all 128-byte aligned
due to historical horrible DMA coherency issues).

But you might also add a comment to the eventfs_inode definition,
because if that ever changes, the math changes again. For example,
adding just one more attribute data would make it all fall apart.

If you *really* want to optimize that data structure, I think you can
make the child list be a 'hlist'. That makes the list head ('children'
becomes a 'hlist_head') smaller, even if the list entry ('list'
becomes a 'hlist_node') stays the same size.

 Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 14:56, Linus Torvalds
 wrote:
>
> With that, the base size of 'struct eventfs_inode' actually becomes 96
> bytes for me.

It can be shrunk some more.

The field ordering is suboptimal. Pointers are 8 bytes and 8-byte
aligned, but 'struct kref' is just 4 bytes, and 'struct eventfs_attr'
is 12 bytes and 4-byte aligned.

So if you pack all the 8-byte-aligned fields at the beginning, and the
4-byte-aligned ones at the end, you get 88 bytes.

At which point a name pointer would *just* fit in 96 bytes.

...  and then some debug option is enabled, and it all goes to hell again.

  Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 14:55, Steven Rostedt  wrote:
>
> Remember, the files don't have an ei allocated. Only directories.

Crossed emails.

> I then counted the length of the string - 7 bytes (which is the same as the
> length of the string plus the '\0' character - 8 bytes) to accommodate the
> pointer size, and it's a savings of 22KB per instance. And in actuality, I
> should have rounded to the nearest kmalloc slab size as kzalloc() isn't going 
> to
> return 35 bytes back but 64 bytes will be allocated.

No. See my other email. The kmalloc sizes actually means that it comes
out exactly even.

 Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 13:52, Linus Torvalds
 wrote:
>
> Btw, you can look at name lengths in tracefs with something stupid like this:
>
> find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c

Actually, since only directories have these 'ei' fields, that find
should have "-type d", and then the stats change.

Most directory filenames there are indeed longer than 8 bytes (16
bytes seems to be the most common length).

That means that whether it wins or loses in allocation size tends to
be mostly about the kmalloc sizes and the size of that structure.

And it turns out that the real memory savings there would be this patch

  --- a/fs/tracefs/internal.h
  +++ b/fs/tracefs/internal.h
  @@ -55,15 +55,6 @@ struct eventfs_inode {
unsigned intis_events:1;
unsigned intnr_entries:30;
unsigned intino;
  - /*
  -  * Union - used for deletion
  -  * @llist:  for calling dput() if needed after RCU
  -  * @rcu:eventfs_inode to delete in RCU
  -  */
  - union {
  - struct llist_node   llist;
  - struct rcu_head rcu;
  - };
const char name[];
   };

since with all the other cleanups, neither of those fields are actually used.

With that, the base size of 'struct eventfs_inode' actually becomes 96
bytes for me.

And at least on x86-64, the kmalloc sizes are 96 and then 128 bytes.

But that means that

 - with the added name pointer, the eventfs_inode would grow to 104
bytes, and grow to that next allocation size (128)

 - with the embedded name, the size is 96+strlen+1, and will also need
at least a 128 byte allocation, but any name that is shorter than 32
bytes is free

so it ends up being a wash. You're going to allocate 128 bytes
regardless. But the deallocation is simpler.

  Linus



Re: [PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 12:55, Steven Rostedt  wrote:
>
> I'm going to be putting back the ei->name pointer as the above actually
> adds more memory usage.

I did it mainly because I hate having multiple different allocation
sites that then have to do that kref_init() etc individually, and once
there was a single site the "name" thing really looked lik ean obvious
simplification.

That said, I think you're confused about the memory usage.

Sure, 'kstrdup_const()' optimizes away the allocation for static
constant strings, but what it does *not* do is to optimize away the
pointer.

In contrast, allocating them together gets rid of the pointer itself,
because now the name is just an offset in the structure.

And the pointer is 8 bytes right there.

So allocating the string _with_ the ei will always save at least 8 bytes.

So whenever the string is less than that in length it's *always* a win.

And even when it's not an absolute win, it will take advantage of the
kmalloc() quantized sizes too, and generally not be a loss even with
longer names.

So I'm pretty sure you are simply entirely wrong on the memory usage.
Not counting the size of the pointer is overlooking a big piece of the
puzzle.

Btw, you can look at name lengths in tracefs with something stupid like this:

find . | sed 's:^.*/::' | tr -c '\n' . | sort | uniq -c

and you will see that (at least in my case) names of lengths 1..7 are
dominating it all:

  1 .
   2189 ..
 34 ...
   2229 
207 .
   6833 ..
   2211 ...

with the rest being insignificant in comparison.

The reason? There's a *lot* of those 'filter' and 'enable' and 'id'
files. All of which are better off without a 'const char *name' taking
8 bytes.

  Linus



Re: [PATCH 6/6] eventfs: clean up dentry ops and add revalidate function

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 13:25, Steven Rostedt  wrote:
>
> I actually find the dentry's sticking around when their ref counts go to
> zero a feature. Tracing applications will open and close the eventfs files
> many times. If the dentry were to be deleted on every close, it would need
> to be create over again in short spurts.

Sure. I think this is literally a tuning thing, and we'll see if
anybody ever says "the dentry cache grows too much with tracefs".

   Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 11:37, Steven Rostedt  wrote:
>
> Do you want me to put this in my urgent branch and mark them for stable,
> and then send them for 6.8?

Hmm. I think the only one that fixes a _reported_ bug is that [PTCH
2/6]. And it turns out that while 'ti->private' really is entirely
uninitialized (and I still think it's the cause of the kernel test
robot report that started this thread), the ti->flags field _is_
initialized to zero in tracefs_alloc_inode().

So even in that [PATCH 2/6], these parts:

  - ti->flags |= TRACEFS_EVENT_INODE;
  + ti->flags = TRACEFS_EVENT_INODE;

aren't strictly needed (but aren't wrong either).

The 'make sure to initialize ti->private before exposing the dentry'
part *definitely* needs to be part of 6.8, though. That has an
outstanding actually triggered bug report on it.

I do think that tracefs_alloc_inode() should also initialize
ti->private to NULL, but while that would fix the oops that the test
robot reported, it wouldn't fix the data-race on any ti->private
accesses.

So that "ti->private = ei" needs to be done before the d_instantiate()
(that later became a d_add()) regardless. But not having random fields
left uninitialized for future subtle bugs would be a good idea too.

Anyway.

If you do run the full tracefs tests on the whole series, and there
are no other major problems, I'll happily take it all for 6.8. And
yes, even mark it for stable. I think the other bugs are much harder
to hit, but I do think they exist. And code deletion is always good.

So give it the full test attention, and *if* it all still looks good
and there are no new subtle issues that crop up, let's just put this
saga behind us asap.

   Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 10:23, Steven Rostedt  wrote:
>
> I know you don't send patches inlined anymore, which is a shame, because
> patchwork takes care of all the administering when patches are inlined, and
> I don't miss patches like I use to.

I just sent the whole series as individual patches., and doing a

  b4 am 20240130190355.11486-1-torva...@linux-foundation.org

should get them all (or if you use patchwork, do that)

I don't normally do these inlined patches any more, because honestly,
99% of all patches I do end up being "I can't test this very well, I
think you should do something like this".

In fact, for simple ones where I might not have even compile-tested
them, much less booted into a kernel with them, I will actively
whitespace-corrupt the patch, just to make sure they aren't applied as
any kind of real workflow - they are almost always meant as a "I think
you should do this, and take credit for it all".

And so when I'm working on a series like this, I'll send attachments
just because it's easier, and because I don't want to patch-bomb
people with some kind of crazy work-in-progress thing.

But I'm reasonably comfortable with this series now, so I sent it as a
"real" patch series. I like it partly because it just removes a lot of
lines:

 3 files changed, 160 insertions(+), 433 deletions(-)

but mostly because the lines it removes are what I consider actively
broken code. So it's not just getting rid of LOC, it's getting rid of
complexity (and bugs) IMHO.

That said, I also don't think that patch series is any kind of
"final". I didn't fix up the readdir iterator locking, for example. I
don't think the SRCU parts are needed at all any more thanks to the
refcounting - the 'ei' is no longer protected by SRCU, it's protected
by virtue of us having the file open (and thus holding the dentry).

So please think of that series not as any kind of final word. More as
a "I strongly believe this is the direction eventfs should go".

I am perfectly ok with you munging those patches and taking full
credit for them, for example.

My "testing" has not involved any real tracing usage, and while I
*have* booted that thing, and have done some very basic smoke-testing
in /sys/kernel/tracing, 99% of the work was literally me just going
through the lookup code, removing everything I found objectionable,
and replacing it with what the VFS layer generally would want.

  Linus



[PATCH 6/6] eventfs: clean up dentry ops and add revalidate function

2024-01-30 Thread Linus Torvalds
In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function.  We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files.  But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching.  We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively.  But that's a separate
issue.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 21 +
 fs/tracefs/inode.c   | 27 ++-
 fs/tracefs/internal.h|  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char 
*name)
 
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an 
eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry 
*dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-   struct eventfs_inode *ei;
-
-   mutex_lock(_mutex);
-   ei = dentry->d_fsdata;
-   if (ei) {
-   dentry->d_fsdata = NULL;
-   put_ei(ei);
-   }
-   mutex_unlock(_mutex);
+   put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
}
 
  enoent:
-   /* Nothing found? */
-   d_add(dentry, NULL);
-   result = NULL;
+   /* Don't cache negative lookups, just return an error */
+   result = ERR_PTR(-ENOENT);
 
  out:
mutex_unlock(_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 * sticks around while the other ei->dentry are created
 * and destroyed dynamically.
 */
-   simple_recursive_removal(dentry, NULL);
dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations 
tracefs_super_operations = {
.show_options   = tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-   struct tracefs_inode *ti;
+   if (dentry->d_fsdata)
+   eventfs_d_release(dentry);
+}
 
-   if (!dentry || !inode)
-   return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+   struct eventfs_inode *ei = dentry->d_fsdata;
 
-   ti = get_tracefs(inode);
-   if (ti && ti->flags & TRACEFS_EVENT_INODE)
-   eventfs_set_ei_status_free(ti, dentry);
-   iput(inode);
+   return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-   .d_iput = tracefs_dentry_iput,
+   .d_revalidate = tracefs_d_revalidate,
+   .d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dentry *eventfs_failed_creating(struct dentry *dentry);
 struct dentry *eventfs_end_creating(struct dentry *dentry);
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry 
*dentry);
+
+void eventfs_d_release(struct dentry *dentry);
 
 #endif /* _TRACEFS_INTERNAL_H */
-- 
2.43.0.5.g38fb137bdb




[PATCH 5/6] eventfs: get rid of dentry pointers without refcounts

2024-01-30 Thread Linus Torvalds
The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer.  That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.

There were two reasonms why eventfs tried to keep a pointer to a dentry:

 - the creation of a 'events' directory would actually have a stable
   dentry pointer that it created with tracefs_start_creating().

   And it needed that dentry when tearing it all down again in
   eventfs_remove_events_dir().

   This use is actually ok, because the special top-level events
   directory dentries are actually stable, not just a temporary cache of
   the eventfs data structures.

 - the 'eventfs_inode' (aka ei) needs to stay around as long as there
   are dentries that refer to it.

   It then used these dentry pointers as a replacement for doing
   reference counting: it would try to make sure that there was only
   ever one dentry associated with an event_inode, and keep a child
   dentry array around to see which dentries might still refer to the
   parent ei.

This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.

The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei.  As
does the directory dentries.  That makes the broken use case go away.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 245 ---
 fs/tracefs/internal.h|   9 +-
 2 files changed, 80 insertions(+), 174 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1d0102bfd7da..a37db0dac302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -62,6 +62,34 @@ enum {
 
 #define EVENTFS_MODE_MASK  (EVENTFS_SAVE_MODE - 1)
 
+/*
+ * eventfs_inode reference count management.
+ *
+ * NOTE! We count only references from dentries, in the
+ * form 'dentry->d_fsdata'. There are also references from
+ * directory inodes ('ti->private'), but the dentry reference
+ * count is always a superset of the inode reference count.
+ */
+static void release_ei(struct kref *ref)
+{
+   struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, 
kref);
+   kfree(ei->entry_attrs);
+   kfree(ei);
+}
+
+static inline void put_ei(struct eventfs_inode *ei)
+{
+   if (ei)
+   kref_put(>kref, release_ei);
+}
+
+static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
+{
+   if (ei)
+   kref_get(>kref);
+   return ei;
+}
+
 static struct dentry *eventfs_root_lookup(struct inode *dir,
  struct dentry *dentry,
  unsigned int flags);
@@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct 
inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *lookup_file(struct dentry *dentry,
+static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
+ struct dentry *dentry,
  umode_t mode,
  struct eventfs_attr *attr,
  void *data,
@@ -302,11 +331,11 @@ static struct dentry *lookup_file(struct dentry *dentry,
mode |= S_IFREG;
 
if (WARN_ON_ONCE(!S_ISREG(mode)))
-   return NULL;
+   return ERR_PTR(-EIO);
 
inode = tracefs_get_inode(dentry->d_sb);
if (unlikely(!inode))
-   return eventfs_failed_creating(dentry);
+   return ERR_PTR(-ENOMEM);
 
/* If the user updated the directory's attributes, use them */
update_inode_attr(dentry, inode, attr, mode);
@@ -322,9 +351,12 @@ static struct dentry *lookup_file(struct dentry *dentry,
ti->flags = TRACEFS_EVENT_INODE;
ti->private = NULL; // Directories have 'ei', files 
not
 
+   // Files have their parent's ei as their fsdata
+   dentry->d_fsdata = get_ei(parent_ei);
+
d_add(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
-   return eventfs_end_creating(dentry);
+   return NULL;
 };
 
 /**
@@ -343,7 +375,7 @@ static struct dentry *lookup_dir_entry(struct dentry 
*dentry,
 
inode = tracefs_get_inode(dentry->d_sb);
if (unlikely(!inode))
-   return eventfs_failed_creating(dentry);
+   return ERR_PTR(-ENOMEM);
 
/* If the user updated the directory's attributes, use them */
update_inode_attr(dentry, inode, >attr,
@@ -359,24 +391,28 @@ static struct dentry *lookup_dir_entry(struct dentry 

[PATCH 4/6] eventfs: remove unused 'd_parent' pointer field

2024-01-30 Thread Linus Torvalds
It's never used

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 4 +---
 fs/tracefs/internal.h| 2 --
 2 files changed, 1 insertion(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index ad11063bdd53..1d0102bfd7da 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -685,10 +685,8 @@ struct eventfs_inode *eventfs_create_dir(const char *name, 
struct eventfs_inode
INIT_LIST_HEAD(>list);
 
mutex_lock(_mutex);
-   if (!parent->is_freed) {
+   if (!parent->is_freed)
list_add_tail(>list, >children);
-   ei->d_parent = parent->dentry;
-   }
mutex_unlock(_mutex);
 
/* Was the parent freed? */
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 91c2bf0b91d9..8f38740bfb5b 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -35,7 +35,6 @@ struct eventfs_attr {
  * @name:  the name of the directory to create
  * @children:  link list into the child eventfs_inode
  * @dentry: the dentry of the directory
- * @d_parent:   pointer to the parent's dentry
  * @d_children: The array of dentries to represent the files when created
  * @entry_attrs: Saved mode and ownership of the @d_children
  * @attr:  Saved mode and ownership of eventfs_inode itself
@@ -50,7 +49,6 @@ struct eventfs_inode {
const char  *name;
struct list_headchildren;
struct dentry   *dentry; /* Check is_freed to access */
-   struct dentry   *d_parent;
struct dentry   **d_children;
struct eventfs_attr *entry_attrs;
struct eventfs_attr attr;
-- 
2.43.0.5.g38fb137bdb




[PATCH 3/6] tracefs: dentry lookup crapectomy

2024-01-30 Thread Linus Torvalds
The dentry lookup for eventfs files was very broken, and had lots of
signs of the old situation where the filesystem names were all created
statically in the dentry tree, rather than being looked up dynamically
based on the eventfs data structures.

You could see it in the naming - how it claimed to "create" dentries
rather than just look up the dentries that were given it.

You could see it in various nonsensical and very incorrect operations,
like using "simple_lookup()" on the dentries that were passed in, which
only results in those dentries becoming negative dentries.  Which meant
that any other lookup would possibly return ENOENT if it saw that
negative dentry before the data rwas then later filled in.

You could see it in the immesnse amount of nonsensical code that didn't
actually just do lookups.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 275 ---
 1 file changed, 52 insertions(+), 223 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index c0d977e6c0f2..ad11063bdd53 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
 {
struct eventfs_inode *ei;
 
-   mutex_lock(_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
-   mutex_unlock(_mutex);
 
update_top_events_attr(ei, dentry->d_sb);
 
@@ -280,11 +278,10 @@ static void update_inode_attr(struct dentry *dentry, 
struct inode *inode,
 }
 
 /**
- * create_file - create a file in the tracefs filesystem
- * @name: the name of the file to create.
+ * lookup_file - look up a file in the tracefs filesystem
+ * @dentry: the dentry to look up
  * @mode: the permission that the file should have.
  * @attr: saved attributes changed by user
- * @parent: parent dentry for this file.
  * @data: something that the caller will want to get to later on.
  * @fop: struct file_operations that should be used for this file.
  *
@@ -292,13 +289,13 @@ static void update_inode_attr(struct dentry *dentry, 
struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *create_file(const char *name, umode_t mode,
+static struct dentry *lookup_file(struct dentry *dentry,
+ umode_t mode,
  struct eventfs_attr *attr,
- struct dentry *parent, void *data,
+ void *data,
  const struct file_operations *fop)
 {
struct tracefs_inode *ti;
-   struct dentry *dentry;
struct inode *inode;
 
if (!(mode & S_IFMT))
@@ -307,12 +304,6 @@ static struct dentry *create_file(const char *name, 
umode_t mode,
if (WARN_ON_ONCE(!S_ISREG(mode)))
return NULL;
 
-   WARN_ON_ONCE(!parent);
-   dentry = eventfs_start_creating(name, parent);
-
-   if (IS_ERR(dentry))
-   return dentry;
-
inode = tracefs_get_inode(dentry->d_sb);
if (unlikely(!inode))
return eventfs_failed_creating(dentry);
@@ -331,29 +322,25 @@ static struct dentry *create_file(const char *name, 
umode_t mode,
ti->flags = TRACEFS_EVENT_INODE;
ti->private = NULL; // Directories have 'ei', files 
not
 
-   d_instantiate(dentry, inode);
+   d_add(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
 };
 
 /**
- * create_dir - create a dir in the tracefs filesystem
+ * lookup_dir_entry - look up a dir in the tracefs filesystem
+ * @dentry: the directory to look up
  * @ei: the eventfs_inode that represents the directory to create
- * @parent: parent dentry for this file.
  *
- * This function will create a dentry for a directory represented by
+ * This function will look up a dentry for a directory represented by
  * a eventfs_inode.
  */
-static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry 
*parent)
+static struct dentry *lookup_dir_entry(struct dentry *dentry,
+   struct eventfs_inode *pei, struct eventfs_inode *ei)
 {
struct tracefs_inode *ti;
-   struct dentry *dentry;
struct inode *inode;
 
-   dentry = eventfs_start_creating(ei->name, parent);
-   if (IS_ERR(dentry))
-   return dentry;
-
inode = tracefs_get_inode(dentry->d_sb);
if (unlikely(!inode))
return eventfs_failed_creating(dentry);
@@ -372,8 +359,11 @@ static struct dentry *create_dir(struc

[PATCH 2/6] eventfsfs: initialize the tracefs inode properly

2024-01-30 Thread Linus Torvalds
The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t 
mode,
inode->i_ino = EVENTFS_FILE_INODE_INO;
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE;
+   ti->flags = TRACEFS_EVENT_INODE;
+   ti->private = NULL; // Directories have 'ei', files 
not
+
d_instantiate(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, 
struct dentry *parent
inode->i_ino = eventfs_dir_ino(ei);
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE;
+   ti->flags = TRACEFS_EVENT_INODE;
+   ti->private = ei;
 
inc_nlink(inode);
d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
struct eventfs_inode *ei_child;
-   struct tracefs_inode *ti;
 
lockdep_assert_held(_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode 
*ei)
 srcu_read_lock_held(_srcu)) {
ei_child->d_parent = ei->dentry;
}
-
-   ti = get_tracefs(ei->dentry->d_inode);
-   ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char 
*name, struct dentry
INIT_LIST_HEAD(>list);
 
ti = get_tracefs(inode);
-   ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+   ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
ti->private = ei;
 
inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb




[PATCH 1/6] tracefs: avoid using the ei->dentry pointer unnecessarily

2024-01-30 Thread Linus Torvalds
The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, 
struct dentry *dentry,
return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry 
*dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct 
super_block *sb)
 {
-   struct inode *inode;
+   struct inode *root;
 
/* Only update if the "events" was on the top level */
if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
return;
 
/* Get the tracefs root inode. */
-   inode = d_inode(dentry->d_sb->s_root);
-   ei->attr.uid = inode->i_uid;
-   ei->attr.gid = inode->i_gid;
+   root = d_inode(sb->s_root);
+   ei->attr.uid = root->i_uid;
+   ei->attr.gid = root->i_gid;
 }
 
 static void set_top_events_ownership(struct inode *inode)
 {
struct tracefs_inode *ti = get_tracefs(inode);
struct eventfs_inode *ei = ti->private;
-   struct dentry *dentry;
 
/* The top events directory doesn't get automatically updated */
if (!ei || !ei->is_events || !(ei->attr.mode & EVENTFS_TOPLEVEL))
return;
 
-   dentry = ei->dentry;
-
-   update_top_events_attr(ei, dentry);
+   update_top_events_attr(ei, inode->i_sb);
 
if (!(ei->attr.mode & EVENTFS_SAVE_UID))
inode->i_uid = ei->attr.uid;
@@ -235,8 +232,10 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
 
mutex_lock(_mutex);
do {
-   /* The parent always has an ei, except for events itself */
-   ei = dentry->d_parent->d_fsdata;
+   // The parent is stable because we do not do renames
+   dentry = dentry->d_parent;
+   // ... and directories always have d_fsdata
+   ei = dentry->d_fsdata;
 
/*
 * If the ei is being freed, the ownership of the children
@@ -246,12 +245,11 @@ static struct eventfs_inode *eventfs_find_events(struct 
dentry *dentry)
ei = NULL;
break;
}
-
-   dentry = ei->dentry;
+   // Walk upwards until you find the events inode
} while (!ei->is_events);
mutex_unlock(_mutex);
 
-   update_top_events_attr(ei, dentry);
+   update_top_events_attr(ei, dentry->d_sb);
 
return ei;
 }
-- 
2.43.0.5.g38fb137bdb




Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 08:55, Linus Torvalds
 wrote:
>
> Let me just reboot into it to test that I got all the cases.
>
> It makes the code even more obvious, and avoids all the complexity.

Yup, this works at least for the case you pointed out, and it really
feels like the RightThing(tm) from a VFS standpoint.

NOTE! This patch is on top of my previous 5-patch series.

 Linus
From 67b31bb53ef65132a65bfbe915628c481a39e73a Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Tue, 30 Jan 2024 09:00:18 -0800
Subject: [PATCH] eventfs: clean up dentry ops and add revalidate function

In order for the dentries to stay up-to-date with the eventfs changes,
just add a 'd_revalidate' function that checks the 'is_freed' bit.

Also, clean up the dentry release to actually use d_release() rather
than the slightly odd d_iput() function.  We don't care about the inode,
all we want to do is to get rid of the refcount to the eventfs data
added by dentry->d_fsdata.

It would probably be cleaner to make eventfs its own filesystem, or at
least set its own dentry ops when looking up eventfs files.  But as it
is, only eventfs dentries use d_fsdata, so we don't really need to split
these things up by use.

Another thing that might be worth doing is to make all eventfs lookups
mark their dentries as not worth caching.  We could do that with
d_delete(), but the DCACHE_DONTCACHE flag would likely be even better.

As it is, the dentries are all freeable, but they only tend to get freed
at memory pressure rather than more proactively.  But that's a separate
issue.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 21 +
 fs/tracefs/inode.c   | 27 ++-
 fs/tracefs/internal.h|  3 ++-
 3 files changed, 25 insertions(+), 26 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index a37db0dac302..acdc797bd9c0 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -414,23 +414,14 @@ static inline struct eventfs_inode *alloc_ei(const char *name)
 
 
 /**
- * eventfs_set_ei_status_free - remove the dentry reference from an eventfs_inode
- * @ti: the tracefs_inode of the dentry
+ * eventfs_d_release - dentry is going away
  * @dentry: dentry which has the reference to remove.
  *
  * Remove the association between a dentry from an eventfs_inode.
  */
-void eventfs_set_ei_status_free(struct tracefs_inode *ti, struct dentry *dentry)
+void eventfs_d_release(struct dentry *dentry)
 {
-	struct eventfs_inode *ei;
-
-	mutex_lock(_mutex);
-	ei = dentry->d_fsdata;
-	if (ei) {
-		dentry->d_fsdata = NULL;
-		put_ei(ei);
-	}
-	mutex_unlock(_mutex);
+	put_ei(dentry->d_fsdata);
 }
 
 /**
@@ -517,9 +508,8 @@ static struct dentry *eventfs_root_lookup(struct inode *dir,
 	}
 
  enoent:
-	/* Nothing found? */
-	d_add(dentry, NULL);
-	result = NULL;
+	/* Don't cache negative lookups, just return an error */
+	result = ERR_PTR(-ENOENT);
 
  out:
 	mutex_unlock(_mutex);
@@ -857,6 +847,5 @@ void eventfs_remove_events_dir(struct eventfs_inode *ei)
 	 * sticks around while the other ei->dentry are created
 	 * and destroyed dynamically.
 	 */
-	simple_recursive_removal(dentry, NULL);
 	dput(dentry);
 }
diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index e1b172c0e091..64122787e5d0 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -379,21 +379,30 @@ static const struct super_operations tracefs_super_operations = {
 	.show_options	= tracefs_show_options,
 };
 
-static void tracefs_dentry_iput(struct dentry *dentry, struct inode *inode)
+/*
+ * It would be cleaner if eventfs had its own dentry ops.
+ *
+ * Note that d_revalidate is called potentially under RCU,
+ * so it can't take the eventfs mutex etc. It's fine - if
+ * we open a file just as it's marked dead, things will
+ * still work just fine, and just see the old stale case.
+ */
+static void tracefs_d_release(struct dentry *dentry)
 {
-	struct tracefs_inode *ti;
+	if (dentry->d_fsdata)
+		eventfs_d_release(dentry);
+}
 
-	if (!dentry || !inode)
-		return;
+static int tracefs_d_revalidate(struct dentry *dentry, unsigned int flags)
+{
+	struct eventfs_inode *ei = dentry->d_fsdata;
 
-	ti = get_tracefs(inode);
-	if (ti && ti->flags & TRACEFS_EVENT_INODE)
-		eventfs_set_ei_status_free(ti, dentry);
-	iput(inode);
+	return !(ei && ei->is_freed);
 }
 
 static const struct dentry_operations tracefs_dentry_operations = {
-	.d_iput = tracefs_dentry_iput,
+	.d_revalidate = tracefs_d_revalidate,
+	.d_release = tracefs_d_release,
 };
 
 static int trace_fill_super(struct super_block *sb, void *data, int silent)
diff --git a/fs/tracefs/internal.h b/fs/tracefs/internal.h
index 72db3bdc4dfb..d4194466b643 100644
--- a/fs/tracefs/internal.h
+++ b/fs/tracefs/internal.h
@@ -79,6 +79,7 @@ struct inode *tracefs_get_inode(struct super_block *sb);
 struct dentry *eventfs_start_creating(const char *name, struct dentry *parent);
 struct dent

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 08:49, Steven Rostedt  wrote:
>
> - On removal, I got rid of the SRCU callback and the work queue.
>   Instead, I find the dentry of the current eventfs_inode that is being
>   deleted by walking the ei->parent until I find the events inode that has
>   a dentry. I then use that to do a lookup walking back down to the
>   eventfs_inode I want to delete. This gives me the dentry that I can call
>   d_invalidate() on.

Yes, that works.

However, I have a patch that is *much* smaller and simpler, and
doesn't need that walk.

The VFS layer already has a good interface for "should I still use
this dentry", which is needed for various network filesystems etc that
want to time out caches (or check explicitly whether the file still
exists etc): it's the dentry d_revalidate() check.

Let me just reboot into it to test that I got all the cases.

It makes the code even more obvious, and avoids all the complexity.

   Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 06:39, Steven Rostedt  wrote:
>
> On Tue, 30 Jan 2024 01:12:05 -0800
> >
> > I suspect the solution is to make eventfs_create_dir() do the same as
> > the events directory case does, and actually pin the directory dentry
> > and save it off.
>
> I rather not have the create do that because that happens for every event
> directory.

I wasn't thinking straight yesterday - the real fix is to just do a
"d_revalidate()". It's literally why that thing exists: check whether
a dentry is still valid.

In fact, that trivially gets rid of the 'events' subdirectory issues
too, so we can stop doing that horrendous simple_recursive_removal()
too.

Let me reboot into the trivial fix. I just needed to think about this
the right way.

None of this is anything that the VFS layer has any problems with.

   Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Tue, 30 Jan 2024 at 00:43, Linus Torvalds
 wrote:
>
> I'll go back to bed, but I think the fix is something trivial like this:

Almost.

>   + result = ERR_PTR(ENOENT);

That needs a '-' in front of the ENOENT, otherwise you have a positive
error number and things go wrong very quickly.

And that does indeed fix the lookup problem, but you end up with the
same problem later when you do the eventfs_remove_dir(). Again the
eventfs data structure changes, but we don't have a reliable dentry
that we can invalidate.

The dentry cache is just very good at caching those old dentries, and
the interface for eventfs_create_dir() and eventfs_remove_dir() is
just not great.

If those did an actual path lookup (like eventfs_create_events_dir()
does), we'd have the dentry, and it's trivial to get from dentry to
eventfs_inode.

But going the other way is the broken thing because of how the
dentries are just temporary caches.

I suspect the solution is to make eventfs_create_dir() do the same as
the events directory case does, and actually pin the directory dentry
and save it off.

Oh well. At least I understand what the problem is. Now I'm going to
try to go back to sleep.

  Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-30 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 19:56, Linus Torvalds
 wrote:
>
> But I've been staring at this code for too long, so I'm posting this
> just as a "it's broken, but _something_ like this", because I'm taking
> a break from looking at this.

Bah. I literally woke up and realized what the problem is.

We're caching negative dentries, but eventfs_create_dir() has no way
to invalidate the old negative dentry when it adds a new entry.

The old "try to reuse dentry" ended up hiding that issue, because it
would look up the negative dentry from the 'ei' and turn it into a
positive one.

And I was stupidly staring at the wrong code - all these functions
with almost the same names.

eventfs_create_events_dir() is fine, because it gets the parent as a
dentry, so looking up the new dentry is trivial.

But eventfs_create_dir() doesn't get a dentry at all. It gets the parent as a

  struct eventfs_inode *parent

which is no good.

So that explains why creating an event after deleting an old one - or
after just looking it up before it was created - ends up with the
nonsensical "ls" output - it gets listed by readdir() because the
entry is there in the eventfs data structures, but then doing a
"stat()" on it will just hit the negative dentry. So it won't actually
look up the dentry.

The simple solution is probably just to not cache negative dentries
for eventfs.  So instead of doing the "d_add(dentry, NULL);" we should
just return "ERR_PTR(ENOENT)".

Which is actually what /proc/ lookups do, for the same reason.

I'll go back to bed, but I think the fix is something trivial like this:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -517,9 +517,8 @@ static struct dentry *eventfs_root_lookup(
}

enoent:
  - /* Nothing found? */
  - d_add(dentry, NULL);
  - result = NULL;
  + /* Don't cache negative lookups, just return an error */
  + result = ERR_PTR(ENOENT);

out:
mutex_unlock(_mutex);

I just wanted to write it down when the likely solution struck me.

   Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 17:50, Linus Torvalds
 wrote:
>
> So what I propose is that
>
>  - ei->dentry and ei->d_children[] need to die. Really. They are
> buggy. There is no way to save them. There never was.
>
>  - but we *can* introduce a new 'ei->events_dir' pointer that is
> *only* set by eventfs_create_events_dir(), and which is stable exactly
> because that function also does a dget() on it, so now the dentry will
> actually continue to exist reliably
>
> I think that works.

Well, it doesn't. I don't see where the bug is, but since Al is now
aware of the thread, maybe when he wakes up he will tell me where I've
gone wrong.

In the meantime I did do the pending tracefs pull, so the series has
changed a bit, and this is the rebased series on top of my current
public git tree.

It is still broken wrt 'events' directories. You don't even need the
"create, delete, create" sequence that Steven pointed out, just a
plain sequence of

 # cd /sys/kernel/tracing
 # ls events/kprobes/
 # echo 'p:sched schedule' >> kprobe_events

messes up - ie it's enough to just have 'lookup' create a negative
dentry by trying to look up 'events/kprobes/' before actually trying
to create that kprobe_events.

But I've been staring at this code for too long, so I'm posting this
just as a "it's broken, but _something_ like this", because I'm taking
a break from looking at this.

I'll get back to it tomorrow, but I hope that Al will show me the
error of my ways.

  Linus
From 6763ac4af7ccc0c97fb5f7c98d0c8ae1289ec0fe Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Mon, 29 Jan 2024 18:49:42 -0800
Subject: [PATCH 5/5] eventfs: get rid of dentry pointers without refcounts

The eventfs inode had pointers to dentries (and child dentries) without
actually holding a refcount on said pointer.  That is fundamentally
broken, and while eventfs tried to then maintain coherence with dentries
going away by hooking into the '.d_iput' callback, that doesn't actually
work since it's not ordered wrt lookups.

There were two reasonms why eventfs tried to keep a pointer to a dentry:

 - the creation of a 'events' directory would actually have a stable
   dentry pointer that it created with tracefs_start_creating().

   And it needed that dentry when tearing it all down again in
   eventfs_remove_events_dir().

   This use is actually ok, because the special top-level events
   directory dentries are actually stable, not just a temporary cache of
   the eventfs data structures.

 - the 'eventfs_inode' (aka ei) needs to stay around as long as there
   are dentries that refer to it.

   It then used these dentry pointers as a replacement for doing
   reference counting: it would try to make sure that there was only
   ever one dentry associated with an event_inode, and keep a child
   dentry array around to see which dentries might still refer to the
   parent ei.

This gets rid of the invalid dentry pointer use, and renames the one
valid case to a different name to make it clear that it's not just any
random dentry.

The magic child dentry array that is kind of a "reverse reference list"
is simply replaced by having child dentries take a ref to the ei.  As
does the directory dentries.  That makes the broken use case go away.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 245 ---
 fs/tracefs/internal.h|   9 +-
 2 files changed, 80 insertions(+), 174 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1d0102bfd7da..a37db0dac302 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -62,6 +62,34 @@ enum {
 
 #define EVENTFS_MODE_MASK	(EVENTFS_SAVE_MODE - 1)
 
+/*
+ * eventfs_inode reference count management.
+ *
+ * NOTE! We count only references from dentries, in the
+ * form 'dentry->d_fsdata'. There are also references from
+ * directory inodes ('ti->private'), but the dentry reference
+ * count is always a superset of the inode reference count.
+ */
+static void release_ei(struct kref *ref)
+{
+	struct eventfs_inode *ei = container_of(ref, struct eventfs_inode, kref);
+	kfree(ei->entry_attrs);
+	kfree(ei);
+}
+
+static inline void put_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_put(>kref, release_ei);
+}
+
+static inline struct eventfs_inode *get_ei(struct eventfs_inode *ei)
+{
+	if (ei)
+		kref_get(>kref);
+	return ei;
+}
+
 static struct dentry *eventfs_root_lookup(struct inode *dir,
 	  struct dentry *dentry,
 	  unsigned int flags);
@@ -289,7 +317,8 @@ static void update_inode_attr(struct dentry *dentry, struct inode *inode,
  * directory. The inode.i_private pointer will point to @data in the open()
  * call.
  */
-static struct dentry *lookup_file(struct dentry *dentry,
+static struct dentry *lookup_file(struct eventfs_inode *parent_ei,
+  struct dentry *dentry,
   umode_t mode,
   struct eventfs_attr

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 16:35, Steven Rostedt  wrote:
>
>  # echo 'p:sched schedule' >> /sys/kernel/tracing/kprobe_events
>  # ls -l events/kprobes/
> ls: cannot access 'events/kprobes/': No such file or directory
>
> Where it should now exist but doesn't. But the lookup code never triggered.
>
> If the lookup fails, does it cache the result?

I think you end up still having negative dentries around.

The old code then tried to compensate for that by trying to remember
the old dentry with 'ei->dentry' and 'ei->d_children[]', and would at
lookup time try to use the *old* dentry instead of the new one.

And because dentries are just caches and can go away, it then had that
odd dance with '.d_iput', so that when a dentry was removed, it would
be removed from the 'ei->dentry' and 'ei->d_children[]' array too.

Except that d_iput() of an old dentry isn't actually serialized with
->d_lookup() in any way, so you end up with the whole race that I
already talked about earlier, where you could still have an
'ei->dentry' that pointed to something that had already been unhashed,
but d_iput() hadn't been called *yet*, so d_lookup() is called with a
new dentry, but the tracefs code then desperately tries to use the old
dentry pointer that just isn't _valid_ any more, but it doesn't know
that because d_iput() hasn't been called yet...

And as I *also* pointed out when I described that originally, you'll
practically never hit this race, because you just need to be *very*
unlucky with the whole "dentry is freed due to memory pressure".

But basically, this is why I absolutely *HATE* that "ei->dentry"
backpointer. It's truly fundamentally broken.

You can't reference-count it, since the whole point of your current
tracefs scheme is to *not* keep dentries and inodes around forever,
and doing a "dget()" on that 'ei->dentry' would thus fundamentally
screw that up.

But you also cannot keep it in sync with dentries being released due
to memory pressure, because of the above thing.

See why I've tried to tell you that the back-pointer is basically a
100% sign of a bug.

The *only* time you can have a valid dentry pointer is when you have
also taken a ref to it with dget(), and you can't do that.

So then you have all that completely broken code that _tries_ to
maintain consistency with ->d_children[] etc, and it works 99.9% in
practice, because the race is just so hard to hit because dentries
only normally get evicted either synchronously (which you do under the
eventfs_mutex) or under memory pressure (which is basically never
going to be something you can test).

And yes, my lookup patch removed all the band-aids for "if I have an
ei->dentry, I'll reuse it". So I think it ends up exposing all the
previous bugs that the old "let's reuse the old dentry" code tried to
hide.

But, as mentioned, that ei->dentry pointer really REALLY is broken.

NBow, having looked at this a lot, I think I have a way forward.
Because there is actually *one* case where you actually *do* do the
whole "dget()" to get a stable dentry pointer. And that's exactly the
"events" directory creation (ie eventfs_create_events_dir()).

So what I propose is that

 - ei->dentry and ei->d_children[] need to die. Really. They are
buggy. There is no way to save them. There never was.

 - but we *can* introduce a new 'ei->events_dir' pointer that is
*only* set by eventfs_create_events_dir(), and which is stable exactly
because that function also does a dget() on it, so now the dentry will
actually continue to exist reliably

I think that works. The only thing that actually *needs* the existing
'ei->dentry' is literally the eventfs_remove_events_dir() that gets
rid of the stable events directory. It's undoing
eventfs_create_events_dir(), and it will do the final dput() too.

I will try to make a patch for this. I do think it means that every
time we do that

dentry->d_fsdata = ei;

we need to also do proper reference counting of said 'ei'. Because we
can't release 'ei' early when we have dentries that point to it.

Let me see how painful this will be.

 Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:49, Steven Rostedt  wrote:
>
> Now I didn't change this last d_instantiate, because this is not called
> through the lookup code. This is the root events directory and acts more
> like debugfs. It's not "dynamically" added.

Ahh, yes, I see, the dentry was created (as a negative one) with
tracefs_start_creating() -> lookup_one_len().

So  yes, there d_instantiate() is correct, as it's exactly that "turn
negative dentry into a positive one" case.

I'll go see what's up with the "create it again" case - I don't
immediately see what's wrong.

  Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:35, Linus Torvalds
 wrote:
>
> So just replace all the d_instantiate() calls there with "d_add()"
> instead. I think that will fix it.

I can confirm that with the mutex deadlock removed and the d_add()
fix, at least things *look* superficially ok.

I didn't actually do anything with it. So it might be leaking dentry
refs like mad or something like that, but at least the obvious cases
look fine.

just for completeness, here's the fixup diff I used.

  Linus
 fs/tracefs/event_inode.c | 8 +++-
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index cd6de322..5b307bb64f8f 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -230,7 +230,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 {
 	struct eventfs_inode *ei;
 
-	mutex_lock(_mutex);
 	do {
 		// The parent is stable because we do not do renames
 		dentry = dentry->d_parent;
@@ -247,7 +246,6 @@ static struct eventfs_inode *eventfs_find_events(struct dentry *dentry)
 		}
 		// Walk upwards until you find the events inode
 	} while (!ei->is_events);
-	mutex_unlock(_mutex);
 
 	update_top_events_attr(ei, dentry->d_sb);
 
@@ -324,7 +322,7 @@ static struct dentry *lookup_file(struct dentry *dentry,
 	ti->flags = TRACEFS_EVENT_INODE;
 	ti->private = NULL;			// Directories have 'ei', files not
 
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
 };
@@ -365,7 +363,7 @@ static struct dentry *lookup_dir_entry(struct dentry *dentry,
 ei->dentry = dentry;	// Remove me!
 
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -786,7 +784,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 
 	/* directory inodes start off with i_nlink == 2 (for "." entry) */
 	inc_nlink(inode);
-	d_instantiate(dentry, inode);
+	d_add(dentry, inode);
 	inc_nlink(dentry->d_parent->d_inode);
 	fsnotify_mkdir(dentry->d_parent->d_inode, dentry);
 	tracefs_end_creating(dentry);


Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 14:21, Steven Rostedt  wrote:
>
> But crashes with just a:
>
>  # ls /sys/kernel/tracing/events
>
> [   66.423983] [ cut here ]
> [   66.426447] kernel BUG at fs/dcache.c:1876!

Duh.

That's a bit too much copy-and-paste by me.

So what is going on is that a ->lookup() function should *not* call
d_instantiate() at all, and the only reason it actually used to work
here was due to the incorrect "simple_lookup()", which basically did
all the preliminaries.

A ->lookup() should do 'd_add()' on the dentry.

So just replace all the d_instantiate() calls there with "d_add()"
instead. I think that will fix it.

Basically the "simple_lookup()" had done the "d_add(dentry, NULL)",
and at that point the "d_instantiate()" just exposed the inode and
turned the negative dentry into a positive one.

So "d_add()" is "I'm adding the inode to a new dentry under lookup".
And "d_instantiate()" is "I'm adding this inode to an existing dentry
that used to be negative"

And so the old "d_add(NULL)+d_instantiate(inode)" _kind_ of worked,
except it made that negative dentry visible for a short while.

And when I did the cleanup, I didn't think of this thing, so I left
the d_instantiate() calls as such, even though they now really need to
be d_add().

Hope that explains it.

And I hope there aren't any other stupid things I missed like that.

 Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 13:45, Steven Rostedt  wrote:
> >  1 file changed, 50 insertions(+), 219 deletions(-)
>
> Thanks, much appreciated.

Well, I decided I might as well give it a test-run, and there's an
immediate deadlock on eventfs_mutex, because I missed removing it from
eventfs_find_events() when the callers now already hold it.

So at a minimum, it will require this patch on top:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -230,7 +230,6 @@ static struct eventfs_inode
*eventfs_find_events(
   {
struct eventfs_inode *ei;

  - mutex_lock(_mutex);
do {
// The parent is stable because we do not do renames
dentry = dentry->d_parent;
  @@ -247,7 +246,6 @@
}
// Walk upwards until you find the events inode
} while (!ei->is_events);
  - mutex_unlock(_mutex);

update_top_events_attr(ei, dentry->d_sb);

to not deadlock immediately on the first lookup.

And honestly, there might be other such obvious "I missed that when
reading the code".

Let me reboot into a fixed system and do some more basic smoke-testing.

Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 12:25, Steven Rostedt  wrote:
>
> > So the fundamental bug I now find is that eventfs_root_lookup() gets a
> > target dentry, and for some unfathomable reason it then does
> >
> > ret = simple_lookup(dir, dentry, flags);
> >
> > on it. Which is *completely* broken, because what "simple_lookup()"
> > does is just say "oh, you didn't have a dentry of this kind before, so
> > clearly a lookup must be a non-existent file". Remember: this is for
> > 'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
> > files.
>
> Sorry, I don't really understand what you mean by "ALL files"? You mean
> that all files in the pseudo file system has a dentry to it (like debugfs,
> and the rest of tracefs)?

Yes.

So the whole - and *ONLY* - point of 'simple_lookup()' is for
filesystems like tmpfs, or like debugfs or other filesystems like
that, which never actually *need* to look anything up, because
everything is already cached in the dentry tree.

That's what the "simple" part of the simple functions mean. They are
simple from a dcache standpoint, because the dcache is all there is.

End result: what simple_lookup() does is say "oh, you didn't have the
file, so it's by definition a negative dentry", and thus all it does
is to do "d_add(dentry, NULL)".

Anyway, removing this was painful. I initially thought "I'll just
remove the calls". But it all ended up cascading into "that's also
wrong".

So now I have a patch that tries to fix this all up, and it looks like thisL:

 1 file changed, 50 insertions(+), 219 deletions(-)

because it basically removed all the old code, and replaced it with
much simpler code.

I'm including the patch here as an attachment, but I want to note very
clearly that this *builds* for me, and it looks a *lot* more obvious
and correct than the old code did, but I haven't tested it. AT ALL.

Also note that it depends on my previous patches, so I guess I'll
include them here again just to make it unambiguous.

Finally - this does *not* fix up the refcounting. I still think the
SRCU stuff is completely broken. But that's another headache. But at
least now the *lookup* parts look like they DTRT wrt eventfs_mutex.

The SRCU logic from the directory iteration parts still needs crapectomy.

AGAIN: these patches (ie particularly that last one - 0004) were all
done entirely "blindly" - I've looked at the code, and fixed the bugs
and problems I've seen by pure code inspection.

That's great, but it really means that it's all untested. It *looks*
better than the old code, but there may be some silly gotcha that I
have missed.

Linus
From b1f487acf6f4e9093d8b0fa00f864a6d07a3c4c2 Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Sat, 27 Jan 2024 13:27:01 -0800
Subject: [PATCH 2/4] tracefs: avoid using the ei->dentry pointer unnecessarily

The eventfs_find_events() code tries to walk up the tree to find the
event directory that a dentry belongs to, in order to then find the
eventfs inode that is associated with that event directory.

However, it uses an odd combination of walking the dentry parent,
looking up the eventfs inode associated with that, and then looking up
the dentry from there.  Repeat.

But the code shouldn't have back-pointers to dentries in the first
place, and it should just walk the dentry parenthood chain directly.

Similarly, 'set_top_events_ownership()' looks up the dentry from the
eventfs inode, but the only reason it wants a dentry is to look up the
superblock in order to look up the root dentry.

But it already has the real filesystem inode, which has that same
superblock pointer.  So just pass in the superblock pointer using the
information that's already there, instead of looking up extraneous data
that is irrelevant.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 26 --
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 1c3dd0ad4660..2d128bedd654 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -156,33 +156,30 @@ static int eventfs_set_attr(struct mnt_idmap *idmap, struct dentry *dentry,
 	return ret;
 }
 
-static void update_top_events_attr(struct eventfs_inode *ei, struct dentry *dentry)
+static void update_top_events_attr(struct eventfs_inode *ei, struct super_block *sb)
 {
-	struct inode *inode;
+	struct inode *root;
 
 	/* Only update if the "events" was on the top level */
 	if (!ei || !(ei->attr.mode & EVENTFS_TOPLEVEL))
 		return;
 
 	/* Get the tracefs root inode. */
-	inode = d_inode(dentry->d_sb->s_root);
-	ei->attr.uid = inode->i_uid;
-	ei->attr.gid = inode->i_gid;
+	root = d_inode(sb->s_root);
+	ei->attr.uid = root->i_uid;
+	ei->attr.gid = root->i_gid;
 }
 
 st

Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 11:24, Linus Torvalds
 wrote:
>
> So the patch was completely broken. Here's the one that should
> actually compile (although still not actually *tested*).

Note that this fixes the d_instantiate() ordering wrt initializing the inode.

But as I look up the call chain, I see many more fundamental mistakes.

Steven - the reason you think that the VFS doesn't have documentation
is that we *do* have tons of documentation, but it's of the kind "Here
is what you should do".

It is *not* of the kind that says "You messed up and did something
else, and how do you recover from it?".

So the fundamental bug I now find is that eventfs_root_lookup() gets a
target dentry, and for some unfathomable reason it then does

ret = simple_lookup(dir, dentry, flags);

on it. Which is *completely* broken, because what "simple_lookup()"
does is just say "oh, you didn't have a dentry of this kind before, so
clearly a lookup must be a non-existent file". Remember: this is for
'tmpfs' kinds of filesystems where the dentry cache cotnains *ALL*
files.

For the tracefs kind of filesystem, it's TOTALLY BOGUS. What the
"simple_lookup()" will do is just a plain

d_add(dentry, NULL);

and nothing else. And guess what *that* does? It basically
instantiates a negative dentry, telling all other lookups that the
path does not exist.

So if you have two concurrent lookups, one will do that
simple_lookup(), and the other will then - depending on timing -
either see the negative dentry and return -ENOENT, or - if it comes in
a bit later - see the new inode that then later gets added by the
first lookup with d_instantiate().

See? That simple_lookup() is not just unnecessary, but it's also
actively completely WRONG. Because it instantiates a NULL pointer,
other processes that race with the lookup may now end up saying "that
file doesn't exist", even though it should.

Basically, you can't use *any* of the "simple" filesystem helpers.
Because they are all designed for that "the dentry tree is all there
is" case.

Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
 wrote:
>
> IOW, I think the right fix is really just this:

Oh, for some reason I sent out the original patch I had which didn't
fix the create_dir() case.

So that patch was missing the important hunk that added the

ti->flags = TRACEFS_EVENT_INODE;
ti->private = ei;

to create_dir() (to match the removal in eventfs_post_create_dir()).

I had incorrectly put it in the create_file() case, that should just
set ->private to NULL. afaik

So the patch was completely broken. Here's the one that should
actually compile (although still not actually *tested*).

   Linus
From 6e5db10ebc96ebe6b9707c9938c450f51e9a3ae0 Mon Sep 17 00:00:00 2001
From: Linus Torvalds 
Date: Mon, 29 Jan 2024 11:06:32 -0800
Subject: [PATCH] eventfsfs: initialize the tracefs inode properly

The tracefs-specific fields in the inode were not initialized before the
inode was exposed to others through the dentry with 'd_instantiate()'.

And the ->flags file was initialized incorrectly with a '|=', when the
old value was stale.  It should have just been a straight assignment.

Move the field initializations up to before the d_instantiate, and fix
the use of uninitialized data.

Signed-off-by: Linus Torvalds 
---
 fs/tracefs/event_inode.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/fs/tracefs/event_inode.c b/fs/tracefs/event_inode.c
index 2d128bedd654..c0d977e6c0f2 100644
--- a/fs/tracefs/event_inode.c
+++ b/fs/tracefs/event_inode.c
@@ -328,7 +328,9 @@ static struct dentry *create_file(const char *name, umode_t mode,
 	inode->i_ino = EVENTFS_FILE_INODE_INO;
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = NULL;			// Directories have 'ei', files not
+
 	d_instantiate(dentry, inode);
 	fsnotify_create(dentry->d_parent->d_inode, dentry);
 	return eventfs_end_creating(dentry);
@@ -367,7 +369,8 @@ static struct dentry *create_dir(struct eventfs_inode *ei, struct dentry *parent
 	inode->i_ino = eventfs_dir_ino(ei);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE;
+	ti->flags = TRACEFS_EVENT_INODE;
+	ti->private = ei;
 
 	inc_nlink(inode);
 	d_instantiate(dentry, inode);
@@ -513,7 +516,6 @@ create_file_dentry(struct eventfs_inode *ei, int idx,
 static void eventfs_post_create_dir(struct eventfs_inode *ei)
 {
 	struct eventfs_inode *ei_child;
-	struct tracefs_inode *ti;
 
 	lockdep_assert_held(_mutex);
 
@@ -523,9 +525,6 @@ static void eventfs_post_create_dir(struct eventfs_inode *ei)
  srcu_read_lock_held(_srcu)) {
 		ei_child->d_parent = ei->dentry;
 	}
-
-	ti = get_tracefs(ei->dentry->d_inode);
-	ti->private = ei;
 }
 
 /**
@@ -943,7 +942,7 @@ struct eventfs_inode *eventfs_create_events_dir(const char *name, struct dentry
 	INIT_LIST_HEAD(>list);
 
 	ti = get_tracefs(inode);
-	ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
+	ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
 	ti->private = ei;
 
 	inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
-- 
2.43.0.5.g38fb137bdb



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:44, Steven Rostedt  wrote:
>
> On Mon, 29 Jan 2024 09:40:06 -0800
> Linus Torvalds  wrote:
>
> > Now, I do agree that your locking is strange, and that should be fixed
> > *too*, but I think the above is the "right" fix for this particular
> > issue.
>
> Would you be OK if I did both as a "fix"?

See my crossed email - not dropping the mutex *is* actually a fix for
another piece of data.

  Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:40, Linus Torvalds
 wrote:
>
> eventfs_create_events_dir() seems to have the same bug with ti->flags,
> btw, but got the ti->private initialization right.
>
> Funnily enough, create_file() got this right. I don't even understand
> why create_dir() did what it did.
>
> IOW, I think the right fix is really just this:

Actually, I think you have another uninitialized field here too:
'dentry->d_fsdata'.

And it looks like both create_file and create_dir got that wrong, but
eventfs_create_events_dir() got it right.

So you *also* need to do that

dentry->d_fsdata = ei;

before you do the d_instantiate().

Now, from a quick look, all the d_fsdata accesses *do* seem to be
protected by the eventfs_mutex, except

 (a) eventfs_create_events_dir() doesn't seem to take the mutex, but
gets the ordering right, so is ok

 (b) create_dir_dentry() drops the mutex in the middle, so the mutex
doesn't actually serialize anything

Not dropping the mutex in create_dir_dentry() *will* fix this bug, but
honestly, I'd suggest that in addition to not dropping the mutex, you
just fix the ordering too.

IOW, just do that

dentry->d_fsdata = ei;

before you do d_instantiate(), and now accessing d_fsdata is just
always safe and doesn't even need the mutex.

The whole "initialize everything before exposing it to others" is
simply just a good idea.

Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-29 Thread Linus Torvalds
On Mon, 29 Jan 2024 at 09:01, Steven Rostedt  wrote:
>
> Thanks for the analysis. I have a patch that removes the dropping of the
> mutex over the create_dir/file() calls, and lockdep hasn't complained about
> it.
>
> I was going to add that to my queue for the next merge window along with
> other clean ups but this looks like it actually fixes a real bug. I'll move
> that over to my urgent queue and start testing it.

Note that it is *not* enough to just keep the mutex.

All the *users* need to get the mutex too.

Otherwise you have this:

CPU1:

   create_dir_dentry():
  mutex locked the whole time..
dentry = create_dir(ei, parent);
   does d_instantiate(dentry, inode);
eventfs_post_create_dir(ei);
   dentry->d_fsdata = ei;
mutex dropped;

but CPU2 can still come in, see the dentry immediately after the
"d_instantiate()", and do an "open()" or "stat()" on the dentry (which
will *not* cause a 'lookup()', since it's in the dentry cache), and
that will then cause either

 ->permission() -> eventfs_permission() -> set_top_events_ownership()

or

 ->get_attr() -> eventfs_get_attr() -> set_top_events_ownership()

and both of those will now do the dentry->inode->ei lookup. And
neither of them takes the mutex.

So then it doesn't even matter that you didn't drop the mutex in the
middle, because the users simply won't be serializing with it anyway.

So you'd have to make set_top_events_ownership() take the mutex around
it all too.

In fact, pretty much *any* use of "ti->private" needs the mutex.

Which is obviously a bit painful.

Honestly, I think the right model is to just make sure that the inode
is fully initialized when you do 'd_instantiate()'

The patch looks obvious, and I think this actually fixes *another*
bug, namely that the old

ti = get_tracefs(inode);
ti->flags |= TRACEFS_EVENT_INODE;

was buggy, because 'ti->flags' was uninitialized before.

eventfs_create_events_dir() seems to have the same bug with ti->flags,
btw, but got the ti->private initialization right.

Funnily enough, create_file() got this right. I don't even understand
why create_dir() did what it did.

IOW, I think the right fix is really just this:

  --- a/fs/tracefs/event_inode.c
  +++ b/fs/tracefs/event_inode.c
  @@ -328,7 +328,8 @@
inode->i_ino = EVENTFS_FILE_INODE_INO;

ti = get_tracefs(inode);
  - ti->flags |= TRACEFS_EVENT_INODE;
  + ti->flags = TRACEFS_EVENT_INODE;
  + ti->private = ei;
d_instantiate(dentry, inode);
fsnotify_create(dentry->d_parent->d_inode, dentry);
return eventfs_end_creating(dentry);
  @@ -513,7 +514,6 @@
   static void eventfs_post_create_dir(struct eventfs_inode *ei)
   {
struct eventfs_inode *ei_child;
  - struct tracefs_inode *ti;

lockdep_assert_held(_mutex);

  @@ -523,9 +523,6 @@
 srcu_read_lock_held(_srcu)) {
ei_child->d_parent = ei->dentry;
}
  -
  - ti = get_tracefs(ei->dentry->d_inode);
  - ti->private = ei;
   }

   /**
  @@ -943,7 +940,7 @@
INIT_LIST_HEAD(>list);

ti = get_tracefs(inode);
  - ti->flags |= TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
  + ti->flags = TRACEFS_EVENT_INODE | TRACEFS_EVENT_TOP_INODE;
ti->private = ei;

inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;

which fixes the initialization errors with the 'ti' fields.

Now, I do agree that your locking is strange, and that should be fixed
*too*, but I think the above is the "right" fix for this particular
issue.

Linus



Re: [linus:master] [eventfs] 852e46e239: BUG:unable_to_handle_page_fault_for_address

2024-01-28 Thread Linus Torvalds
On Sun, 28 Jan 2024 at 18:59, kernel test robot  wrote:
>
>   21:   48 8b 47 f8 mov-0x8(%rdi),%rax
>   25:   48 85 c0test   %rax,%rax
>   28:   74 11   je 0x3b
>   2a:*  f6 40 78 02 testb  $0x2,0x78(%rax)  <-- trapping 
> instruction

So this is

struct tracefs_inode *ti = get_tracefs(inode);
struct eventfs_inode *ei = ti->private;

if (!ei || !ei->is_events || ..

in set_top_events_ownership(), and it's that 'ei->is_events' test that oopses.

The 'inode' is the incoming argument (in %rdi), and you don't see code
generation for the "get_tracefs(inode)" because it's just an offset
off the inode.

So the "ti->private" read is that read off "-8(%rdi)", because

  struct tracefs_inode {
unsigned long   flags;
void*private;
struct inodevfs_inode;
  };

so 'private' is 8 bytes below the 'struct inode' pointer.

So 'ei' isn't NULL, but it's a bad pointer, and 'ei->is_events' is
that "testb  $0x2,0x78(%rax)" and it oopses as a result.

I don't think this is directly related to that commit 852e46e239ee
("eventfs: Do not create dentries nor inodes in iterate_shared") that
the kernel test robot talks about.

It looks like some inode creation never filled in the ->private field
at all, and it's just garbage.

I note that we have code like this:

 create_dir_dentry():
...
mutex_unlock(_mutex);

dentry = create_dir(ei, parent);

mutex_lock(_mutex);
...
if (!ei->dentry && !ei->is_freed) {
ei->dentry = dentry;
eventfs_post_create_dir(ei);
dentry->d_fsdata = ei;
} else {

and that eventfs_post_create_dir() seems to be where it fills in that
->private pointer:

ti = get_tracefs(ei->dentry->d_inode);
ti->private = ei;

but notice how create_dir() has done that

d_instantiate(dentry, inode);

which exposes the inode to lookup - long before it's actually then filled in.

IOW, what I think is going on is a very basic race, where
create_dir_dentry() will allocate the inode and attach it to the
dentry long before the inode has been fully initialized.

So if somebody comes in *immediately* after that, and does a 'stat()'
on that name that now is exposed, it will see an inode that has not
yet made it to that eventfs_post_create_dir() phase, and that in turn
explains why

struct eventfs_inode *ei = ti->private;

just reads random garbage values.

So if I'm right (big "if" - it looks likely, but who knows) this bug
is entirely unrelated to any dentry caching or any reference counting.

It just needs just the right timing, where one CPU happens to do a
'stat()' just as another one creates the directory.

It's not a big window, so you do need some timing "luck".

End result: the d_instantiate() needs to be done *after* the inode has
been fully filled in.

Alternatively, you could

 (a) not drop the eventfs_mutex around the create_dir() call

 (b) take the eventfs_mutex around all of set_top_events_ownership()

and just fix it by having the lock protect the lack of ordering.

 Linus



Re: [PATCH] eventfs: Give files a default of PAGE_SIZE size

2024-01-26 Thread Linus Torvalds
On Fri, 26 Jan 2024 at 10:41, Steven Rostedt  wrote:
>
> Fine, but I still plan on sending you the update to give all files unique
> inode numbers. If it screws up tar, it could possibly screw up something
> else.

Well, that in many ways just regularizes the code, and the dynamic
inode numbers are actually prettier than the odd fixed date-based one
you picked. I assume it's your birthdate (although I don't know what
the directory ino number was).

   Linus



Re: [PATCH] eventfs: Give files a default of PAGE_SIZE size

2024-01-26 Thread Linus Torvalds
On Fri, 26 Jan 2024 at 10:18, Steven Rostedt  wrote:
>
> By following what sysfs does, and give files a default size of PAGE_SIZE,
> it allows the tar to work. No event file is greater than PAGE_SIZE.

No, please. Just don't.

Nobody has asked for this, and nobody sane should use 'tar' on tracefs anyway.

It hasn't worked before, so saying "now you can use tar" is just a
*bad* idea. There is no upside, only downsides, with tar either (a)
not working at all on older kernels or (b) complaining about how the
size isn't reliable on newer ones.

So please. You should *NOT* look at "this makes tar work, albeit badly".

You should look at whether it improves REAL LOADS. And it doesn't. All
it does is add a hack for a bad idea. Leave it alone.

   Linus



Re: [PATCH v3 1/2] eventfs: Have the inodes all for files and directories all be the same

2024-01-22 Thread Linus Torvalds
On Mon, 22 Jan 2024 at 13:59, Darrick J. Wong  wrote:
>
>  though I don't think
> leaking raw kernel pointers is an awesome idea.

Yeah, I wasn't all that comfortable even with trying to hash it
(because I think the number of source bits is small enough that even
with a crypto hash, it's trivially brute-forceable).

See

   https://lore.kernel.org/all/20240122152748.46897...@gandalf.local.home/

for the current patch under discussion (and it contains a link _to_
said discussion).

   Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 14:14, Al Viro  wrote:
>
> On Wed, Jan 03, 2024 at 01:54:36PM -0800, Linus Torvalds wrote:
>
> > Again: UNTESTED, and meant as a "this is another way to avoid messing
> > with the dentry tree manually, and just using the VFS interfaces we
> > already have"
>
> That would break chown(), though.

Right,. That's why I had that note about

   So take this as a "this might work", but it probably needs a bit more
   work - eventfs_set_attr() should set some bit in the inode to say
   "these have been set manually", and then revalidate would say "I'll
   not touch inodes that have that bit set".

and how my example patch overrides things a bit *too* aggressively.

That said, I think the patch that Steven just sent out is the right
direction to go regardless. The d_revalidate() thing was literally
just a "we can do this many different ways".

 Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 14:04, Steven Rostedt  wrote:
>
> I actually have something almost working too. Here's the WIP. It only works
> for tracefs, and now eventfs needs to be updated as the "events" directory
> no longer has the right ownership. So I need a way to link the eventfs
> entries to use the tracefs default conditionally.

Ack. So the ->getattr() and ->permission() thing is a bit more
targeted to "look at modes", and is probably better just for that
reason.

Doing it in d_revalidate() is a bit hacky, and impacts path lookup a
bit even when not necessary. But it's still a lot less evil than
walking the dentry tree manually.

So that d_revalidate() patch was more of a "I think you can make it
smaller by just hooking in at this layer"). So d_revalidate ends up
with a smaller patch, I think, but it has the problem that now you
*have* to be able to deal with things in RCU context.

In contrast, doing it in ->getattr() and ->permission() ends up
meaning you can use sleeping locks etc if you need to serialize, for
example. So it's a bit more specific to the whole issue of "deal with
modes and owndership", but it is *also* a bit more flexible in how you
can then do it.

Anyway, your patch looks fine from a quick scan.

  Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 13:54, Linus Torvalds
 wrote:
>
> Here's an updated patch that builds, and is PURELY AN EXAMPLE.

Oh, and while doing this patch, I found another bug in tracefs,
although it happily is one that doesn't have any way to trigger.

Tracefs has code like this:

if (dentry->d_inode->i_mode & S_IFDIR) {

and that's very wrong. S_IFDIR is not a bitmask, it's a value that is
part of S_IFMT.

The reason this bug doesn't have any way to trigger is that I think
tracefs can only have S_IFMT values of S_IFDIR and S_IFREG, and those
happen to not have any bits in common, so doing it as a bit test is
wrong, but happens to work.

The test *should* be done as

if (S_ISDIR(dentry->d_inode->i_mode)) {

(note "IS" vs "IF" - not the greatest user experience ever, but hey,
it harkens back to Ye Olden Times).

Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 11:57, Linus Torvalds
 wrote:
>
> Or, you know, you could do what I've told you to do at least TEN TIMES
> already, which is to not mess with any of this, and just implement the
> '->permission()' callback (and getattr() to just make 'ls' look sane
> too, rather than silently saying "we'll act as if gid is set right,
> but not show it").

Actually, an even simpler option might be to just do this all at
d_revalidate() time.

Here's an updated patch that builds, and is PURELY AN EXAMPLE. I think
it "works", but it currently always resets the inode mode/uid/gid
unconditionally, which is wrong - it should not do so if the inode has
been manually set.

So take this as a "this might work", but it probably needs a bit more
work - eventfs_set_attr() should set some bit in the inode to say
"these have been set manually", and then revalidate would say "I'll
not touch inodes that have that bit set".

Or something.

Anyway, this patch is nwo relative to your latest pull request, so it
has the check for dentry->d_inode in set_gid() (and still removes the
whole function).

Again: UNTESTED, and meant as a "this is another way to avoid messing
with the dentry tree manually, and just using the VFS interfaces we
already have"

   Linus
 fs/tracefs/inode.c | 147 ++---
 1 file changed, 26 insertions(+), 121 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index bc86ffdb103b..5bc9e1a23a31 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -183,87 +183,6 @@ struct tracefs_fs_info {
 	struct tracefs_mount_opts mount_opts;
 };
 
-static void change_gid(struct dentry *dentry, kgid_t gid)
-{
-	if (!dentry->d_inode)
-		return;
-	dentry->d_inode->i_gid = gid;
-}
-
-/*
- * Taken from d_walk, but without he need for handling renames.
- * Nothing can be renamed while walking the list, as tracefs
- * does not support renames. This is only called when mounting
- * or remounting the file system, to set all the files to
- * the given gid.
- */
-static void set_gid(struct dentry *parent, kgid_t gid)
-{
-	struct dentry *this_parent;
-	struct list_head *next;
-
-	this_parent = parent;
-	spin_lock(_parent->d_lock);
-
-	change_gid(this_parent, gid);
-repeat:
-	next = this_parent->d_subdirs.next;
-resume:
-	while (next != _parent->d_subdirs) {
-		struct tracefs_inode *ti;
-		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
-		next = tmp->next;
-
-		/* Note, getdents() can add a cursor dentry with no inode */
-		if (!dentry->d_inode)
-			continue;
-
-		spin_lock_nested(>d_lock, DENTRY_D_LOCK_NESTED);
-
-		change_gid(dentry, gid);
-
-		/* If this is the events directory, update that too */
-		ti = get_tracefs(dentry->d_inode);
-		if (ti && (ti->flags & TRACEFS_EVENT_INODE))
-			eventfs_update_gid(dentry, gid);
-
-		if (!list_empty(>d_subdirs)) {
-			spin_unlock(_parent->d_lock);
-			spin_release(>d_lock.dep_map, _RET_IP_);
-			this_parent = dentry;
-			spin_acquire(_parent->d_lock.dep_map, 0, 1, _RET_IP_);
-			goto repeat;
-		}
-		spin_unlock(>d_lock);
-	}
-	/*
-	 * All done at this level ... ascend and resume the search.
-	 */
-	rcu_read_lock();
-ascend:
-	if (this_parent != parent) {
-		struct dentry *child = this_parent;
-		this_parent = child->d_parent;
-
-		spin_unlock(>d_lock);
-		spin_lock(_parent->d_lock);
-
-		/* go into the first sibling still alive */
-		do {
-			next = child->d_child.next;
-			if (next == _parent->d_subdirs)
-goto ascend;
-			child = list_entry(next, struct dentry, d_child);
-		} while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
-		rcu_read_unlock();
-		goto resume;
-	}
-	rcu_read_unlock();
-	spin_unlock(_parent->d_lock);
-	return;
-}
-
 static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts)
 {
 	substring_t args[MAX_OPT_ARGS];
@@ -315,49 +234,12 @@ static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts)
 	return 0;
 }
 
-static int tracefs_apply_options(struct super_block *sb, bool remount)
-{
-	struct tracefs_fs_info *fsi = sb->s_fs_info;
-	struct inode *inode = d_inode(sb->s_root);
-	struct tracefs_mount_opts *opts = >mount_opts;
-	umode_t tmp_mode;
-
-	/*
-	 * On remount, only reset mode/uid/gid if they were provided as mount
-	 * options.
-	 */
-
-	if (!remount || opts->opts & BIT(Opt_mode)) {
-		tmp_mode = READ_ONCE(inode->i_mode) & ~S_IALLUGO;
-		tmp_mode |= opts->mode;
-		WRITE_ONCE(inode->i_mode, tmp_mode);
-	}
-
-	if (!remount || opts->opts & BIT(Opt_uid))
-		inode->i_uid = opts->uid;
-
-	if (!remount || opts->opts & BIT(Opt_gid)) {
-		/* Set all the group ids to the mount option */
-		set_gid(sb->s_root, opts->gid);
-	}
-
-	return 0;
-}
-
 static int tracefs_remount(struct sup

Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 11:52, Steven Rostedt  wrote:
>
> This doesn't work because for tracefs (not eventfs) the dentries are
> created at boot up and before the file system is mounted. This means you
> can't even set a gid in /etc/fstab. This will cause a regression.

Which is why I suggested

   "I think the whole thing was triggered by commit 49d67e445742, and
maybe the fix is to just revert that commit"

there was never any coherent reason for that commit, since the
permissions are dealt with at the mount point.

So this all was triggered by that original change that makes little
sense. The fact that you then apparently changed other things
afterwards too might need fixing.

Or, you know, you could do what I've told you to do at least TEN TIMES
already, which is to not mess with any of this, and just implement the
'->permission()' callback (and getattr() to just make 'ls' look sane
too, rather than silently saying "we'll act as if gid is set right,
but not show it").

Why do you keep bringing up things that I've told you solutions for many times?

 Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 10:50, Steven Rostedt  wrote:
> I think these changes are a bit much for -rc8, don't you?

Oh, absolutely.

None of this is rc8 material apart from the oops fix in your pull
request (which my patch that then removes the whole function did *not*
have - so that patch won't apply as-is to your tree).

But let's aim for a tracefs that doesn't play games with the dcache.

Basically, the dcache is *much* too subtle for a filesystem to mess
with. You should either:

 - be a fully virtual filesystem where the dcache just maintains
everything, and you don't mess with it because you don't need to (eg
tmpfs etc). Everything is in the dcache, and you don't need to touch
it, because you don't even care - the dcache is doing everything for
you.

 - be a "normal" filesystem where the dcache is just a cache, and you
maintain your *own* file data structures, and just get normal lookup
etc ops, and you don't mess with the dcache because it is just doing
its caching thing that you as a filesystem don't care about.

and in both of those cases the filesystem just never has to really
delve into it. But tracefs had this abomination where it maintained
its own data structures, _and_ it tried to make them coherent with the
dcache that maintained part of it. That's the part I hated.

   Linus



Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 10:12, Linus Torvalds
 wrote:
>
> Much better. Now eventfs looks more like a real filesystem, and less
> like an eldritch horror monster that is parts of dcache tackled onto a
> pseudo-filesystem.

Oh, except I think you still need to just remove the 'set_gid()' mess.

It's disgusting and it's wrong, and it's not even what the 'uid'
option does (it only sets the root inode uid).

If you remount the filesystem with different gid values, you get to
keep both broken pieces. And if it isn't a remount, then setting the
root uid is sufficient.

I think the whole thing was triggered by commit 49d67e445742, and
maybe the fix is to just revert that commit.

That commit makes no sense in general, since the default mounting
position for tracefs that the kernel sets up is only accessible to
root anyway.

Alternatively, just do the ->permissions() thing, and allow access to
the group in the mount options.

Getting rid of set_gid() would be this attached lovely patch:

 fs/tracefs/inode.c | 83 ++
 1 file changed, 2 insertions(+), 81 deletions(-)

and would get rid of the final (?) piece of disgusting dcache hackery
that tracefs most definitely should not have.

 Linus
 fs/tracefs/inode.c | 83 ++
 1 file changed, 2 insertions(+), 81 deletions(-)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 62524b20964e..a22253037e3e 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -183,83 +183,6 @@ struct tracefs_fs_info {
 	struct tracefs_mount_opts mount_opts;
 };
 
-static void change_gid(struct dentry *dentry, kgid_t gid)
-{
-	if (!dentry->d_inode)
-		return;
-	dentry->d_inode->i_gid = gid;
-}
-
-/*
- * Taken from d_walk, but without he need for handling renames.
- * Nothing can be renamed while walking the list, as tracefs
- * does not support renames. This is only called when mounting
- * or remounting the file system, to set all the files to
- * the given gid.
- */
-static void set_gid(struct dentry *parent, kgid_t gid)
-{
-	struct dentry *this_parent;
-	struct list_head *next;
-
-	this_parent = parent;
-	spin_lock(_parent->d_lock);
-
-	change_gid(this_parent, gid);
-repeat:
-	next = this_parent->d_subdirs.next;
-resume:
-	while (next != _parent->d_subdirs) {
-		struct tracefs_inode *ti;
-		struct list_head *tmp = next;
-		struct dentry *dentry = list_entry(tmp, struct dentry, d_child);
-		next = tmp->next;
-
-		spin_lock_nested(>d_lock, DENTRY_D_LOCK_NESTED);
-
-		change_gid(dentry, gid);
-
-		/* If this is the events directory, update that too */
-		ti = get_tracefs(dentry->d_inode);
-		if (ti && (ti->flags & TRACEFS_EVENT_INODE))
-			eventfs_update_gid(dentry, gid);
-
-		if (!list_empty(>d_subdirs)) {
-			spin_unlock(_parent->d_lock);
-			spin_release(>d_lock.dep_map, _RET_IP_);
-			this_parent = dentry;
-			spin_acquire(_parent->d_lock.dep_map, 0, 1, _RET_IP_);
-			goto repeat;
-		}
-		spin_unlock(>d_lock);
-	}
-	/*
-	 * All done at this level ... ascend and resume the search.
-	 */
-	rcu_read_lock();
-ascend:
-	if (this_parent != parent) {
-		struct dentry *child = this_parent;
-		this_parent = child->d_parent;
-
-		spin_unlock(>d_lock);
-		spin_lock(_parent->d_lock);
-
-		/* go into the first sibling still alive */
-		do {
-			next = child->d_child.next;
-			if (next == _parent->d_subdirs)
-goto ascend;
-			child = list_entry(next, struct dentry, d_child);
-		} while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
-		rcu_read_unlock();
-		goto resume;
-	}
-	rcu_read_unlock();
-	spin_unlock(_parent->d_lock);
-	return;
-}
-
 static int tracefs_parse_options(char *data, struct tracefs_mount_opts *opts)
 {
 	substring_t args[MAX_OPT_ARGS];
@@ -332,10 +255,8 @@ static int tracefs_apply_options(struct super_block *sb, bool remount)
 	if (!remount || opts->opts & BIT(Opt_uid))
 		inode->i_uid = opts->uid;
 
-	if (!remount || opts->opts & BIT(Opt_gid)) {
-		/* Set all the group ids to the mount option */
-		set_gid(sb->s_root, opts->gid);
-	}
+	if (!remount || opts->opts & BIT(Opt_gid))
+		inode->i_gid = opts->gid;
 
 	return 0;
 }


Re: [PATCH] eventfs: Stop using dcache_readdir() for getdents()

2024-01-03 Thread Linus Torvalds
On Wed, 3 Jan 2024 at 07:24, Steven Rostedt  wrote:
>
> Instead, just have eventfs have its own iterate_shared callback function
> that will fill in the dent entries. This simplifies the code quite a bit.

Much better. Now eventfs looks more like a real filesystem, and less
like an eldritch horror monster that is parts of dcache tackled onto a
pseudo-filesystem.

However, one request, and one nit:

> Also, remove the "lookup" parameter to the create_file/dir_dentry() and
> always have it return a dentry that has its ref count incremented, and
> have the caller call the dput. This simplifies that code as well.

Can you please do that as a separate patch, where the first patch just
cleans up the directory iteration, and the second patch then goes "now
there are no more callers that have the 'lookup' argument set to
false".

Because as-is, the patch is kind of two things mixed up.

The small nit is this:

> +static int eventfs_iterate(struct file *file, struct dir_context *ctx)
>  {
> +   /*
> +* Need to create the dentries and inodes to have a consistent
> +* inode number.
> +*/
> list_for_each_entry_srcu(ei_child, >children, list,
>  srcu_read_lock_held(_srcu)) {
> -   d = create_dir_dentry(ei, ei_child, parent, false);
> -   if (d) {
> -   ret = add_dentries(, d, cnt);
> -   if (ret < 0)
> -   break;
> -   cnt++;
> +
> +   if (ei_child->is_freed)
> +   continue;
> +
> +   name = ei_child->name;
> +
> +   dentry = create_dir_dentry(ei, ei_child, ei_dentry);
> +   if (!dentry)
> +   goto out;
> +   ino = dentry->d_inode->i_ino;
> +   dput(dentry);
> +
> +   if (c > 0) {
> +   c--;
> +   continue;
> }

Just move this "is the position before this name" up to the top of the
loop. Even above the "is_freed" part.

Let's just always count all the entries in the child list.

And same for the ei->nr_entries loop:

> for (i = 0; i < ei->nr_entries; i++) {

where there's no point in creating that dentry just to look up the
inode number, only to then decide "Oh, we already iterated past this
part, so let's not do anything with it".

This wouldn't seem to matter much with a big enough getdents buffer
(which is the normal user level behavior), but it actually does,
because we don't keep track of "we have read to the end of the
directory".

So every readdir ends up effectively doing getdents at least twice:
once to read the directory entries, and then once to just be told
"that was all".

End result: you should strive very hard to *not* waste time on the
directory entries that have already been read, and are less than
'ctx->pos'.

 Linus



Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-30 Thread Linus Torvalds
On Sat, 30 Dec 2023 at 12:41, Linus Torvalds
 wrote:
>
> UNTESTED patch to just do the "this_cpu_write()" parts attached.
> Again, note how we do end up doing that this_cpu_ptr conversion later
> anyway, but at least it's off the critical path.

Also note that while 'this_cpu_ptr()' doesn't exactly generate lovely
code, it really is still better than caching a value in memory.

At least the memory location that 'this_cpu_ptr()' accesses is
slightly more likely to be hot (and is right next to the cpu number,
iirc).

That said, I think we should fix this_cpu_ptr() to not ever generate
that disgusting cltq just because the cpu pointer has the wrong
signedness. I don't quite know how to do it, but this:

  -#define per_cpu_offset(x) (__per_cpu_offset[x])
  +#define per_cpu_offset(x) (__per_cpu_offset[(unsigned)(x)])

at least helps a *bit*. It gets rid of the cltq, at least, but if
somebody actually passes in an 'unsigned long' cpuid, it would cause
an unnecessary truncation.

And gcc still generates

subl$1, %eax#, cpu_nr
addq__per_cpu_offset(,%rax,8), %rcx

instead of just doing

addq__per_cpu_offset-8(,%rax,8), %rcx

because it still needs to clear the upper 32 bits and doesn't know
that the 'xchg()' already did that.

Oh well. I guess even without the -1/+1 games by the OSQ code, we
would still end up with a "movl" just to do that upper bits clearing
that the compiler doesn't know is unnecessary.

I don't think we have any reasonable way to tell the compiler that the
register output of our xchg() inline asm has the upper 32 bits clear.

  Linus



Re: [PATCH next 4/5] locking/osq_lock: Optimise per-cpu data accesses.

2023-12-30 Thread Linus Torvalds
On Fri, 29 Dec 2023 at 12:57, David Laight  wrote:
>
> this_cpu_ptr() is rather more expensive than raw_cpu_read() since
> the latter can use an 'offset from register' (%gs for x86-84).
>
> Add a 'self' field to 'struct optimistic_spin_node' that can be
> read with raw_cpu_read(), initialise on first call.

No, this is horrible.

The problem isn't the "this_cpu_ptr()", it's the rest of the code.

>  bool osq_lock(struct optimistic_spin_queue *lock)
>  {
> -   struct optimistic_spin_node *node = this_cpu_ptr(_node);
> +   struct optimistic_spin_node *node = raw_cpu_read(osq_node.self);

No. Both of these are crap.

> struct optimistic_spin_node *prev, *next;
> int old;
>
> -   if (unlikely(node->cpu == OSQ_UNLOCKED_VAL))
> -   node->cpu = encode_cpu(smp_processor_id());
> +   if (unlikely(!node)) {
> +   int cpu = encode_cpu(smp_processor_id());
> +   node = decode_cpu(cpu);
> +   node->self = node;
> +   node->cpu = cpu;
> +   }

The proper fix here is to not do that silly

node = this_cpu_ptr(_node);
..
node->next = NULL;

dance at all, but to simply do

this_cpu_write(osq_node.next, NULL);

in the first place. That makes the whole thing just a single store off
the segment descriptor.

Yes, you'll eventually end up doing that

node = this_cpu_ptr(_node);

thing because it then wants to use that raw pointer to do

WRITE_ONCE(prev->next, node);

but that's a separate issue and still does not make it worth it to
create a pointless self-pointer.

Btw, if you *really* want to solve that separate issue, then make the
optimistic_spin_node struct not contain the pointers at all, but the
CPU numbers, and then turn those numbers into the pointers the exact
same way it does for the "lock->tail" thing, ie doing that whole

prev = decode_cpu(old);

dance. That *may* then result in avoiding turning them into pointers
at all in some cases.

Also, I think that you might want to look into making OSQ_UNLOCKED_VAL
be -1 instead, and add something like

  #define IS_OSQ_UNLOCKED(x) ((int)(x)<0)

and that would then avoid the +1 / -1 games in encoding/decoding the
CPU numbers. It causes silly code generated like this:

subl$1, %eax#, cpu_nr
...
cltq
addq__per_cpu_offset(,%rax,8), %rcx

which seems honestly stupid. The cltq is there for sign-extension,
which is because all these things are "int", and the "subl" will
zero-extend to 64-bit, not sign-extend.

At that point, I think gcc might be able to just generate

addq__per_cpu_offset-8(,%rax,8), %rcx

but honestly, I think it would be nicer to just have decode_cpu() do

unsigned int cpu_nr = encoded_cpu_val;

return per_cpu_ptr(_node, cpu_nr);

and not have the -1/+1 at all.

Hmm?

UNTESTED patch to just do the "this_cpu_write()" parts attached.
Again, note how we do end up doing that this_cpu_ptr conversion later
anyway, but at least it's off the critical path.

 Linus
 kernel/locking/osq_lock.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 75a6f6133866..c3a166b7900c 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -92,14 +92,14 @@ osq_wait_next(struct optimistic_spin_queue *lock,
 
 bool osq_lock(struct optimistic_spin_queue *lock)
 {
-	struct optimistic_spin_node *node = this_cpu_ptr(_node);
+	struct optimistic_spin_node *node;
 	struct optimistic_spin_node *prev, *next;
 	int curr = encode_cpu(smp_processor_id());
 	int old;
 
-	node->locked = 0;
-	node->next = NULL;
-	node->cpu = curr;
+	this_cpu_write(osq_node.next, NULL);
+	this_cpu_write(osq_node.locked, 0);
+	this_cpu_write(osq_node.cpu, curr);
 
 	/*
 	 * We need both ACQUIRE (pairs with corresponding RELEASE in
@@ -112,7 +112,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
 		return true;
 
 	prev = decode_cpu(old);
-	node->prev = prev;
+	this_cpu_write(osq_node.prev, prev);
+
+	node = this_cpu_ptr(_node);
 
 	/*
 	 * osq_lock()			unqueue


Re: [PATCH next 0/5] locking/osq_lock: Optimisations to osq_lock code

2023-12-30 Thread Linus Torvalds
On Fri, 29 Dec 2023 at 12:52, David Laight  wrote:
>
> David Laight (5):
>   Move the definition of optimistic_spin_node into osf_lock.c
>   Clarify osq_wait_next()

I took these two as preparatory independent patches, with that
osq_wait_next() clarification split into two.

I also did the renaming that Waiman asked for.

   Linus



Re: [PATCH next 3/5] locking/osq_lock: Clarify osq_wait_next()

2023-12-29 Thread Linus Torvalds
On Fri, 29 Dec 2023 at 12:56, David Laight  wrote:
>
> osq_wait_next() is passed 'prev' from osq_lock() and NULL from osq_unlock()
> but only needs the 'cpu' value to write to lock->tail.
> Just pass prev->cpu or OSQ_UNLOCKED_VAL instead.
>
> Also directly return NULL or 'next' instead of breaking the loop.

Please split these two totally independent things out of the patch,
just to make things much more obvious.

I like the new calling convention, but I don't like how the patch
isn't obviously just that.

In fact, I'd take your patch #1 and just the calling convention change
from #3 as "these are obviously not changing anything at all, only
moving things to more local places".

I'd also take the other part of #3 as a "clearly doesn't change
anything" but it should be a separate patch, and it should be done
differently: make 'next' be local to just *inside* the for-loop (in
fact, make it local to the if-statement that sets it), to clarify the
whole thing that it can never be non-NULL at the top of the loop, and
can never have any long-term semantics.

The other parts actually change some logic, and would need the OSQ
people to take a more serious look.

Linus



Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-14 Thread Linus Torvalds
On Thu, 14 Dec 2023 at 12:35, Steven Rostedt  wrote:
>
> On Thu, 14 Dec 2023 11:44:55 -0800
> Linus Torvalds  wrote:
>
> > On Thu, 14 Dec 2023 at 08:55, Steven Rostedt  wrote:
> > >
> > > And yes, this does get called in NMI context.
> >
> > Not on an i486-class machine they won't. You don't have a local apic
> > on those, and they won't have any NMI sources under our control (ie
> > NMI does exist, but we're talking purely legacy NMI for "motherboard
> > problems" like RAM parity errors etc)
>
> Ah, so we should not worry about being in NMI context without a 64bit cmpxchg?

.. on x86.

Elsewhere, who knows?

It is *probably* true in most situations. '32-bit' => 'legacy' =>
'less likely to have fancy profiling / irq setups'.

But I really don't know.

> > So no. You need to forget about the whole "do a 64-bit cmpxchg on
> > 32-bit architectures" as being some kind of solution in the short
> > term.
>
> But do all archs have an implementation of cmpxchg64, even if it requires
> disabling interrupts? If not, then I definitely cannot remove this code.

We have a generic header file, so anybody who uses that would get the
fallback version, ie

arch_cmpxchg64 -> generic_cmpxchg64_local -> __generic_cmpxchg64_local

which does that irq disabling thing.

But no, not everybody is guaranteed to use that fallback. From a quick
look, ARC, hexagon and CSky don't do this, for example.

And then I got bored and stopped looking.

My guess is that *most* 32-bit architectures do not have a 64-bit
cmpxchg - not even the irq-safe one.

For the UP case you can do your own, of course.

Linus



Re: [PATCH v3] ring-buffer: Remove 32bit timestamp logic

2023-12-14 Thread Linus Torvalds
On Thu, 14 Dec 2023 at 12:30, Linus Torvalds
 wrote:
>
> Read my email. Don't do random x86-centric things. We have that
>
>   #ifndef system_has_cmpxchg64
>   #define system_has_cmpxchg64() false
>   #endif
>
> which should work.

And again, by "should work" I mean that it would disable this entirely
on things like arm32 until the arm people decide they care. But at
least it won't use an unsafe non-working 64-bit cmpxchg.

And no, for 6.7, only fix reported bugs. No big reorgs at all,
particularly for something that likely has never been hit by any user
and sounds like this all just came out of discussion.

  Linus



Re: [PATCH v3] ring-buffer: Remove 32bit timestamp logic

2023-12-14 Thread Linus Torvalds
On Thu, 14 Dec 2023 at 12:18, Steven Rostedt  wrote:
>
> For this issue of the 64bit cmpxchg, is there any config that works for any
> arch that do not have a safe 64-bit cmpxchg? At least for 486, is the
> second half of the if condition reasonable?
>
> if (IS_ENABLED(CONFIG_X86_32) && !IS_ENABLED(CONFIG_X86_CMPXCHG64)) {
> if (unlikely(in_nmi()))
> return NULL;
> }

No.

Read my email. Don't do random x86-centric things. We have that

  #ifndef system_has_cmpxchg64
  #define system_has_cmpxchg64() false
  #endif

which should work.

NOTE! The above is for 32-bit architectures only! For 64-bit ones
either just use cmpxchg directly. And if you need a 128-bit one,
there's system_has_cmpxchg128...

 Linus



Re: [PATCH v3] ring-buffer: Remove 32bit timestamp logic

2023-12-14 Thread Linus Torvalds
On Thu, 14 Dec 2023 at 09:53, Steven Rostedt  wrote:
>
> +   /*
> +* For architectures that can not do cmpxchg() in NMI, or require
> +* disabling interrupts to do 64-bit cmpxchg(), do not allow them
> +* to record in NMI context.
> +*/
> +   if ((!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) ||
> +(IS_ENABLED(CONFIG_X86_32) && 
> !IS_ENABLED(CONFIG_X86_CMPXCHG64))) &&
> +   unlikely(in_nmi())) {
> +   return NULL;
> +   }

Again, this is COMPLETE GARBAGE.

You're using "ARCH_HAVE_NMI_SAFE_CMPXCHG" to test something that just
isn't what it's about.

Having a NMI-safe cmpxchg does *not* mean that you actualyl have a
NMI-safe 64-bit version.

You can't test it that way.

Stop making random changes that just happen to work on the one machine
you tested it on.

   Linus



Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-14 Thread Linus Torvalds
On Thu, 14 Dec 2023 at 08:55, Steven Rostedt  wrote:
>
> And yes, this does get called in NMI context.

Not on an i486-class machine they won't. You don't have a local apic
on those, and they won't have any NMI sources under our control (ie
NMI does exist, but we're talking purely legacy NMI for "motherboard
problems" like RAM parity errors etc)

> I had a patch that added:
>
> +   /* ring buffer does cmpxchg, make sure it is safe in NMI context */
> +   if (!IS_ENABLED(CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG) &&
> +   (unlikely(in_nmi( {
> +   return NULL;
> +   }

CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG doesn't work on x86 in this context,
because the issue is that yes, there's a safe 'cmpxchg', but that's
not what you want.

You want the save cmpxchg64, which is an entirely different beast.

And honestly, I don't think that NMI_SAFE_CMPXCHG covers the
double-word case anywhere else either, except purely by luck.

In mm/slab.c, we also use a double-wide cmpxchg, and there the rule
has literally been that it's conditional on

 (a) system_has_cmpxchg64() existing as a macro

 (b) using that macro to then gate - at runtime - whether it actually
works or not

I think - but didn't check - that we essentially only enable the
two-word case on x86 as a result, and fall back to the slow case on
all other architectures - and on the i486 case.

That said, other architectures *do* have a working double-word
cmpxchg, but I wouldn't guarantee it. For example, 32-bit arm does
have one using ldrexd/strexd, but that only exists on arm v6+.

And guess what? You'll silently get a "disable interrupts, do it as a
non-atomic load-store" on arm too for that case. And again, pre-v6 arm
is about as relevant as i486 is, but my point is, that double-word
cmpxchg you rely on simply DOES NOT EXIST on 32-bit platforms except
under special circumstances.

So this isn't a "x86 is the odd man out". This is literally generic.

> Now back to my original question. Are you OK with me sending this to you
> now, or should I send you just the subtle fixes to the 32-bit rb_time_*
> code and keep this patch for the merge window?

I'm absolutely not taking some untested random "let's do 64-bit
cmpxchg that we know is broken on 32-bit using broken conditionals"
shit.

What *would* work is that slab approach, which is essentially

  #ifndef system_has_cmpxchg64
  #define system_has_cmpxchg64() false
  #endif

...
if (!system_has_cmpxchg64())
return error or slow case

do_64bit_cmpxchg_case();

(although the slub case is much more indirect, and uses a
__CMPXCHG_DOUBLE flag that only gets set when that define exists etc).

But that would literally cut off support for all non-x86 32-bit architectures.

So no. You need to forget about the whole "do a 64-bit cmpxchg on
32-bit architectures" as being some kind of solution in the short
term.

   Linus



Re: [PATCH] ring-buffer: Remove 32bit timestamp logic

2023-12-13 Thread Linus Torvalds
On Wed, 13 Dec 2023 at 18:45, Steven Rostedt  wrote:
>
> tl;dr;  The ring-buffer timestamp requires a 64-bit cmpxchg to keep the
> timestamps in sync (only in the slow paths). I was told that 64-bit cmpxchg
> can be extremely slow on 32-bit architectures. So I created a rb_time_t
> that on 64-bit was a normal local64_t type, and on 32-bit it's represented
> by 3 32-bit words and a counter for synchronization. But this now requires
> three 32-bit cmpxchgs for where one simple 64-bit cmpxchg would do.

It's not that a 64-bit cmpxchg is even slow. It doesn't EXIST AT ALL
on older 32-bit x86 machines.

Which is why we have

arch/x86/lib/cmpxchg8b_emu.S

which emulates it on machines that don't have the CX8 capability
("CX8" being the x86 capability flag name for the cmpxchg8b
instruction, aka 64-bit cmpxchg).

Which only works because those older 32-bit cpu's also don't do SMP,
so there are no SMP cache coherency issues, only interrupt atomicity
issues.

IOW, the way to do an atomic 64-bit cmpxchg on the affected hardware
is to simply disable interrupts.

In other words - it's not just slow.  It's *really* slow. As in 10x
slower, not "slightly slower".

> We started discussing how much time this is actually saving to be worth the
> complexity, and actually found some hardware to test. One Atom processor.

That atom processor won't actually show the issue. It's much too
recent. So your "test" is actually worthless.

And you probably did this all with a kernel config that had
CONFIG_X86_CMPXCHG64 set anyway, which wouldn't even boot on a i486
machine.

So in fact your test was probably doubly broken, in that not only
didn't you test the slow case, you tested something that wouldn't even
have worked in the environment where the slow case happened.

Now, the real question is if anybody cares about CPUs that don't have
cmpxchg8b support.

IBecause in practice, it's really just old 486-class machines (and a
couple of clone manufacturers who _claimed_ to be Pentium class, but
weren't - there was also some odd thing with Windows breaking if you
had CPUID claiming to support CX8

We dropped support for the original 80386 some time ago. I'd actually
be willing to drop support for ll pre-cmpxchg8b machines, and get rid
of the emulation.

I also suspect that from a perf angle, none of this matters. The
emulation being slow probably is a non-issue, simply because even if
you run on an old i486 machine, you probably won't be doing perf or
tracing on it.

 Linus



Re: [GIT PULL] Modules changes for v6.7-rc1

2023-11-02 Thread Linus Torvalds
On Wed, 1 Nov 2023 at 21:02, Linus Torvalds
 wrote:
>
> kmalloc() isn't just about "use physically contiguous allocations".
> It's also more memory-efficient, and a *lot* faster than vmalloc(),
> which has to play VM tricks.

I've pulled this, but I think you should do something like the
attached (UNTESTED!) patch.

Linus
 kernel/module/decompress.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/kernel/module/decompress.c b/kernel/module/decompress.c
index 4156d59be440..474e68f0f063 100644
--- a/kernel/module/decompress.c
+++ b/kernel/module/decompress.c
@@ -100,7 +100,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 	s.next_in = buf + gzip_hdr_len;
 	s.avail_in = size - gzip_hdr_len;
 
-	s.workspace = vmalloc(zlib_inflate_workspacesize());
+	s.workspace = kvmalloc(zlib_inflate_workspacesize(), GFP_KERNEL);
 	if (!s.workspace)
 		return -ENOMEM;
 
@@ -138,7 +138,7 @@ static ssize_t module_gzip_decompress(struct load_info *info,
 out_inflate_end:
 	zlib_inflateEnd();
 out:
-	vfree(s.workspace);
+	kvfree(s.workspace);
 	return retval;
 }
 #elif defined(CONFIG_MODULE_COMPRESS_XZ)
@@ -241,7 +241,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	}
 
 	wksp_size = zstd_dstream_workspace_bound(header.windowSize);
-	wksp = vmalloc(wksp_size);
+	wksp = kvmalloc(wksp_size, GFP_KERNEL);
 	if (!wksp) {
 		retval = -ENOMEM;
 		goto out;
@@ -284,7 +284,7 @@ static ssize_t module_zstd_decompress(struct load_info *info,
 	retval = new_size;
 
  out:
-	vfree(wksp);
+	kvfree(wksp);
 	return retval;
 }
 #else


Re: [GIT PULL] Modules changes for v6.7-rc1

2023-11-02 Thread Linus Torvalds
On Wed, 1 Nov 2023 at 10:13, Luis Chamberlain  wrote:
>
> The only thing worth highligthing is that gzip moves to use vmalloc() instead 
> of
> kmalloc just as we had a fix for this for zstd on v6.6-rc1.

Actually, that's almost certainly entirely the wrong thing to do.

Unless you *know* that the allocation is large, you shouldn't use
vmalloc(). And since kmalloc() has worked fine, you most definitely
don't know that.

So we have 'kvmalloc()' *exactly* for this reason, which is a "use
kmalloc, unless that is too small, then use vmalloc".

kmalloc() isn't just about "use physically contiguous allocations".
It's also more memory-efficient, and a *lot* faster than vmalloc(),
which has to play VM tricks.

So this "just switch to vmalloc()" is entirely wrong.

  Linus



Re: [GIT PULL] NVDIMM and DAX for 6.5

2023-07-01 Thread Linus Torvalds
On Fri, 30 Jun 2023 at 12:17, Verma, Vishal L  wrote:
>
> On an operational note, as Dan handed off the branch to me for this cycle, we
> missed that the original few commits were inadvertently made on top of a few
> CXL commits that went in in the 6.4-rc cycle via the CXL tree.
>
> git-request-pull included these, and hence they appear in the shortlog and
> diffstat below, but the actual merge correctly identifies and skips over them.
> I kept it as it is to preserve the linux-next soak time, but if I should have
> done it differently, please let me know.

No, this was the right thing to do. Apart from a slightly odd choice
of base for this all, the pull looks perfectly normal.

Thanks,
Linus



Re: [RFC][PATCH] fix short copy handling in copy_mc_pipe_to_iter()

2022-06-13 Thread Linus Torvalds
On Sun, Jun 12, 2022 at 5:10 PM Al Viro  wrote:
>
> Unlike other copying operations on ITER_PIPE, copy_mc_to_iter() can
> result in a short copy.  In that case we need to trim the unused
> buffers, as well as the length of partially filled one - it's not
> enough to set ->head, ->iov_offset and ->count to reflect how
> much had we copied.  Not hard to fix, fortunately...
>
> I'd put a helper (pipe_discard_from(pipe, head)) into pipe_fs_i.h,
> rather than iov_iter.c -

Actually, since this "copy_mc_xyz()" stuff is going to be entirely
impossible to debug and replicate for any normal situation, I would
suggest we take the approach that we (long ago) used to take with
copy_from_user(): zero out the destination buffer, so that developers
that can't test the faulting behavior don't have to worry about it.

And then the existing code is fine: it will break out of the loop, but
it won't do the odd revert games and the "randomnoise.len -= rem"
thing that I can't wrap my head around.

Hmm?

Linus



Re: [PATCH v3.4] capabilities: require CAP_SETFCAP to map uid 0

2021-04-20 Thread Linus Torvalds
On Tue, Apr 20, 2021 at 9:47 AM Christian Brauner
 wrote:
>
> If there's no objections then Linus can probably just pick up the single
> patch here directly:

I'm ok with that assuming I don't hear any last-minute concerns..

I'll plan on appling it later today, so anybody with concerns please
holler quickly.

I don't want to leave it to much later in the week, and I may or may
not be functional tomorrow (getting my second vaccine shot, some
people react more strongly to it, so I'm leaving the possibility open
of not getting out of bed ;)

 Linus


Re: [patch 11/12] gcov: clang: fix clang-11+ build

2021-04-19 Thread Linus Torvalds
On Mon, Apr 19, 2021 at 2:37 PM Nathan Chancellor  wrote:
>
> This should not have been merged into mainline by itself. It was a fix
> for "gcov: use kvmalloc()", which is still in -mm/-next. Merging it
> alone has now broken the build:
>
> https://github.com/ClangBuiltLinux/continuous-integration2/runs/2384465683?check_suite_focus=true
>
> Could it please be reverted in mainline [..]

Now reverted in my tree.

Sasha and stable cc'd too, since it was apparently auto-selected there..

   Linus


Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Linus Torvalds
On Mon, Apr 19, 2021 at 11:38 AM Paolo Bonzini  wrote:
>
> It changes it for the worse, in that access to fields that are shared
> across threads *must* either use atomic types

Well, we won't be using those broken types in the core kernel, so that
would all be entirely on the Rust side.

And I don't expect the Rust side to do a lot of non-locked accesses,
which presumably shouldn't need any of this anyway.

If Rust code ends up accessing actual real kernel data structures with
memory ordering, then that will be to types that do *not* follow the
useless C++ atomics, and that in turn presumably means that it will be
done as "unsafe" helpers that do what the LKMM does (ie READ_ONCE()
and all the rest of it).

 Linus


Re: [PATCH 00/13] [RFC] Rust support

2021-04-19 Thread Linus Torvalds
On Mon, Apr 19, 2021 at 2:36 AM Peter Zijlstra  wrote:
>
> I also don't see how this is better than seq_cst.
>
> But yes, not broken, but also very much not optimal.

I continue to feel like kernel people should just entirely ignore the
C++ memory ordering standard.

It's inferior to what we already have, and simply not helpful. It
doesn't actually solve any problems as far as the kernel is concerned,
and it generates its own set of issues (ie assuming that the compiler
supports it, and assuming the compiler gets it right).

The really subtle cases that it could have been helpful for (eg RCU,
or the load-store control dependencies) were _too_ subtle for the
standard.

And I do not believe Rust changes _any_ of that.

Any kernel Rust code will simply have to follow the LKMM rules, and
use the kernel model for the interfaces. Things like the C++ memory
model is simply not _relevant_ to the kernel.

 Linus


Linux 5.12-rc8

2021-04-18 Thread Linus Torvalds
rik Strupe (1):
  ARM: 9071/1: uprobes: Don't hook on thumb instructions

Hans de Goede (4):
  AMD_SFH: Removed unused activecontrolstatus member from the
amd_mp2_dev struct
  AMD_SFH: Add sensor_mask module parameter
  AMD_SFH: Add DMI quirk table for BIOS-es which don't set the
activestatus bits
  drm/i915/display/vlv_dsi: Do not skip panel_pwr_cycle_delay when
disabling the panel

Hauke Mehrtens (1):
  mtd: rawnand: mtk: Fix WAITRDY break condition and timeout

Heiner Kallweit (1):
  r8169: don't advertise pause in jumbo mode

Hristo Venev (2):
  net: sit: Unregister catch-all devices
  net: ip6_tunnel: Unregister catch-all devices

Jaegeuk Kim (1):
  dm verity fec: fix misaligned RS roots IO

Jakub Kicinski (2):
  ethtool: fix kdoc attr name
  ethtool: pause: make sure we init driver stats

Jason Xing (1):
  i40e: fix the panic when running bpf in xdpdrv mode

Jernej Skrabec (1):
  arm64: dts: allwinner: h6: beelink-gs1: Remove ext. 32 kHz osc reference

Jia-Ju Bai (1):
  HID: alps: fix error return code in alps_input_configured()

Jiapeng Zhong (1):
  HID: wacom: Assign boolean values to a bool variable

Jisheng Zhang (4):
  arm64: kprobes: Restore local irqflag if kprobes is cancelled
  riscv: add do_page_fault and do_trap_break into the kprobes blacklist
  riscv: kprobes/ftrace: Add recursion protection to the ftrace callback
  riscv: keep interrupts disabled for BREAKPOINT exception

Joakim Zhang (1):
  MAINTAINERS: update maintainer entry for freescale fec driver

Johannes Berg (1):
  gcov: clang: fix clang-11+ build

John Paul Adrian Glaubitz (2):
  ia64: tools: remove inclusion of ia64-specific version of errno.h header
  ia64: tools: remove duplicate definition of ia64_mf() on ia64

Jolly Shah (1):
  scsi: libsas: Reset num_scatter if libata marks qc as NODATA

Jonathon Reinhart (1):
  net: Make tcp_allowed_congestion_control readonly in non-init netns

Kefeng Wang (1):
  riscv: Fix spelling mistake "SPARSEMEM" to "SPARSMEM"

Laurent Pinchart (2):
  dmaengine: xilinx: dpdma: Fix descriptor issuing on video group
  dmaengine: xilinx: dpdma: Fix race condition in done IRQ

Lijun Pan (5):
  ibmvnic: correctly use dev_consume/free_skb_irq
  ibmvnic: avoid calling napi_disable() twice
  ibmvnic: remove duplicate napi_schedule call in do_reset function
  ibmvnic: remove duplicate napi_schedule call in open function
      MAINTAINERS: update my email

Linus Torvalds (2):
  readdir: make sure to verify directory entry for legacy interfaces too
  Linux 5.12-rc8

Luke D Jones (1):
  HID: asus: Add support for 2021 ASUS N-Key keyboard

Lv Yunlong (1):
  dmaengine: Fix a double free in dma_async_device_register

Lyude Paul (1):
  drm/i915/dpcd_bl: Don't try vesa interface unless specified by VBT

Marek Behún (1):
  i2c: mv64xxx: Fix random system lock caused by runtime PM

Matti Vaittinen (1):
  gpio: sysfs: Obey valid_mask

Maxime Ripard (2):
  MAINTAINERS: Add our new mailing-list
  MAINTAINERS: Match on allwinner keyword

Michael Brown (1):
  xen-netback: Check for hotplug-status existence before watching

Mike Christie (1):
  scsi: iscsi: Fix iSCSI cls conn state

Nathan Chancellor (1):
  arm64: alternatives: Move length validation in alternative_{insn, endif}

Nicolas Dichtel (2):
  doc: move seg6_flowlabel to seg6-sysctl.rst
  vrf: fix a comment about loopback device

Or Cohen (1):
  net/sctp: fix race condition in sctp_destroy_sock

Pablo Neira Ayuso (3):
  netfilter: flowtable: fix NAT IPv6 offload mangling
  netfilter: conntrack: do not print icmpv6 as unknown via /proc
  netfilter: nftables: clone set element expression template

Pali Rohár (1):
  net: phy: marvell: fix detection of PHY on Topaz switches

Pavel Begunkov (1):
  io_uring: fix early sqd_list removal sqpoll hangs

Peter Collingbourne (1):
  arm64: fix inline asm in load_unaligned_zeropad()

Phillip Potter (1):
  net: geneve: check skb is large enough for IPv4/IPv6 header

Ping Cheng (1):
  HID: wacom: set EV_KEY and EV_ABS only for non-HID_GENERIC type of devices

Rafael J. Wysocki (1):
  ACPI: x86: Call acpi_boot_table_init() after acpi_table_upgrade()

Randy Dunlap (5):
  mm: eliminate "expecting prototype" kernel-doc warnings
  csky: change a Kconfig symbol name to fix e1000 build error
  ia64: remove duplicate entries in generic_defconfig
  ia64: fix discontig.c section mismatches
  lib: remove "expecting prototype" kernel-doc warnings

Reiji Watanabe (1):
  KVM: VMX: Don't use vcpu->run->internal.ndata as an array index

Robert Richter (1):
  cxl/mem: Force array size of mem_commands[] to CXL_MEM_COMMAND_ID_MAX

Russell King (1):
  ARM: footbridge: fix PCI interrupt mapping

Shawn Guo (1):
  soc: qcom: geni: shield geni_icc_get(

Re: [GIT PULL] Re: ARM SoC fixes for v5.12, part 2

2021-04-18 Thread Linus Torvalds
On Sun, Apr 18, 2021 at 12:24 PM Arnd Bergmann  wrote:
>
> I forgot to add the '[GIT PULL]' in the subject line, replying to myself
> here to ensure it hits the right email folder.

Well, the reply hit _my_ search criterion, but the pr-tracker-bot is a
bit more picky.

As a result, I don't think you'll get any automated reply from the bot.

So here's the manual one,

   Linus


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Linus Torvalds
On Sat, Apr 17, 2021 at 1:35 PM Al Viro  wrote:
>
> No, wait - we have non-NULL buf->prev_reclen, so we'll hit
> with buf->error completely ignored.  Nevermind.

Yeah, I'm pretty sure I even tested that -EINTR case at one point.

Of course, it could easily have gotten broken again, so that's not a
strong argument.

That said, the "buf->err" handling has always been very confusing, and
it would be lovely to get rid of that confusion.

I don't remember why we did it that way, but I think it's because
low-level filesystems ended up changing the error that the filldir()
function returned or something like that.

 Linus


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Linus Torvalds
On Sat, Apr 17, 2021 at 12:44 PM Eric Dumazet  wrote:
>
> I thought put_cmsg() callers were from the kernel, with no possibility
> for user to abuse this interface trying to push GB of data.

My point is that "I thought" is not good enough for the unsafe interfaces.

It needs to be "I can see that the arguments are properly verified".

That is literally why they are called "unsafe". You need to make the
uses obviously safe. Because the functions themselves don't do that.

Linus


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Linus Torvalds
On Sat, Apr 17, 2021 at 9:08 AM Linus Torvalds
 wrote:
>
> Side note: I'm, looking at the readdir cases that I wrote, and I have
> to just say that is broken too. So "stones and glass houses" etc, and
> I'll have to fix that too.

In particular, the very very old OLD_READDIR interface that only fills
in one dirent at a time didn't call verify_dirent_name(). Same for the
compat version.

This requires a corrupt filesystem to be an issue (and even then,
most/all would have the length of a directory entry in an 'unsigned
char', so even corrupt filesystems would generally never have a
negative name length).

So I don't think it's an issue in _practice_, but at the same time it
is very much an example of the same issue that put_cmsg() has in
net-next: unsafe user copies should be fully guarded and not have some
"but this would never happen because callers would never do anything
bad".

Al - fairly trivial patch applied, comments?

 Linus
 fs/readdir.c | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/fs/readdir.c b/fs/readdir.c
index 19434b3c982c..09e8ed7d4161 100644
--- a/fs/readdir.c
+++ b/fs/readdir.c
@@ -150,6 +150,9 @@ static int fillonedir(struct dir_context *ctx, const char *name, int namlen,
 
 	if (buf->result)
 		return -EINVAL;
+	buf->result = verify_dirent_name(name, namlen);
+	if (buf->result < 0)
+		return buf->result;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->result = -EOVERFLOW;
@@ -405,6 +408,9 @@ static int compat_fillonedir(struct dir_context *ctx, const char *name,
 
 	if (buf->result)
 		return -EINVAL;
+	buf->result = verify_dirent_name(name, namlen);
+	if (buf->result < 0)
+		return buf->result;
 	d_ino = ino;
 	if (sizeof(d_ino) < sizeof(ino) && d_ino != ino) {
 		buf->result = -EOVERFLOW;


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Linus Torvalds
On Sat, Apr 17, 2021 at 9:03 AM Linus Torvalds
 wrote:
>
> Really. The "unsafe" user accesses are named that way very explicitly,
> and for a very very good reason: the safety needs to be guaranteed and
> obvious within the context of those accesses. Not within some "oh,
> nobody will ever call this with a negative argument" garbage bullshit.

Side note: I'm, looking at the readdir cases that I wrote, and I have
to just say that is broken too. So "stones and glass houses" etc, and
I'll have to fix that too.

 Linus


Re: [PATCH] x86/uaccess: small optimization in unsafe_copy_to_user()

2021-04-17 Thread Linus Torvalds
On Fri, Apr 16, 2021 at 12:24 PM Eric Dumazet  wrote:
>
> From: Eric Dumazet 
>
> We have to loop only to copy u64 values.
> After this first loop, we copy at most one u32, one u16 and one byte.

As Al mentioned, at least in trivial cases the compiler understands
that the subsequent loops can only be executed once, because earlier
loops made sure that 'len' is always smaller than 2*size.

But apparently the problem is the slightly more complex cases where
the compiler just messes up and loses sight of that. Oh well.

So the patch looks fine to me.

HOWEVER.

Looking at the put_cmsg() case in net-next, I'm very *VERY* unhappy
about how you use those "unsafe" user access functions.

Why? Because the point of the "unsafe" is to be a big red flag and
make sure you are very VERY careful with it.

And that code isn't.

In particular, what if the "int len" argument is negative? Maybe it
cannot happen, but when it comes to things like those unsafe user
accessors, I really really want to see that all the arguments are
*checked*.

And you don't.

You do

if (!user_write_access_begin(cm, cmlen))

ahead of time, and that will do basic range checking, but "cmlen" is

sizeof(struct cmsghdr) + (len))

so it's entirely possible that "cmlen" has a valid value, but "len"
(and thus "cmlen - sizeof(*cm)", which is the length argument to the
unsafe user copy) might be negative and that is never checked.

End result: I want people to be a LOT more careful when they use those
unsafe user space accessors. You need to make it REALLY REALLY obvious
that everything you do is safe. And it's not at all obvious in the
context of put_cmsg() - the safety currently relies on every single
caller getting it right.

So either fix "len" to be some restricted type (ie "unsigned short"),
or make really really sure that "len" is valid (ie never negative, and
the cmlen addition didn't overflow.

Really. The "unsafe" user accesses are named that way very explicitly,
and for a very very good reason: the safety needs to be guaranteed and
obvious within the context of those accesses. Not within some "oh,
nobody will ever call this with a negative argument" garbage bullshit.

  Linus


Re: [PATCH 04/13] Kbuild: Rust support

2021-04-16 Thread Linus Torvalds
On Fri, Apr 16, 2021 at 6:38 AM Peter Zijlstra  wrote:
>
> AFAICT rust has try/throw/catch exception handling (like
> C++/Java/others) which is typically implemented with stack unwinding of
> its own.

I was assuming that the kernel side would never do that.

There's some kind of "catch_unwind()" thing that catches a Rust
"panic!" thing, but I think it's basically useless for the kernel.

Typical Rust error handling should match the regular kernel
IS_ERR/ERR_PTR/PTR_ERR model fairly well, although the syntax is
fairly different (and it's not limited to pointers).

Linus


Re: [PATCH 00/13] [RFC] Rust support

2021-04-14 Thread Linus Torvalds
On Wed, Apr 14, 2021 at 1:10 PM Matthew Wilcox  wrote:
>
> There's a philosophical point to be discussed here which you're skating
> right over!  Should rust-in-the-linux-kernel provide the same memory
> allocation APIs as the rust-standard-library, or should it provide a Rusty
> API to the standard-linux-memory-allocation APIs?

Yeah, I think that the standard Rust API may simply not be acceptable
inside the kernel, if it has similar behavior to the (completely
broken) C++ "new" operator.

So anything that does "panic!" in the normal Rust API model needs to
be (statically) caught, and never exposed as an actual call to
"panic()/BUG()" in the kernel.

So "Result" is basically the way to go, and if the standard Rust
library alloc() model is based on "panic!" then that kind of model
must simply not be used in the kernel.

 Linus

   Linus


Re: [PATCH 00/13] [RFC] Rust support

2021-04-14 Thread Linus Torvalds
On Wed, Apr 14, 2021 at 11:46 AM  wrote:
>
> Some of you have noticed the past few weeks and months that
> a serious attempt to bring a second language to the kernel was
> being forged. We are finally here, with an RFC that adds support
> for Rust to the Linux kernel.

So I replied with my reactions to a couple of the individual patches,
but on the whole I don't hate it.

HOWEVER.

I do think that the "run-time failure panic" is a fundamental issue.

I may not understand the ramifications of when it can happen, so maybe
it's less of an issue than I think it is, but very fundamentally I
think that if some Rust allocation can cause a panic, this is simply
_fundamentally_ not acceptable.

Allocation failures in a driver or non-core code - and that is by
definition all of any new Rust code - can never EVER validly cause
panics. Same goes for "oh, some case I didn't test used 128-bit
integers or floating point".

So if the Rust compiler causes hidden allocations that cannot be
caught and returned as errors, then I seriously think that this whole
approach needs to be entirely NAK'ed, and the Rust infrastructure -
whether at the compiler level or in the kernel wrappers - needs more
work.

So if the panic was just some placeholder for things that _can_ be
caught, then I think that catching code absolutely needs to be
written, and not left as a to-do.

And if the panic situation is some fundamental "this is what the Rust
compiler does for internal allocation failures", then I think it needs
more than just kernel wrapper work - it needs the Rust compiler to be
*fixed*.

Because kernel code is different from random user-space system tools.
Running out of memory simply MUST NOT cause an abort.  It needs to
just result in an error return.

I don't know enough about how the out-of-memory situations would be
triggered and caught to actually know whether this is a fundamental
problem or not, so my reaction comes from ignorance, but basically the
rule has to be that there are absolutely zero run-time "panic()"
calls. Unsafe code has to either be caught at compile time, or it has
to be handled dynamically as just a regular error.

With the main point of Rust being safety, there is no way I will ever
accept "panic dynamically" (whether due to out-of-memory or due to
anything else - I also reacted to the "floating point use causes
dynamic panics") as a feature in the Rust model.

   Linus


Re: [PATCH 09/13] Samples: Rust examples

2021-04-14 Thread Linus Torvalds
On Wed, Apr 14, 2021 at 11:47 AM  wrote:
>
> From: Miguel Ojeda 
>
> A set of Rust modules that showcase how Rust modules look like
> and how to use the abstracted kernel features.

Honestly, I'd like to see a real example. This is fine for testing,
but I'd like to see something a bit more real, and a bit less special
than the Android "binder" WIP that comes a few patches later.

Would there be some kind of real driver or something that people could
use as a example of a real piece of code that actually does something
meaningful?

   Linus


Re: [PATCH 07/13] Rust: Kernel crate

2021-04-14 Thread Linus Torvalds
On Wed, Apr 14, 2021 at 11:47 AM  wrote:
>
> +#[alloc_error_handler]
> +fn oom(_layout: Layout) -> ! {
> +panic!("Out of memory!");
> +}
> +
> +#[no_mangle]
> +pub fn __rust_alloc_error_handler(_size: usize, _align: usize) -> ! {
> +panic!("Out of memory!");
> +}

Again, excuse my lack of internal Rust knowledge, but when do these
end up being an issue?

If the Rust compiler ends up doing hidden allocations, and they then
cause panics, then one of the main *points* of Rustification is
entirely broken. That's 100% the opposite of being memory-safe at
build time.

An allocation failure in some random driver must never ever be
something that the compiler just turns into a panic. It must be
something that is caught and handled synchronously and results in an
ENOMEM error return.

So the fact that the core patches have these kinds of

panic!("Out of memory!");

things in them as part of just the support infrastructure makes me go
"Yeah, that's fundamentally wrong".

And if this is some default that is called only when the Rust code
doesn't have error handling, then once again - I think it needs to be
a *build-time* failure, not a runtime one. Because having unsafe code
that will cause a panic only under very special situations that are
hard to trigger is about the worst possible case.

 Linus


Re: [PATCH 05/13] Rust: Compiler builtins crate

2021-04-14 Thread Linus Torvalds
On Wed, Apr 14, 2021 at 11:46 AM  wrote:
>
> We also need a helpers C source file to contain some forwarders
> to C macros and inlined functions. For the moment, we only need it
> to call the `BUG()` macro, but we will be adding more later.

Not being a Rust person, I can only guess based on random pattern
matching, but this _looks_ like these "panicking intrinsics" panic at
run-time (by calling BUG()).

Is there some way these things could cause built-time link errors
instead, so that if somebody uses 128-bit shifts, or floating point
ops in the rust code, they show up as build failures, not as run-time
ones?

Hmm?

  Linus


Re: Linux 5.12-rc7

2021-04-12 Thread Linus Torvalds
On Sun, Apr 11, 2021 at 10:14 PM Guenter Roeck  wrote:
>
> Qemu test results:
> total: 460 pass: 459 fail: 1
> Failed tests:
> sh:rts7751r2dplus_defconfig:ata:net,virtio-net:rootfs
>
> The failure bisects to commit 0f6925b3e8da ("virtio_net: Do not pull payload 
> in
> skb->head"). It is a spurious problem - the test passes roughly every other
> time. When the failure is seen, udhcpc fails to get an IP address and aborts
> with SIGTERM. So far I have only seen this with the "sh" architecture.

Hmm. Let's add in some more of the people involved in that commit, and
also netdev.

Nothing in there looks like it should have any interaction with
architecture, so that "it happens on sh" sounds odd, but maybe it's
some particular interaction with the qemu environment.

 Linus


Linux 5.12-rc7

2021-04-11 Thread Linus Torvalds
usb: Revert Fix the autosuspend enable and disable

Hao Fang (1):
  i2c: hix5hd2: use the correct HiSilicon copyright

Heiko Carstens (2):
  s390/irq: fix reading of ext_params2 field from lowcore
  s390/setup: use memblock_free_late() to free old stack

Helge Deller (1):
  parisc: parisc-agp requires SBA IOMMU driver

Ian Rogers (1):
  perf arm-spe: Avoid potential buffer overrun

Ido Schimmel (2):
  mlxsw: spectrum: Fix ECN marking in tunnel decapsulation
  selftests: forwarding: vxlan_bridge_1d: Add more ECN decap test cases

Ilya Lipnitskiy (1):
  of: property: fw_devlink: do not link ".*,nr-gpios"

Ilya Maximets (1):
  openvswitch: fix send of uninitialized stack memory in ct limit reply

Jacek Bułatek (1):
  ice: Fix for dereference of NULL pointer

Jack Qiu (1):
  fs: direct-io: fix missing sdio->boundary

Jakub Kicinski (4):
  docs: ethtool: fix some copy-paste errors
  ethtool: un-kdocify extended link state
  ethtool: document reserved fields in the uAPI
  ethtool: fix kdoc in headers

Jin Yao (1):
  perf report: Fix wrong LBR block sorting

Jiri Kosina (1):
  iwlwifi: Fix softirq/hardirq disabling in iwl_pcie_enqueue_hcmd()

Jiri Olsa (1):
  bpf: Take module reference for trampoline in module

Johannes Berg (8):
  iwlwifi: pcie: properly set LTR workarounds on 22000 devices
  iwlwifi: fw: fix notification wait locking
  iwlwifi: mvm: fix beacon protection checks
  rfkill: revert back to old userspace API by default
  mac80211: fix TXQ AC confusion
  cfg80211: check S1G beacon compat element length
  nl80211: fix potential leak of ACL params
  nl80211: fix beacon head validation

John Fastabend (2):
  bpf, sockmap: Fix sk->prot unhash op reset
  bpf, sockmap: Fix incorrect fwd_alloc accounting

John Stultz (1):
  drm/msm: Fix removal of valid error case when checking speed_bin

Jonas Holmberg (1):
  ALSA: aloop: Fix initialization of controls

Jordan Crouse (1):
  mailmap: update email address for Jordan Crouse

Julian Braha (1):
  lib: fix kconfig dependency on ARCH_WANT_FRAME_POINTERS

Kalyan Thota (1):
  drm/msm/disp/dpu1: program 3d_merge only if block is attached

Kamal Heib (1):
  RDMA/qedr: Fix kernel panic when trying to access recv_cq

Krzysztof Goreczny (1):
  ice: prevent ice_open and ice_stop during reset

Krzysztof Kozlowski (2):
  clk: socfpga: fix iomem pointer cast on 64-bit
  i2c: exynos5: correct top kerneldoc

Kumar Kartikeya Dwivedi (1):
  net: sched: bump refcount for new action in ACT replace mode

Kurt Kanzenbach (1):
  net: hsr: Reset MAC header for Tx path

Leon Romanovsky (1):
  RDMA/addr: Be strict with gid size

Libin Yang (5):
  ASoC: SOF: Intel: TGL: fix EHL ops
  ASoC: SOF: Intel: TGL: set shutdown callback to hda_dsp_shutdown
  ASoC: SOF: Intel: ICL: set shutdown callback to hda_dsp_shutdown
  ASoC: SOF: Intel: CNL: set shutdown callback to hda_dsp_shutdown
  ASoC: SOF: Intel: APL: set shutdown callback to hda_dsp_shutdown

Linus Torvalds (1):
  Linux 5.12-rc7

Loic Poulain (1):
  net: qrtr: Fix memory leak on qrtr_tx_wait failure

Lorenz Bauer (2):
  bpf: link: Refuse non-O_RDWR flags in BPF_OBJ_GET
  bpf: program: Refuse non-O_RDWR flags in BPF_OBJ_GET

Lorenzo Bianconi (1):
  mt76: mt7921: fix airtime reporting

Luca Coelho (2):
  iwlwifi: fix 11ax disabled bit in the regulatory capability flags
  iwlwifi: pcie: add support for So-F devices

Luca Fancellu (1):
  xen/evtchn: Change irq_info lock to raw_spinlock_t

Lukasz Bartosik (2):
  clk: fix invalid usage of list cursor in register
  clk: fix invalid usage of list cursor in unregister

Lv Yunlong (7):
  gpu/xen: Fix a use after free in xen_drm_drv_init
  drivers/net/wan/hdlc_fr: Fix a double free in pvc_xmit
  ethernet: myri10ge: Fix a use after free in myri10ge_sw_tso
  net:tipc: Fix a double free in tipc_sk_mcast_rcv
  ethernet/netronome/nfp: Fix a use after free in nfp_bpf_ctrl_msg_rx
  net/rds: Fix a use after free in rds_message_map_pages
  net: broadcom: bcm4908enet: Fix a double free in bcm4908_enet_dma_alloc

Maciej Żenczykowski (1):
  net-ipv6: bugfix - raw & sctp - switch to ipv6_can_nonlocal_bind()

Maciek Borzecki (1):
  cifs: escape spaces in share names

Magnus Karlsson (1):
  i40e: fix receiving of single packets in xsk zero-copy mode

Manivannan Sadhasivam (1):
  MAINTAINERS: Add entry for Qualcomm IPC Router (QRTR) driver

Maor Dickman (2):
  net/mlx5: Delete auxiliary bus driver eth-rep first
  net/mlx5: E-switch, Create vport miss group only if src rewrite
is supported

Marc Kleine-Budde (2):
  can: uapi: can.h: mark union inside struct can_frame packed
  can: mcp251x: fix support for half duplex SPI host controllers

Marco Elver (1):
  kfence, x86: fix preemptible warning on KPTI-enabled sy

Re: [PATCH] vt_ioctl: make VT_RESIZEX behave like VT_RESIZE

2021-04-11 Thread Linus Torvalds
On Sun, Apr 11, 2021 at 2:43 PM Maciej W. Rozycki  wrote:
>
>  So it does trigger with vgacon and my old server, which I have started
> experimenting with and for a start I have switched to a new kernel for an
> unrelated purpose (now that I have relieved it from all its usual tasks
> except for the last remaining one for which I haven't got the required
> user software ported to the new system yet):
>
> "struct vt_consize"->v_vlin is ignored. Please report if you need this.
> "struct vt_consize"->v_clin is ignored. Please report if you need this.

Note that it's entirely possible that things continue to work well
despite this warning. It's unclear to me from your email if you
actually see any difference (and apparently you're not able to see it
right now due to not being close to the machine).

The fact that v_vlin/v_clin are ignored doesn't necessarily mean that
they are different from what they were before, so the warning may be a
non-issue.

> It continues using svgatextmode with its glass (CRT) VT to set my usual
> 80x37 text mode (720x576 pixel resolution) by manipulating the VGA clock
> chip and the CRT controller appropriately for a nice refresh rate of 85Hz:
>
> Chipset = `TVGA8900', Textmode clock = 44.90 MHz, 80x37 chars, CharCell = 
> 9x16. Refresh = 52.51kHz/84.7Hz.

That doesn't seem necessarily wrong to me.

>  So what's the supposed impact of this change that prompted the inclusion
> of the messages?

There _may_ be no impact at all apart from the messages.

The code _used_ to set the scan lines (v_vlin) and font height
(v_clin) from those numbers if they were non-zero, and now it just
ignores them and warns instead.

The code now just sets the font height from the actual font size when
the font is set. Which is honestly the only thing that ever made
sense. Trying to set it with v_clin is ignored, but it's entirely
possible - maybe even likely - that your user of VT_RESIZEX sets it to
the same values it already has.

Exactly because setting a font line number to anything else than the
font size isn't exactly sensible.

But if your screen looks different than it used to, that is obviously
interesting and says something actually changed (outside of the
message itself).

   Linus


Re: [RFC][PATCH] mm: Split page_has_private() in two to better handle PG_private_2

2021-04-08 Thread Linus Torvalds
On Thu, Apr 8, 2021 at 2:15 PM David Howells  wrote:
>
> mm: Split page_has_private() in two to better handle PG_private_2

>From a look through the patch and some (limited) thinking about it, I
like the patch. I think it clarifies the two very different cases, and
makes it clear that one is about that page cleanup, and the other is
about the magical reference counting. The two are separate issues,
even if for PG_private both happen to be true.

So this seems sane to me.

That said, I had a couple of reactions:

> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 04a34c08e0a6..04cb440ce06e 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -832,14 +832,27 @@ static inline void ClearPageSlabPfmemalloc(struct page 
> *page)
>
>  #define PAGE_FLAGS_PRIVATE \
> (1UL << PG_private | 1UL << PG_private_2)

I think this should be re-named to be PAGE_FLAGS_CLEANUP, because I
don't think it makes any other sense to "combine" the two PG_private*
bits any more. No?

> +static inline int page_private_count(struct page *page)
> +{
> +   return test_bit(PG_private, >flags) ? 1 : 0;
> +}

Why is this open-coding the bit test, rather than just doing

return PagePrivate(page) ? 1 : 0;

instead? In fact, since test_bit() _should_ return a 'bool', I think even just

return PagePrivate(page);

should work and give the same result, but I could imagine that some
architecture version of "test_bit()" might return some other non-zero
value (although honestly, I think that should be fixed if so).

Linus


Re: 08ed4efad6: stress-ng.sigsegv.ops_per_sec -41.9% regression

2021-04-08 Thread Linus Torvalds
On Thu, Apr 8, 2021 at 1:32 AM kernel test robot  wrote:
>
> FYI, we noticed a -41.9% regression of stress-ng.sigsegv.ops_per_sec due to 
> commit
> 08ed4efad684 ("[PATCH v10 6/9] Reimplement RLIMIT_SIGPENDING on top of 
> ucounts")

Ouch.

I *think* this test may be testing "send so many signals that it
triggers the signal queue overflow case".

And I *think* that the performance degradation may be due to lots of
unnecessary allocations, because ity looks like that commit changes
__sigqueue_alloc() to do

struct sigqueue *q = kmem_cache_alloc(sigqueue_cachep, flags);

*before* checking the signal limit, and then if the signal limit was
exceeded, it will just be free'd instead.

The old code would check the signal count against RLIMIT_SIGPENDING
*first*, and if there were m ore pending signals then it wouldn't do
anything at all (including not incrementing that expensive atomic
count).

Also, the old code was very careful to only do the "get_user()" for
the *first* signal it added to the queue, and do the "put_user()" for
when removing the last signal. Exactly because those atomics are very
expensive.

The new code just does a lot of these atomics unconditionally.

I dunno. The profile data in there is a bit hard to read, but there's
a lot more cachee misses, and a *lot* of node crossers:

>5961544  +190.4%   17314361perf-stat.i.cache-misses
>   22107466  +119.2%   48457656perf-stat.i.cache-references
> 163292 ą  3%   +4582.0%7645410perf-stat.i.node-load-misses
> 227388 ą  2%   +3708.8%8660824perf-stat.i.node-loads

and (probably as a result) average instruction costs have gone up enormously:

>   3.47   +66.8%   5.79perf-stat.overall.cpi
>  22849   -65.6%   7866
> perf-stat.overall.cycles-between-cache-misses

and it does seem to be at least partly about "put_ucounts()":

>   0.00+4.54.46
> perf-profile.calltrace.cycles-pp.put_ucounts.__sigqueue_free.get_signal.arch_do_signal_or_restart.exit_to_user_mode_prepare

and a lot of "get_ucounts()".

But it may also be that the new "get sigpending" is just *so* much
more expensive than it used to be.

   Linus


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Linus Torvalds
On Wed, Apr 7, 2021 at 11:47 AM Mikulas Patocka  wrote:
>
> So, we fixed it, but we don't know why.
>
> Peter Xu's patchset that fixed it is here:
> https://lore.kernel.org/lkml/20200821234958.7896-1-pet...@redhat.com/

Yeah, that's the part that ends up being really painful to backport
(with all the subsequent fixes too), so the 4.14 people would prefer
to avoid it.

But I think that if it's a "requires dax pmem and ptrace on top", it
may simply be a non-issue for those users. Although who knows - maybe
that ends up being a real issue on Android..

Linus


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Linus Torvalds
On Wed, Apr 7, 2021 at 9:33 AM Suren Baghdasaryan  wrote:
>
> Trying my hand at backporting the patchsets Peter mentioned proved
> this to be far from easy with many dependencies. Let me look into
> Vlastimil's suggestion to backport only 17839856fd58 and it sounds
> like 5.4 already followed that path.

Well, in many ways 17839856fd58 was the "simple and obvious" fix, and
I do think it's easily backportable.

But it *did* cause problems too. Those problems may not be issues on
those old kernels, though.

In particular, commit 17839856fd58 caused uffd-wp to stop working
right, and it caused some issues with debugging (I forget the exact
details, but I think it was strace accessing PROT_NONE or write-only
pages or something like that, and COW failed).

But yes, in many ways that commit is a much simpler and more
straightforward one (which is why I tried it once - we ended up with
the much more subtle and far-reaching fixes after the UFFD issues
crept up).

The issues that 17839856fd58 caused may be entire non-events in old
kernels. In fact, the uffd writeprotect API was added fairly recently
(see commit 63b2d4174c4a that made it into v5.7), so the uffd-wp issue
that was triggered probably cannot happen in the old kernels.

The strace issue might not be relevant either, but I forget what the
details were. Mikilas should know.

See

  
https://lore.kernel.org/lkml/alpine.lrh.2.02.2009031328040.6...@file01.intranet.prod.int.rdu2.redhat.com/

for Mikulas report. I never looked into it in detail, because by then
the uffd-wp issue had already come up, so it was juat another nail in
the coffin for that simpler approach.

Mikulas, do you remember?

Linus


Re: [GIT PULL] parisc architecture fixes for kernel v5.12-rc7

2021-04-07 Thread Linus Torvalds
On Wed, Apr 7, 2021 at 2:09 AM Helge Deller  wrote:
>
>   http://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux.git 
> parisc-5.12-3

Not technically related to this pull (which I just did), but doing the
pull did remind me that you're one of the people who don't use signed
tags for pull requests.

I don't require that for kernel.org accounts, and it's not really a
problem, but I do encourage it.  So I thought I'd just mention it in
case it would be something you'd be willing to do..

You're certainly not the only one not using signed tags, but I'm
looking through my recent merges, and _most_ of them are from signed
sources these days.

   Linus


Re: [PATCH 0/5] 4.14 backports of fixes for "CoW after fork() issue"

2021-04-07 Thread Linus Torvalds
On Wed, Apr 7, 2021 at 6:22 AM Vlastimil Babka  wrote:
>
> 1) Ignore the issue (outside of Android at least). The security model of 
> zygote
> is unusual. Where else a parent of fork() doesn't trust the child, which is 
> the
> same binary?

Agreed. I think this is basically an android-only issue (with
_possibly_ some impact on crazy "pin-and-fork" loads), and doesn't
necessarily merit a backport at all.

If Android people insist on using very old kernels, knowing that they
do things that are questionable with those old kernels, at some point
it's just _their_ problem.

 Linus


Re: Linux 5.12-rc6

2021-04-05 Thread Linus Torvalds
On Mon, Apr 5, 2021 at 10:10 AM Guenter Roeck  wrote:
>
> No change in test results since last week [..]

Let's ping Frank for the alignment issue.  If that promised patch
isn't timely (and trivial), I really think that removing the alignment
check is by now the way forward for that libftd failure.

  Linus


Linux 5.12-rc6

2021-04-04 Thread Linus Torvalds
n up probe error labels
  USB: cdc-acm: use negation for NULL checks
  USB: cdc-acm: always claim data interface
  USB: cdc-acm: do not log successful probe on later errors

Jonathan Marek (1):
  pinctrl: qcom: lpass lpi: use default pullup/strength values

Kefeng Wang (2):
  riscv: Drop const annotation for sp
  riscv: Make NUMA depend on MMU

Krzysztof Kozlowski (1):
  extcon: Add stubs for extcon_register_notifier_all() functions

Lars Povlsen (1):
  pinctrl: microchip-sgpio: Fix wrong register offset for IRQ trigger

Linus Torvalds (1):
  Linux 5.12-rc6

Liu Ying (1):
  drm/imx: imx-ldb: Register LDB channel1 when it is the only
channel to be used

Lv Yunlong (1):
  video: hyperv_fb: Fix a double free in hvfb_probe

Marc Zyngier (1):
  KVM: arm64: Fix CPU interface MMIO compatibility detection

Matthew Rosato (1):
  MAINTAINERS: add backups for s390 vfio drivers

Matthew Wilcox (Oracle) (8):
  XArray: Fix split documentation
  XArray: Fix splitting to non-zero orders
  XArray: Add xa_limit_16b
  radix tree test suite: Fix compilation
  radix tree test suite: Register the main thread with the RCU library
  idr test suite: Take RCU read lock in idr_find_test_1
  idr test suite: Create anchor before launching throbber
  idr test suite: Improve reporting from idr_find_test_1

Mauri Sandberg (1):
  MIPS: kernel: setup.c: fix compilation error

Max Filippov (2):
  xtensa: move coprocessor_flush to the .text section
  xtensa: fix uaccess-related livelock in do_page_fault

Mikko Perttunen (1):
  gpu: host1x: Use different lock classes for each client

Nathan Lynch (2):
  powerpc/pseries/mobility: use struct for shared state
  powerpc/pseries/mobility: handle premature return from H_JOIN

Nirmoy Das (1):
  drm/amdgpu: fix offset calculation in amdgpu_vm_bo_clear_mappings()

Oliver Neukum (3):
  cdc-acm: fix BREAK rx code path adding necessary calls
  USB: cdc-acm: untangle a circular dependency between callback and softint
  USB: cdc-acm: downgrade message to debug

Pan Bian (1):
  drm/imx: fix memory leak when fails to init

Paolo Bonzini (4):
  KVM: SVM: load control fields from VMCB12 before checking them
  KVM: SVM: ensure that EFER.SVME is set when running nested guest
or on nested vmexit
  KVM: x86: reduce pvclock_gtod_sync_lock critical sections
  KVM: x86: disable interrupts while pvclock_gtod_sync_lock is taken

Pavel Begunkov (5):
  io_uring: always go for cancellation spin on exec
  io_uring: handle setup-failed ctx in kill_timeouts
  io_uring/io-wq: protect against sprintf overflow
  io_uring: fix EIOCBQUEUED iter revert
  block: don't ignore REQ_NOWAIT for direct IO

Qu Huang (1):
  drm/amdkfd: dqm fence memory corruption

Rafael J. Wysocki (1):
  ACPI: tables: x86: Reserve memory occupied by ACPI tables

Rajendra Nayak (2):
  pinctrl: qcom: sc7280: Fix SDC_QDSD_PINGROUP and UFS_RESET offsets
  pinctrl: qcom: sc7280: Fix SDC1_RCLK configurations

Richard Gong (1):
  firmware: stratix10-svc: reset COMMAND_RECONFIG_FLAG_PARTIAL to 0

Roger Pau Monne (1):
  pinctrl: intel: check REVID register value for device presence

Roja Rani Yarubandi (1):
  soc: qcom-geni-se: Cleanup the code to remove proxy votes

Sean Christopherson (4):
  KVM: x86/mmu: Ensure TLBs are flushed when yielding during GFN range zap
  KVM: x86/mmu: Ensure TLBs are flushed for TDP MMU during NX zapping
  KVM: x86/mmu: Don't allow TDP MMU to yield when recovering NX pages
  kbuild: lto: Merge module sections if and only if
CONFIG_LTO_CLANG is enabled

Shawn Guo (1):
  usb: dwc3: qcom: skip interconnect init for ACPI probe

Shuah Khan (1):
  usbip: vhci_hcd fix shift out-of-bounds in vhci_hub_control()

Siddharth Chandrasekaran (1):
  KVM: make: Fix out-of-source module builds

Stefan Raspl (1):
  tools/kvm_stat: Add restart delay

Steven Rostedt (VMware) (2):
  ftrace: Check if pages were allocated before calling free_pages()
  tracing: Fix stack trace event size

Suzuki K Poulose (2):
  KVM: arm64: Hide system instruction access to Trace registers
  KVM: arm64: Disable guest access to trace filter controls

Takashi Iwai (2):
  ALSA: hda: Re-add dropped snd_poewr_change_state() calls
  ALSA: hda: Add missing sanity checks in PM prepare/complete callbacks

Tetsuo Handa (1):
  reiserfs: update reiserfs_xattrs_initialized() condition

Thierry Reding (2):
  drm/tegra: dc: Restore coupling of display controllers
  drm/tegra: sor: Grab runtime PM reference across reset

Thinh Nguyen (2):
  usb: dwc3: gadget: Set gadget_max_speed when set ssp_rate
  usb: dwc3: gadget: Use max speed if unspecified

Tian Tao (1):
  drm/exynos/decon5433: Remove the unused include statements

Tomas Winkler (1):
  mei: allow map and unmap of client dma buffer only for disconnected client

Ton

Re: [PATCH] firewire: nosy: Fix a use-after-free bug in nosy_ioctl()

2021-04-03 Thread Linus Torvalds
On Fri, Apr 2, 2021 at 11:59 PM Zheyu Ma  wrote:
>
> case NOSY_IOC_START:
> +   list_for_each_entry(tmp, >lynx->client_list, link)
> +   if (tmp == client)
> +   return -EINVAL;

I don't think this is safe.

You are doing this list traversal outside the lock that protects it,
which it taken a line later:

> spin_lock_irq(client_list_lock);
> list_add_tail(>link, >lynx->client_list);
> spin_unlock_irq(client_list_lock);

so the locking is wrong.

However, I think that the proper fix is not just to move the code
inside the locked region (which makes the error handling a bit more
complex than just a return, of course), but to actually instead of
traversing the list, just look if the "client->link" list is empty.

That's what some other parts of that driver already do (ie
nosy_poll()), so I think that ->link field is already always
initialized properly (and it looks like all the list removal is using
"list_del_init()" to initialize it after removing it from a list.

So I think the patch should be something along the lines of

--- a/drivers/firewire/nosy.c
+++ b/drivers/firewire/nosy.c
@@ -346,6 +346,7 @@ nosy_ioctl(struct file *file, unsigned int
cmd, unsigned long arg)
struct client *client = file->private_data;
spinlock_t *client_list_lock = >lynx->client_list_lock;
struct nosy_stats stats;
+   int ret;

switch (cmd) {
case NOSY_IOC_GET_STATS:
@@ -360,11 +361,15 @@ nosy_ioctl(struct file *file,
return 0;

case NOSY_IOC_START:
+   ret = -EBUSY;
spin_lock_irq(client_list_lock);
-   list_add_tail(>link, >lynx->client_list);
+   if (list_empty(>link)) {
+   list_add_tail(>link,
>lynx->client_list);
+   ret = 0;
+   }
spin_unlock_irq(client_list_lock);

-   return 0;
+   return ret;

case NOSY_IOC_STOP:
spin_lock_irq(client_list_lock);

instead. The above is obviously white-space damaged (on purpose - I
don't want to take credit for this patch, I didn't find the problem,
and I have not tested the above in any shape or form).

Zheyu Ma, does something like that work for you?

Comments? Anybody else?

Linus


Re: [GIT PULL] ftrace: Check if pages were allocated before calling free_pages()

2021-04-01 Thread Linus Torvalds
On Thu, Apr 1, 2021 at 1:07 PM Steven Rostedt  wrote:
>
> On Wed, 31 Mar 2021 11:03:21 -0700
> Linus Torvalds  wrote:
>
> > @@ -6231,7 +6231,8 @@ static int ftrace_process_locs(struct module *mod,
> >   if (!addr)
> >   continue;
> >
> > - if (pg->index == pg->size) {
> > + end_offset = (pg->index+1) * sizeof(pg->records[0]);
> > + if (end_offset < PAGE_SIZE << pg->order) {
>
> I believe that needs to be:
>
> if (end_offset >= PAGE_SIZE << pg->order) {

No, but the "<" should be ">". That was just a typo.

It's ok for end_offset to be at the edge. That's the "we filled the
pages completely".

I'm not sure that can actually happen (it depends on the size of the
structure, and whether the size of the allocation is divisible by it),
but it's not wrong if it does.

Think of it this way: imagine that we have one 4kB page, and the size
of the structure is 1kB in size. You can fit 4 structures in it, and
end_offset for the last one will be index=3, so that you'll have:

end_offset = (pg->index+1) * sizeof(pg->records[0]);

which will be

end_offset = (3+1) * 1024;

ie 4096. That just means that the struct fill fill things _up_to_ the
end of the page.

So only when the end_offset is strictly larger than the page would it
have overflowed the allocation.

 Linus


  1   2   3   4   5   6   7   8   9   10   >