On Tue, 11 Nov 2025 09:13:14 +0100
Sebastian Andrzej Siewior <[email protected]> wrote:

> Nope, no read-write lock that can be used in atomic sections. Well,
> there is RCU.

Well, the lock can't simply be replaced by RCU, as the write side is also
in a critical path. It runs when new tasks are spawned.

Now we could possibly do some RCU-like magic and remove the lock on the
read side, but it would need some care on the write side.

Something like this (untested):

bool trace_pid_list_is_set(struct trace_pid_list *pid_list, unsigned int pid)
{
        union upper_chunk *upper_chunk;
        union lower_chunk *lower_chunk;
        unsigned int upper1;
        unsigned int upper2;
        unsigned int lower;
        bool ret = false;

        if (!pid_list)
                return false;

        if (pid_split(pid, &upper1, &upper2, &lower) < 0)
                return false;

        /* Lockless walk; READ_ONCE() pairs with WRITE_ONCE() in the writers */
        upper_chunk = READ_ONCE(pid_list->upper[upper1]);
        if (upper_chunk) {
                lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
                if (lower_chunk)
                        ret = test_bit(lower, lower_chunk->data);
        }

        return ret;
}

Now when all the bits of a chunk are cleared, the chunk goes to a free
list, and when a new chunk is needed, it is acquired from that free list.
We need to make sure that a chunk a reader is looking at hasn't been
recycled through the free list in the meantime.
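
For illustration, the free-list helpers could look something like this
(untested, and assuming the pid_list carries a ->lower_list head for free
lower chunks, that the first word of a chunk doubles as the free-list link
via the union, and that the writer-side lock is held by all callers):

union lower_chunk *get_lower_chunk(struct trace_pid_list *pid_list)
{
        union lower_chunk *chunk = pid_list->lower_list;

        if (!chunk)
                return NULL;

        pid_list->lower_list = chunk->next;
        /* Hand out a cleared chunk; stale bits belong to some old PID range */
        memset(chunk, 0, sizeof(*chunk));
        return chunk;
}

void put_lower_chunk(struct trace_pid_list *pid_list,
                     union lower_chunk *chunk)
{
        chunk->next = pid_list->lower_list;
        pid_list->lower_list = chunk;
}

get_upper_chunk() and put_upper_chunk() would be the analogous helpers for
the upper chunks.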

Now we could have an atomic counter in the pid_list and use it more like a
seqcount. That is, the counter gets updated when a chunk goes onto the free
list and also when one is taken off it. The read side could then become:

 again:
        counter = atomic_read(&pid_list->counter);
        /* Order the counter read before the chunk reads */
        smp_rmb();
        upper_chunk = READ_ONCE(pid_list->upper[upper1]);
        if (upper_chunk) {
                lower_chunk = READ_ONCE(upper_chunk->data[upper2]);
                if (lower_chunk) {
                        ret = test_bit(lower, lower_chunk->data);
                        /* Order the bit test before the counter recheck */
                        smp_rmb();
                        if (unlikely(counter != atomic_read(&pid_list->counter))) {
                                /* A chunk was recycled under us; try again */
                                ret = false;
                                goto again;
                        }
                }
        }


And in the set we need:

        upper_chunk = pid_list->upper[upper1];
        if (!upper_chunk) {
                upper_chunk = get_upper_chunk(pid_list);
                if (!upper_chunk) {
                        ret = -ENOMEM;
                        goto out;
                }
                /* Bump the counter before publishing a recycled chunk */
                atomic_inc(&pid_list->counter);
                smp_wmb();
                WRITE_ONCE(pid_list->upper[upper1], upper_chunk);
        }
        lower_chunk = upper_chunk->data[upper2];
        if (!lower_chunk) {
                lower_chunk = get_lower_chunk(pid_list);
                if (!lower_chunk) {
                        ret = -ENOMEM;
                        goto out;
                }
                atomic_inc(&pid_list->counter);
                smp_wmb();
                WRITE_ONCE(upper_chunk->data[upper2], lower_chunk);
        }

and in the clear:

        /* If the last bit was cleared, recycle the chunk */
        if (find_first_bit(lower_chunk->data, LOWER_MAX) >= LOWER_MAX) {
                put_lower_chunk(pid_list, lower_chunk);
                WRITE_ONCE(upper_chunk->data[upper2], NULL);
                /* Bump the counter after unpublishing the chunk */
                smp_wmb();
                atomic_inc(&pid_list->counter);
                if (upper_empty(upper_chunk)) {
                        put_upper_chunk(pid_list, upper_chunk);
                        WRITE_ONCE(pid_list->upper[upper1], NULL);
                        smp_wmb();
                        atomic_inc(&pid_list->counter);
                }
        }

That is, the counter gets updated after setting a chunk pointer to NULL and
before publishing a chunk under a new value. On the read side, the counter
is read before looking at any of the chunks and tested again after getting
the result. If the value is the same, the chunks belong to the PID being
tested and haven't been swapped by a free/alloc cycle that would leave the
reader looking at a chunk for a different PID.
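
To see why, consider a reader racing with the clear path above
(a hypothetical interleaving):

        CPU 0 (clear, lock held)        CPU 1 (lockless read)
        ------------------------        ---------------------
                                        counter = atomic_read()
                                        lower_chunk = READ_ONCE()  <- soon stale
        put_lower_chunk()
        WRITE_ONCE(..., NULL)
        smp_wmb()
        atomic_inc(&pid_list->counter)
                                        ret = test_bit()  <- chunk is on the
                                                             free list by now
                                        smp_rmb()
                                        counter != atomic_read() -> goto again

The reader may test a bit in a chunk that already went back to the free
list (and that a later writer may even have handed out for a different PID
range, bumping the counter yet again), but the counter recheck catches it
and forces a retry.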

This would allow the read side to not take any locks.

-- Steve
