Re: [PATCH v2] mm: Update mark_victim tracepoints fields

2024-02-20 Thread Michal Hocko
Hi,
sorry I have missed this before.

On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.

You are correct that post OOM all per-process information is lost. On
the other hand we do dump all this information to the kernel log. Could
you explain why that is not suitable for your purpose?

> In order to mitigate this limitation, add the following fields:
> 
> - UID
>In Android each installed application has a unique UID. Including
>the `uid` assists in correlating OOM events with specific apps.
> 
> - Process Name (comm)
>Enables identification of the affected process.
> 
> - OOM Score
>Allows userspace to get additional insights of the relative kill
>priority of the OOM victim.

What is the oom score useful for?

Is there any reason to provide a different information from the one
reported to the kernel log?
__oom_kill_process:
	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
		message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
		K(get_mm_counter(mm, MM_ANONPAGES)),
		K(get_mm_counter(mm, MM_FILEPAGES)),
		K(get_mm_counter(mm, MM_SHMEMPAGES)),
		from_kuid(&init_user_ns, task_uid(victim)),
		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);
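
For reference, a hedged sketch of what a mark_victim tracepoint extended with
the proposed fields could look like. This is illustrative only; the prototype,
field names and types here are assumptions, not the submitted patch:

TRACE_EVENT(mark_victim,
	TP_PROTO(struct task_struct *task, uid_t uid),
	TP_ARGS(task, uid),
	TP_STRUCT__entry(
		__field(int, pid)
		__string(comm, task->comm)
		__field(uid_t, uid)
		__field(short, oom_score_adj)
	),
	TP_fast_assign(
		__entry->pid = task->pid;
		__assign_str(comm, task->comm);
		__entry->uid = uid;
		__entry->oom_score_adj = task->signal->oom_score_adj;
	),
	TP_printk("pid=%d comm=%s uid=%u oom_score_adj=%hd",
		  __entry->pid, __get_str(comm), __entry->uid,
		  __entry->oom_score_adj)
);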

-- 
Michal Hocko
SUSE Labs



Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-02-20 Thread Michal Hocko
On Wed 21-02-24 11:00:53, Bixuan Cui wrote:
> 
> 
> > On 2024/2/21 10:22, Steven Rostedt wrote:
> > It's up to the memory management folks to decide on this. -- Steve
> Noted with thanks.

It would be really helpful to have more details on why we need those
trace points.

It is my understanding that you would like to have more fine-grained
numbers for the time duration of different parts of the reclaim process.
I can imagine this could be useful in some cases, but is it useful enough,
and for a wide enough variety of workloads? Is that worth dedicated static
tracepoints? Why are ad-hoc dynamic tracepoints or BPF not sufficient for
such a special situation?

In other words, tell us more about the use cases and why this is
generally useful.

Thanks!
-- 
Michal Hocko
SUSE Labs



Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Haitao Huang

[...]


Here the @nr_to_scan is reduced by the number of pages that are isolated,
but not actually reclaimed (which is reflected by @cnt).

IIUC, it looks like you want to make this function do "each cycle" as you
mentioned in the v8 [1]:

I tested with that approach and found we can only target the number of pages
attempted to reclaim, not pages actually reclaimed, due to the uncertainty
of how long it takes to reclaim pages. Besides, targeting the number of
scanned pages for each cycle is also what the ksgxd does.

If we target the actual number of pages, sometimes it just takes too long. I
saw more timeouts with the default time limit when running parallel
selftests.

I am not sure what "sometimes it just takes too long" means, but what I am
thinking is that you are trying to do some perfect yet complicated code
here.


I think what I observed was that the try_charge() would block too long
before getting a chance for schedule() to yield, causing more timeouts
than necessary.

I'll do some re-testing to be sure.



For instance, I don't think the selftests reflect the real workload, and I
believe adjusting the limit of a given EPC cgroup shouldn't be a frequent
operation, thus it is acceptable to use some easy-to-maintain but less
perfect code.

Here I still think having @nr_to_scan as a pointer is over-complicated.
For example, we can still let sgx_reclaim_pages() always scan
SGX_NR_TO_SCAN pages, but give up when enough pages have been reclaimed or
when the EPC cgroup and its descendants have been looped over:

unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
{
	unsigned int cnt = 0;
	...

	css_for_each_descendant_pre(pos, css_root) {
		...
		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
		cnt += sgx_reclaim_pages(&epc_cg->lru);

		if (cnt >= SGX_NR_TO_SCAN)
			break;
	}

	...
	return cnt;
}

Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually reaches
any descendants, but that should be rare and we don't care that much, do we?

I assume you meant @cnt here to be the number of pages actually reclaimed.
This could cause sgx_epc_cgroup_reclaim_pages() to block too long, as @cnt
may always be zero (all pages are too young) and you have to loop over all
descendants.
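
For illustration, a hedged sketch of a variant bounded by pages scanned
rather than pages reclaimed, which avoids the unbounded walk described
above. Apart from SGX_NR_TO_SCAN, css_for_each_descendant_pre(), css_misc()
and sgx_epc_cgroup_from_misc_cg(), which appear in the discussion above,
everything here is an assumption; sgx_reclaim_pages_budget() is a
hypothetical helper that scans at most a budget of pages and reports how
many it reclaimed:

static unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
{
	struct cgroup_subsys_state *pos;
	unsigned int scanned = 0, reclaimed = 0;

	rcu_read_lock();
	css_for_each_descendant_pre(pos, &root->css) {
		struct sgx_epc_cgroup *epc_cg =
			sgx_epc_cgroup_from_misc_cg(css_misc(pos));

		/*
		 * Hypothetical helper: scan at most the remaining budget and
		 * add the number of pages actually reclaimed to @reclaimed.
		 */
		scanned += sgx_reclaim_pages_budget(&epc_cg->lru,
						    SGX_NR_TO_SCAN - scanned,
						    &reclaimed);
		if (scanned >= SGX_NR_TO_SCAN)
			break;
	}
	rcu_read_unlock();

	return reclaimed;
}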


Thanks

Haitao



Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Haitao Huang

Hi Kai,
On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai  wrote:
[...]


So you introduced the work/workqueue here but there's no place which
actually queues the work.  IMHO you can either:

1) move relevant code change here; or
2) focus on introducing core functions to reclaim certain pages from a
   given EPC cgroup w/o workqueue and introduce the work/workqueue in a
   later patch.

Makes sense?



Starting in v7, I was trying to split the big patch, #10 in v6, as you and
others suggested. My thought process was to put the infrastructure needed
for per-cgroup reclaim in the front, then turn on per-cgroup reclaim in
[v9 13/15] at the end.

Before that, all reclaimables are tracked in the global LRU, so really
there is no "reclaim certain pages from a given EPC cgroup w/o workqueue"
or reclaim through the workqueue before that point, as suggested in #2.
This patch puts down the implementation for both flows, but neither is
used yet, as stated in the commit message.

#1 would force me to go back and merge the patches again.

Sorry, I feel kind of lost on this whole thing by now. It seems so random
to me. Are there hard rules on this?

I was hoping these statements would help reviewers follow the flow of the
patches.


At the end of [v9 04/15]:

For now, the EPC cgroup simply blocks additional EPC allocation in
sgx_alloc_epc_page() when the limit is reached. Reclaimable pages are
still tracked in the global active list, only reclaimed by the global
reclaimer when the total free page count is lower than a threshold.

Later patches will reorganize the tracking and reclamation code in the
global reclaimer and implement per-cgroup tracking and reclaiming.

At the end of [v9 06/15]:

Next patches will first get the cgroup reclamation flow ready while
keeping pages tracked in the global LRU and reclaimed by ksgxd before we
make the switch in the end for sgx_lru_list() to return per-cgroup
LRU.

At the end of [v9 08/15]:

Both synchronous and asynchronous flows invoke the same top level reclaim
function, and will be triggered later by sgx_epc_cgroup_try_charge()
when usage of the cgroup is at or near its limit.

At the end of [v9 10/15]:
Note at this point, all reclaimable EPC pages are still tracked in the
global LRU and per-cgroup LRUs are empty. So no per-cgroup reclamation
is activated yet.


Thanks
Haitao



Re: [PATCH] bus: mhi: host: Change the trace string for the userspace tools mapping

2024-02-20 Thread Manivannan Sadhasivam
On Sun, Feb 18, 2024 at 02:13:39PM +0530, Krishna chaitanya chundru wrote:
> User space tools can't map strings if we use them directly, as the string
> address is internal to the kernel.
> 
> So add trace point strings for the user space tools to map strings
> properly.
> 
> Signed-off-by: Krishna chaitanya chundru 

Reported-by: Steven Rostedt 
Reviewed-by: Manivannan Sadhasivam 

- Mani

> ---
>  drivers/bus/mhi/host/main.c  | 4 ++--
>  drivers/bus/mhi/host/trace.h | 2 ++
>  2 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
> index 2d38f6005da6..15d657af9b5b 100644
> --- a/drivers/bus/mhi/host/main.c
> +++ b/drivers/bus/mhi/host/main.c
> @@ -1340,7 +1340,7 @@ static int mhi_update_channel_state(struct 
> mhi_controller *mhi_cntrl,
>   enum mhi_cmd_type cmd = MHI_CMD_NOP;
>   int ret;
>  
> - trace_mhi_channel_command_start(mhi_cntrl, mhi_chan, to_state, 
> "Updating");
> + trace_mhi_channel_command_start(mhi_cntrl, mhi_chan, to_state, 
> TPS("Updating"));
>   switch (to_state) {
>   case MHI_CH_STATE_TYPE_RESET:
> >   write_lock_irq(&mhi_chan->lock);
> @@ -1407,7 +1407,7 @@ static int mhi_update_channel_state(struct 
> mhi_controller *mhi_cntrl,
> >   write_unlock_irq(&mhi_chan->lock);
>   }
>  
> - trace_mhi_channel_command_end(mhi_cntrl, mhi_chan, to_state, "Updated");
> + trace_mhi_channel_command_end(mhi_cntrl, mhi_chan, to_state, 
> TPS("Updated"));
>  exit_channel_update:
>   mhi_cntrl->runtime_put(mhi_cntrl);
>   mhi_device_put(mhi_cntrl->mhi_dev);
> diff --git a/drivers/bus/mhi/host/trace.h b/drivers/bus/mhi/host/trace.h
> index d12a98d44272..368515dcb22d 100644
> --- a/drivers/bus/mhi/host/trace.h
> +++ b/drivers/bus/mhi/host/trace.h
> @@ -84,6 +84,8 @@ DEV_ST_TRANSITION_LIST
>  #define dev_st_trans(a, b)   { DEV_ST_TRANSITION_##a, b },
>  #define dev_st_trans_end(a, b)   { DEV_ST_TRANSITION_##a, b }
>  
> +#define TPS(x)   tracepoint_string(x)
> +
>  TRACE_EVENT(mhi_gen_tre,
>  
>   TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_chan *mhi_chan,
> 
> ---
> base-commit: ceeb64f41fe6a1eb9fc56d583983a81f8f3dd058
> change-id: 20240218-ftrace_string-7677762aa63c
> 
> Best regards,
> -- 
> Krishna chaitanya chundru 
> 

-- 
மணிவண்ணன் சதாசிவம்



Re: [PATCH v3] modules: wait do_free_init correctly

2024-02-20 Thread Eric Chanudet
On Sun, Feb 18, 2024 at 01:21:53PM -0800, Andrew Morton wrote:
> On Sat, 17 Feb 2024 16:18:10 +0800 Changbin Du  wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking. But the commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moves do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() cannot ensure that do_free_init()
> > has completed. We should wait for it via flush_work().
> > 
> > Without this fix, we still could encounter false positive reports in
> > W+X checking, and the rcu synchronization is unnecessary which can
> > introduce significant delay.
> > 
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
> >   [0.291444] Freeing unused kernel memory: 5568K
> >   [0.402442] Run /sbin/init as init process
> > 
> > With this fix, the above delay can be eliminated.
> 
> Thanks, I'll queue this as a delta, to be folded into the base patch
> prior to upstreaming.
> 
> I added a Tested-by: Eric, if that's OK by him?

Absolutely, I should have put it in my initial reply.
Adding here as confirmation:
Tested-by: Eric Chanudet 

Thanks,

-- 
Eric Chanudet




Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-02-20 Thread Bixuan Cui




On 2024/2/21 10:22, Steven Rostedt wrote:

It's up to the memory management folks to decide on this. -- Steve

Noted with thanks.

Bixuan Cui



Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-02-20 Thread Steven Rostedt
On Wed, 21 Feb 2024 09:44:32 +0800
Bixuan Cui  wrote:

> ping~
> 

It's up to the memory management folks to decide on this.

-- Steve


> On 2024/1/5 9:36, Bixuan Cui wrote:
> > When the system memory is low, kswapd reclaims the memory. The key steps
> > of memory reclamation include
> > 1.shrink_lruvec
> >* shrink_active_list, moves folios from the active LRU to the inactive 
> > LRU
> >* shrink_inactive_list, shrink lru from inactive LRU list
> > 2.shrink_slab
> >* shrinker->count_objects(), calculates the freeable memory
> >* shrinker->scan_objects(), reclaims the slab memory
> > 
> > The existing tracers in the vmscan are as follows:
> > 
> > --do_try_to_free_pages
> > --shrink_zones
> > --trace_mm_vmscan_node_reclaim_begin (tracer)
> > --shrink_node
> > --shrink_node_memcgs
> >--trace_mm_vmscan_memcg_shrink_begin (tracer)
> >--shrink_lruvec
> >  --shrink_list
> >--shrink_active_list
> >   --trace_mm_vmscan_lru_shrink_active (tracer)
> >--shrink_inactive_list
> >   --trace_mm_vmscan_lru_shrink_inactive (tracer)
> >  --shrink_active_list
> >--shrink_slab
> >  --do_shrink_slab
> >  --shrinker->count_objects()
> >  --trace_mm_shrink_slab_start (tracer)
> >  --shrinker->scan_objects()
> >  --trace_mm_shrink_slab_end (tracer)
> >--trace_mm_vmscan_memcg_shrink_end (tracer)
> > --trace_mm_vmscan_node_reclaim_end (tracer)
> > 
> > If we get the duration and quantity of shrink lru and slab,
> > then we can measure the memory recycling, as follows
> > 
> > Measuring memory reclamation with bpf:
> >    LRU FILE:
> > CPU COMM     ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
> > 7   kswapd0  26                51                  32
> > 7   kswapd0  52                47                  13
> >    SLAB:
> > CPU COMM     OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
> >  1  kswapd0  super_cache_scan.cfi_jt  2              341             3225          128
> >  7  kswapd0  super_cache_scan.cfi_jt  0              2247            8524          1024
> >  7  kswapd0  super_cache_scan.cfi_jt  2              3670            0             0
> > 
> > For this, add the new tracer to shrink_active_list/shrink_inactive_list
> > and shrinker->count_objects().
> > 
> > Changes:
> > v6: * Add Reviewed-by from Steven Rostedt.
> > v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
> > replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
> >  * Add the explanation for adding new shrink lru events into 'mm: 
> > vmscan: add new event to trace shrink lru'
> > v4: Add Reviewed-by and Changelog to every patch.
> > v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the 
> > trace event.
> > v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the 
> > same time to fix build error.
> > 
> > cuibixuan (2):
> >mm: shrinker: add new event to trace shrink count
> >mm: vmscan: add new event to trace shrink lru
> > 
> >   include/trace/events/vmscan.h | 80 ++-
> >   mm/shrinker.c |  4 ++
> >   mm/vmscan.c   | 11 +++--
> >   3 files changed, 90 insertions(+), 5 deletions(-)
> >   




Re: [PATCH -next v6 0/2] Make memory reclamation measurable

2024-02-20 Thread Bixuan Cui

ping~

On 2024/1/5 9:36, Bixuan Cui wrote:

When the system memory is low, kswapd reclaims the memory. The key steps
of memory reclamation include
1.shrink_lruvec
   * shrink_active_list, moves folios from the active LRU to the inactive LRU
   * shrink_inactive_list, shrink lru from inactive LRU list
2.shrink_slab
   * shrinker->count_objects(), calculates the freeable memory
   * shrinker->scan_objects(), reclaims the slab memory

The existing tracers in the vmscan are as follows:

--do_try_to_free_pages
--shrink_zones
--trace_mm_vmscan_node_reclaim_begin (tracer)
--shrink_node
--shrink_node_memcgs
   --trace_mm_vmscan_memcg_shrink_begin (tracer)
   --shrink_lruvec
 --shrink_list
   --shrink_active_list
  --trace_mm_vmscan_lru_shrink_active (tracer)
   --shrink_inactive_list
  --trace_mm_vmscan_lru_shrink_inactive (tracer)
 --shrink_active_list
   --shrink_slab
 --do_shrink_slab
 --shrinker->count_objects()
 --trace_mm_shrink_slab_start (tracer)
 --shrinker->scan_objects()
 --trace_mm_shrink_slab_end (tracer)
   --trace_mm_vmscan_memcg_shrink_end (tracer)
--trace_mm_vmscan_node_reclaim_end (tracer)

If we get the duration and quantity of shrink lru and slab,
then we can measure the memory recycling, as follows

Measuring memory reclamation with bpf:
   LRU FILE:
CPU COMM     ShrinkActive(us)  ShrinkInactive(us)  Reclaim(page)
7   kswapd0  26                51                  32
7   kswapd0  52                47                  13
   SLAB:
CPU COMM     OBJ_NAME                 Count_Dur(us)  Freeable(page)  Scan_Dur(us)  Reclaim(page)
 1  kswapd0  super_cache_scan.cfi_jt  2              341             3225          128
 7  kswapd0  super_cache_scan.cfi_jt  0              2247            8524          1024
 7  kswapd0  super_cache_scan.cfi_jt  2              3670            0             0

For this, add the new tracer to shrink_active_list/shrink_inactive_list
and shrinker->count_objects().

Changes:
v6: * Add Reviewed-by from Steven Rostedt.
v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
 * Add the explanation for adding new shrink lru events into 'mm: vmscan: 
add new event to trace shrink lru'
v4: Add Reviewed-by and Changelog to every patch.
v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in the trace 
event.
v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at the same 
time to fix build error.

cuibixuan (2):
   mm: shrinker: add new event to trace shrink count
   mm: vmscan: add new event to trace shrink lru

  include/trace/events/vmscan.h | 80 ++-
  mm/shrinker.c |  4 ++
  mm/vmscan.c   | 11 +++--
  3 files changed, 90 insertions(+), 5 deletions(-)





Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop

2024-02-20 Thread Steven Rostedt
On Tue, 20 Feb 2024 10:40:23 -0500
Steven Rostedt  wrote:

> > Try resetting the info->add_timestamp flags to add_ts_default on goto again
> > within __rb_reserve_next().
> >  
> 
> I was looking at that too, but I don't know how it will make a difference.
> 
> Note, the test that fails is in my test suite, and takes about a half hour
> to get there. Running that suite takes up resources (it's my main test
> suite for all changes). I'm currently testing other patches so I either
> need to figure it out through inspection, or this will need to wait a while.

I did a bit of debugging and it's just getting weird. It triggers when I
run with the "perf" clock, which is the local_clock() (and this has
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y which does some strange paths for
local_clock()).

It appears that on function_graph tracing with perf clock, it fails the
local_cmpxchg a lot! And for some reason, this is causing it to fail the
overflow more often. Well, if it pushes the header page forward too, it
will also cause that to go back and add to the counter.

I tried moving the "again:" to the beginning of the function (after
resetting the add_timestamp flags and length) so that it gets a new
tail_page each time, but that doesn't appear to make a difference either.

It looks like that retry loop in some configs causes it to fail allocation
on the subbuffer over 1000 times!

Looks like this changes expectations a bit, and will need more design
changes to make it work.

-- Steve



[PATCH v18 5/6] Documentation: tracing: Add ring-buffer mapping

2024-02-20 Thread Vincent Donnefort
It is now possible to mmap() a ring-buffer to stream its content. Add
some documentation and a code example.

Signed-off-by: Vincent Donnefort 

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
timerlat-tracer
intel_th
ring-buffer-design
+   ring-buffer-map
stm
sys-t
coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst 
b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..0426ab4bcf3d
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort 
+
+Overview
+========
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer becomes
+then a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred as the meta-page. One of the most important field 
of
+the meta-page is the reader. It contains the sub-buffer ID which can be safely
+read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascendant ID. It 
is
+therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+reader_id = meta->reader->id;
+reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one 
using
+the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also updates
+the meta-page fields.
+
+Limitations
+===========
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+either resize it (either by increasing the entire size of the ring-buffer or
+each subbuf). It is also not possible to use snapshot and causes splice to copy
+the ring buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers 
on
+trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+#include 
+#include 
+#include 
+#include 
+
+#include 
+
+#include 
+#include 
+
+#define TRACE_PIPE_RAW 
"/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+int main(void)
+{
+int page_size = getpagesize(), fd, reader_id;
+unsigned long meta_len, data_len;
+struct trace_buffer_meta *meta;
+void *map, *reader, *data;
+
+fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+if (fd < 0)
+exit(EXIT_FAILURE);
+
+map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+if (map == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+meta = (struct trace_buffer_meta *)map;
+meta_len = meta->meta_page_size;
+
+printf("entries:%llu\n", meta->entries);
+printf("overrun:%llu\n", meta->overrun);
+printf("read:   %llu\n", meta->read);
+printf("nr_subbufs: %u\n", meta->nr_subbufs);
+
+data_len = meta->subbuf_size * meta->nr_subbufs;
+data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, 
meta_len);
+if (data == MAP_FAILED)
+exit(EXIT_FAILURE);
+
+if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+exit(EXIT_FAILURE);
+
+reader_id = meta->reader.id;
+reader = data + meta->subbuf_size * reader_id;
+
+printf("Current reader address: %p\n", reader);
+
+munmap(data, data_len);
+munmap(meta, meta_len);
+close (fd);
+
+return 0;
+}
-- 
2.44.0.rc0.258.g7320e95886-goog




[PATCH v18 3/6] tracing: Add snapshot refcount

2024-02-20 Thread Vincent Donnefort
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.

Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This intends
to allow access to that variable in a context where the mmap lock is
already held.

Signed-off-by: Vincent Donnefort 

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..8ae7c2cb63a0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1301,6 +1301,50 @@ static void free_snapshot(struct trace_array *tr)
tr->allocated_snapshot = false;
 }
 
+static int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+   int ret;
+
+   lockdep_assert_held(&trace_types_lock);
+
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (tr->snapshot == UINT_MAX) {
+   spin_unlock(&tr->snapshot_trigger_lock);
+   return -EBUSY;
+   }
+
+   tr->snapshot++;
+   spin_unlock(&tr->snapshot_trigger_lock);
+
+   ret = tracing_alloc_snapshot_instance(tr);
+   if (ret) {
+   spin_lock(&tr->snapshot_trigger_lock);
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+   }
+
+   return ret;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+   int ret;
+
+   mutex_lock(&trace_types_lock);
+   ret = tracing_arm_snapshot_locked(tr);
+   mutex_unlock(&trace_types_lock);
+
+   return ret;
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+   spin_lock(&tr->snapshot_trigger_lock);
+   if (!WARN_ON(!tr->snapshot))
+   tr->snapshot--;
+   spin_unlock(&tr->snapshot_trigger_lock);
+}
+
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
  *
@@ -1374,10 +1418,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
 
mutex_lock(&trace_types_lock);
 
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret)
-   goto fail_unlock;
-
if (tr->current_trace->use_max_tr) {
ret = -EBUSY;
goto fail_unlock;
@@ -1396,6 +1436,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, 
void *cond_data,
goto fail_unlock;
}
 
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
+   goto fail_unlock;
+
local_irq_disable();
arch_spin_lock(&tr->max_lock);
tr->cond_snapshot = cond_snapshot;
@@ -1440,6 +1484,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
arch_spin_unlock(&tr->max_lock);
local_irq_enable();
 
+   tracing_disarm_snapshot(tr);
+
return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
@@ -1482,6 +1528,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
 #define free_snapshot(tr)  do { } while (0)
+#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; })
 #endif /* CONFIG_TRACER_SNAPSHOT */
 
 void tracer_tracing_off(struct trace_array *tr)
@@ -6595,11 +6642,12 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 */
synchronize_rcu();
free_snapshot(tr);
+   tracing_disarm_snapshot(tr);
}
 
-   if (t->use_max_tr && !tr->allocated_snapshot) {
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+   if (t->use_max_tr) {
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
goto out;
}
 #else
@@ -6608,8 +6656,13 @@ int tracing_set_tracer(struct trace_array *tr, const 
char *buf)
 
if (t->init) {
ret = tracer_init(t, tr);
-   if (ret)
+   if (ret) {
+#ifdef CONFIG_TRACER_MAX_TRACE
+   if (t->use_max_tr)
+   tracing_disarm_snapshot(tr);
+#endif
goto out;
+   }
}
 
tr->current_trace = t;
@@ -7711,10 +7764,11 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,
if (tr->allocated_snapshot)
ret = resize_buffer_duplicate_size(&tr->max_buffer,
&tr->array_buffer, iter->cpu_file);
-   else
-   ret = tracing_alloc_snapshot_instance(tr);
-   if (ret < 0)
+
+   ret = tracing_arm_snapshot_locked(tr);
+   if (ret)
break;
+
/* Now, we're going to swap */
if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
local_irq_disable();
@@ -7724,6 +7778,7 @@ tracing_snapshot_write(struct file *filp, const char 
__user *ubuf, size_t cnt,

[PATCH v18 4/6] tracing: Allow user-space mapping of the ring-buffer

2024-02-20 Thread Vincent Donnefort
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is imposible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
 ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced ioctl:
TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page reader ID to
point to the next reader containing unread data.
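
For illustration, a minimal consumer loop on top of this interface, assuming
fd, meta and data have been set up as in the documentation example of this
series; consume() is a placeholder for application code, not part of the ABI:

	int reader_id;
	void *reader;

	for (;;) {
		/* Without O_NONBLOCK, this waits until a reader subbuf is ready. */
		if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
			break;

		reader_id = meta->reader.id;
		reader = data + (size_t)meta->subbuf_size * reader_id;

		/* meta->reader.read bytes of this subbuf were already consumed. */
		consume(reader, meta->subbuf_size, meta->reader.read);
	}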

Mapping will prevent snapshot and buffer size modifications.

Signed-off-by: Vincent Donnefort 

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
__u64   Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8ae7c2cb63a0..67ce7b367edb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1176,6 +1176,12 @@ static void tracing_snapshot_instance_cond(struct 
trace_array *tr,
return;
}
 
+   if (tr->mapped) {
+   trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+   trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+   return;
+   }
+
local_irq_save(flags);
update_max_tr(tr, current, smp_processor_id(), cond_data);
local_irq_restore(flags);
@@ -1308,7 +1314,7 @@ static int tracing_arm_snapshot_locked(struct trace_array 
*tr)
lockdep_assert_held(&trace_types_lock);
 
spin_lock(&tr->snapshot_trigger_lock);
-   if (tr->snapshot == UINT_MAX) {
+   if (tr->snapshot == UINT_MAX || tr->mapped) {
spin_unlock(&tr->snapshot_trigger_lock);
return -EBUSY;
}
@@ -6535,7 +6541,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
-   if (tr->current_trace == &nop_trace)
return;
-   
+
tr->current_trace->enabled--;
 
if (tr->current_trace->reset)
@@ -8654,15 +8660,31 @@ tracing_buffers_splice_read(struct file *file, loff_t 
*ppos,
return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, 
unsigned long arg)
 {
struct ftrace_buffer_info *info = file->private_data;
struct trace_iterator *iter = &info->iter;
+   int err;
 
-   if (cmd)
-   return -ENOIOCTLCMD;
+   if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+   if (!(file->f_flags & O_NONBLOCK)) {
+   err = ring_buffer_wait(iter->array_buffer->buffer,
+  iter->cpu_file,
+  iter->tr->buffer_percent);
+   if (err)
+   return err;
+   }
 
+   return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+ iter->cpu_file);
+   } else if (cmd) {
+   return -ENOTTY;
+   }
+
+   /*
+* An ioctl call with cmd 0 to the ring buffer file will wake up all
+* waiters
+*/
mutex_lock(&trace_types_lock);
 
iter->wait_index++;
@@ -8675,6 +8697,110 @@ static long tracing_buffers_ioctl(struct file *file, 
unsigned int cmd, unsigned
return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+   struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+   struct trace_iterator *iter = &info->iter;
+   vm_fault_t ret = VM_FAULT_SIGBUS;
+   struct page *page;
+
+   page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+vmf->pgoff);
+   if (!page)
+   return ret;
+
+   get_page(page);
+   vmf->page = page;
+   vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+   vmf->page->index = vmf->pgoff;
+
+   return 0;
+}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+   int err = 0;
+
+   /*
+* Called with mmap_lock held. lockdep would be unhappy if we would now
+* take trace_types_lock. Instead use the specific
+* snapshot_trigger_lock.
+*/
+ 

[PATCH v18 2/6] ring-buffer: Introducing ring-buffer mapping functions

2024-02-20 Thread Vincent Donnefort
In preparation for allowing the user-space to map a ring-buffer, add
a set of mapping functions:

  ring_buffer_{map,unmap}()
  ring_buffer_map_fault()

And controls on the ring-buffer:

  ring_buffer_map_get_reader()  /* swap reader and head */

Mapping the ring-buffer also involves:

  A unique ID for each subbuf of the ring-buffer, currently they are
  only identified through their in-kernel VA.

  A meta-page, where are stored ring-buffer statistics and a
  description for the current reader

The linear mapping exposes the meta-page, and each subbuf of the
ring-buffer, ordered following their unique ID, assigned during the
first mapping.

Once mapped, no subbuf can get in or out of the ring-buffer: the buffer
size will remain unmodified and the splice enabling functions will in
reality simply memcpy the data instead of swapping subbufs.

Signed-off-by: Vincent Donnefort 

diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h
index fa802db216f9..0841ba8bab14 100644
--- a/include/linux/ring_buffer.h
+++ b/include/linux/ring_buffer.h
@@ -6,6 +6,8 @@
 #include 
 #include 
 
+#include 
+
 struct trace_buffer;
 struct ring_buffer_iter;
 
@@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct 
hlist_node *node);
 #define trace_rb_cpu_prepare   NULL
 #endif
 
+int ring_buffer_map(struct trace_buffer *buffer, int cpu);
+int ring_buffer_unmap(struct trace_buffer *buffer, int cpu);
+struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu,
+  unsigned long pgoff);
+int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu);
 #endif /* _LINUX_RING_BUFFER_H */
diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
new file mode 100644
index ..ffcd8dfcaa4f
--- /dev/null
+++ b/include/uapi/linux/trace_mmap.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _TRACE_MMAP_H_
+#define _TRACE_MMAP_H_
+
+#include 
+
+/**
+ * struct trace_buffer_meta - Ring-buffer Meta-page description
+ * @meta_page_size:Size of this meta-page.
+ * @meta_struct_len:   Size of this structure.
+ * @subbuf_size:   Size of each sub-buffer.
+ * @nr_subbufs:Number of subbufs in the ring-buffer, including 
the reader.
+ * @reader.lost_events:Number of events lost at the time of the reader 
swap.
+ * @reader.id: subbuf ID of the current reader. ID range [0 : 
@nr_subbufs - 1]
+ * @reader.read:   Number of bytes read on the reader subbuf.
+ * @flags: Placeholder for now, 0 until new features are supported.
+ * @entries:   Number of entries in the ring-buffer.
+ * @overrun:   Number of entries lost in the ring-buffer.
+ * @read:  Number of entries that have been read.
+ * @Reserved1: Reserved for future use.
+ * @Reserved2: Reserved for future use.
+ */
+struct trace_buffer_meta {
+   __u32   meta_page_size;
+   __u32   meta_struct_len;
+
+   __u32   subbuf_size;
+   __u32   nr_subbufs;
+
+   struct {
+   __u64   lost_events;
+   __u32   id;
+   __u32   read;
+   } reader;
+
+   __u64   flags;
+
+   __u64   entries;
+   __u64   overrun;
+   __u64   read;
+
+   __u64   Reserved1;
+   __u64   Reserved2;
+};
+
+#endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index ca796675c0a1..1d7d7a701867 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -338,6 +339,7 @@ struct buffer_page {
local_t  entries;   /* entries on this page */
unsigned longreal_end;  /* real end of data */
unsigned order; /* order of the page */
+   u32  id;/* ID for external mapping */
struct buffer_data_page *page;  /* Actual data page */
 };
 
@@ -484,6 +486,12 @@ struct ring_buffer_per_cpu {
u64 read_stamp;
/* pages removed since last reset */
unsigned long   pages_removed;
+
+   unsigned int mapped;
+   struct mutex mapping_lock;
+   unsigned long   *subbuf_ids;/* ID to subbuf addr */
+   struct trace_buffer_meta *meta_page;
+
/* ring buffer pages to update, > 0 to add, < 0 to remove */
longnr_pages_to_update;
struct list_headnew_pages; /* new pages to add */
@@ -1548,6 +1556,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+  

[PATCH v18 1/6] ring-buffer: Zero ring-buffer sub-buffers

2024-02-20 Thread Vincent Donnefort
In preparation for the ring-buffer memory mapping where each subbuf will
be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort 
Reviewed-by: Masami Hiramatsu (Google) 

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd4bfe3ecf01..ca796675c0a1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu 
*cpu_buffer,
 
list_add(&bpage->list, pages);
 
-   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+   page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+   mflags | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page)
goto free_pages;
@@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long 
nr_pages, int cpu)
 
cpu_buffer->reader_page = bpage;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, 
cpu_buffer->buffer->subbuf_order);
+   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+   cpu_buffer->buffer->subbuf_order);
if (!page)
goto fail_free_reader;
bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, 
int cpu)
if (bpage->data)
goto out;
 
-   page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+   page = alloc_pages_node(cpu_to_node(cpu),
+   GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
cpu_buffer->buffer->subbuf_order);
if (!page) {
kfree(bpage);
-- 
2.44.0.rc0.258.g7320e95886-goog




[PATCH v18 0/6] Introducing trace buffer mapping by user-space

2024-02-20 Thread Vincent Donnefort
The tracing ring-buffers can be stored on disk or sent to network
without any copy via splice. However the later doesn't allow real time
processing of the traces. A solution is to give userspace direct access
to the ring-buffer pages via a mapping. An application can now become a
consumer of the ring-buffer, in a similar fashion to what trace_pipe
offers.

Support for this new feature can already be found in libtracefs from
version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
(intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
  * Prevent mapping when there are snapshot users and vice versa.
  * Refine the meta-page.
  * Fix types in the meta-page.
  * Collect Reviewed-by.

v10 -> v11:
  * Add Documentation and code sample.
  * Add a selftest.
  * Move all the update to the meta-page into a single
rb_update_meta_page().
  * rb_update_meta_page() is now called from
ring_buffer_map_get_reader() to fix NOBLOCK callers.
  * kerneldoc for struct trace_meta_page.
  * Add a patch to zero all the ring-buffer allocations.

v9 -> v10:
  * Refactor rb_update_meta_page()
  * In-loop declaration for foreach_subbuf_page()
  * Check for cpu_buffer->mapped overflow

v8 -> v9:
  * Fix the unlock path in ring_buffer_map()
  * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer
  * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a)

v7 -> v8:
  * Drop the subbufs renaming into bpages
  * Use subbuf as a name when relevant

v6 -> v7:
  * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/
  * Support for subbufs
  * Rename subbufs into bpages

v5 -> v6:
  * Rebase on next-20230802.
  * (unsigned long) -> (void *) cast for virt_to_page().
  * Add a wait for the GET_READER_PAGE ioctl.
  * Move writer fields update (overrun/pages_lost/entries/pages_touched)
in the irq_work.
  * Rearrange id in struct buffer_page.
  * Rearrange the meta-page.
  * ring_buffer_meta_page -> trace_buffer_meta_page.
  * Add meta_struct_len into the meta-page.

v4 -> v5:
  * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3)

v3 -> v4:
  * Add to the meta-page:
   - pages_lost / pages_read (allow to compute how full is the
 ring-buffer)
   - read (allow to compute how many entries can be read)
   - A reader_page struct.
  * Rename ring_buffer_meta_header -> ring_buffer_meta
  * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page
  * Properly consume events on ring_buffer_map_get_reader_page() with
rb_advance_reader().

v2 -> v3:
  * Remove data page list (for non-consuming read)
** Implies removing order > 0 meta-page
  * Add a new meta page field ->read
  * Rename ring_buffer_meta_page_header into ring_buffer_meta_header

v1 -> v2:
  * Hide data_pages from the userspace struct
  * Fix META_PAGE_MAX_PAGES
  * Support for order > 0 meta-page
  * Add missing page->mapping.

Vincent Donnefort (6):
  ring-buffer: Zero ring-buffer sub-buffers
  ring-buffer: Introducing ring-buffer mapping functions
  tracing: Add snapshot refcount
  tracing: Allow user-space mapping of the ring-buffer
  Documentation: tracing: Add ring-buffer mapping
  ring-buffer/selftest: Add ring-buffer mapping test

 Documentation/trace/index.rst |   1 +
 Documentation/trace/ring-buffer-map.rst   | 106 +
 include/linux/ring_buffer.h   |   7 +
 include/uapi/linux/trace_mmap.h   |  48 +++
 kernel/trace/ring_buffer.c| 385 

Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Huang, Kai
On Tue, 2024-02-20 at 14:18 +0100, Michal Koutný wrote:
> On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
> wrote:
> > I am not sure, but is it possible or legal for an ancestor to have less 
> > limit
> > than children?
> 
> Why not?
> It is desired for proper hiearchical delegation and the tightest limit
> of ancestors applies (cf memory.max).
> 

OK.  Thanks for the info.


Re: [PATCH v3 1/7] remoteproc: Add TEE support

2024-02-20 Thread Mathieu Poirier
Good morning,

On Wed, Feb 14, 2024 at 06:21:21PM +0100, Arnaud Pouliquen wrote:
> From: Arnaud Pouliquen 
> 
> Add a remoteproc TEE (Trusted Execution Environment) driver
> that will be probed by the TEE bus. If the associated Trusted
> application is supported on secure part this device offers a client
> interface to load a firmware in the secure part.
> This firmware could be authenticated and decrypted by the secure
> trusted application.
> 
> Signed-off-by: Arnaud Pouliquen 
> ---
> update from V2
> - Use 'tee_rproc' prefix for all functions
> - rename tee_rproc_get_loaded_rsc_table to tee_rproc_find_loaded_rsc_table
> - redefine fonction to better match with the rproc_ops structure format
>   - replace 'struct tee_rproc' parameter by 'struct rproc' parameter
>   - rename 'rproc_tee_get_rsc_table()' to tee_rproc_get_loaded_rsc_table()
> and rework it to remove the cached_table management.
>   - introduce tee_rproc_get_context() to get the tee_rproc struct from the
>   rproc struct
>   - rename tee_rproc_get_loaded_rsc_table() to 
> tee_rproc_find_loaded_rsc_table()
> - remove useless check on tee_rproc_ctx structure in tee_rproc_register()
>   and tee_rproc_unregister()
> - fix test on the return of tee_rproc_ctx = devm_kzalloc()
> - remove useless includes and unused tee_rproc_mem structure.
> ---
>  drivers/remoteproc/Kconfig  |   9 +
>  drivers/remoteproc/Makefile |   1 +
>  drivers/remoteproc/tee_remoteproc.c | 397 
>  include/linux/tee_remoteproc.h  | 102 +++
>  4 files changed, 509 insertions(+)
>  create mode 100644 drivers/remoteproc/tee_remoteproc.c
>  create mode 100644 include/linux/tee_remoteproc.h
> 
> diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig
> index 48845dc8fa85..85299606806c 100644
> --- a/drivers/remoteproc/Kconfig
> +++ b/drivers/remoteproc/Kconfig
> @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC
>  
> It's safe to say N if not interested in using RPU r5f cores.
>  
> +
> +config TEE_REMOTEPROC
> + tristate "trusted firmware support by a TEE application"

s/trusted/Trusted

And the wording will have to change a little, I will advise on this later.

> + depends on OPTEE
> + help
> +   Support for trusted remote processors firmware. The firmware
> +   authentication and/or decryption are managed by a trusted application.
> +   This can be either built-in or a loadable module.
> +
>  endif # REMOTEPROC
>  
>  endmenu
> diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile
> index 91314a9b43ce..fa8daebce277 100644
> --- a/drivers/remoteproc/Makefile
> +++ b/drivers/remoteproc/Makefile
> @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC)   += rcar_rproc.o
>  obj-$(CONFIG_ST_REMOTEPROC)  += st_remoteproc.o
>  obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o
>  obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o
> +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o
>  obj-$(CONFIG_TI_K3_DSP_REMOTEPROC)   += ti_k3_dsp_remoteproc.o
>  obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o
>  obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o
> diff --git a/drivers/remoteproc/tee_remoteproc.c 
> b/drivers/remoteproc/tee_remoteproc.c
> new file mode 100644
> index ..ac727e062d00
> --- /dev/null
> +++ b/drivers/remoteproc/tee_remoteproc.c
> @@ -0,0 +1,397 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved
> + * Author: Arnaud Pouliquen 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#include "remoteproc_internal.h"
> +
> +#define MAX_TEE_PARAM_ARRY_MEMBER4
> +
> +/*
> + * Authentication of the firmware and load in the remote processor memory
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [in]   params[1].memref:  buffer containing the image of the 
> buffer
> + */
> +#define TA_RPROC_FW_CMD_LOAD_FW  1
> +
> +/*
> + * Start the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_START_FW 2
> +
> +/*
> + * Stop the remote processor
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + */
> +#define TA_RPROC_FW_CMD_STOP_FW  3
> +
> +/*
> + * Return the address of the resource table, or 0 if not found
> + * No check is done to verify that the address returned is accessible by
> + * the non secure context. If the resource table is loaded in a protected
> + * memory the access by the non secure context will lead to a data abort.
> + *
> + * [in]  params[0].value.a:  unique 32bit identifier of the remote processor
> + * [out]  params[1].value.a: 32bit LSB resource table memory address
> + * [out]  params[1].value.b: 32bit MSB resource table memory 

Re: [PATCH] MAINTAINERS: add memory mapping entry with reviewers

2024-02-20 Thread Vlastimil Babka
On 2/20/24 07:44, Lorenzo Stoakes wrote:
> Recently there have been a number of patches which have affected various
> aspects of the memory mapping logic as implemented in mm/mmap.c where it
> would have been useful for regular contributors to have been notified.
> 
> Add an entry for this part of mm in particular with regular contributors
> tagged as reviewers.
> 
> Signed-off-by: Lorenzo Stoakes 

Acked-by: Vlastimil Babka 

Would be nice if this targeted sub-reviewing could be managed in a simpler
way as part of the MM section instead of having to define a new one which
duplicates half of it, but until we have such support...

> ---
>  MAINTAINERS | 11 +++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b6c5c92e49db..129a237b7880 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14121,6 +14121,17 @@ F:   tools/mm/
>  F:   tools/testing/selftests/mm/
>  N:   include/linux/page[-_]*
>  
> +MEMORY MAPPING
> +M:   Andrew Morton 
> +R:   Liam R. Howlett 
> +R:   Vlastimil Babka 
> +R:   Lorenzo Stoakes 
> +L:   linux...@kvack.org
> +S:   Maintained
> +W:   http://www.linux-mm.org
> +T:   git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F:   mm/mmap.c
> +
>  MEMORY TECHNOLOGY DEVICES (MTD)
>  M:   Miquel Raynal 
>  M:   Richard Weinberger 




Re: [PATCH v5 0/4] Add samsung-matisselte and common matisse dtsi

2024-02-20 Thread Bjorn Andersson


On Tue, 31 Oct 2023 15:00:54 +0100, Stefan Hansson wrote:
> This series adds a common samsung-matisse dtsi and reworks
> samsung-matisse-wifi to use it, and introduces samsung-matisselte. I
> choose matisselte over matisse-lte as this is how most other devices
> (klte, s3ve3g) do it and it is the codename that Samsung gave the
> device. See individual commits for more information.
> 

Applied, thanks!

[1/4] ARM: dts: qcom: samsung-matisse-common: Add initial common device tree
  (no commit info)
[2/4] dt-bindings: arm: qcom: Add Samsung Galaxy Tab 4 10.1 LTE
  commit: 2a478a521876d9daea016be3e9513a67871169b5
[3/4] ARM: dts: qcom: Add support for Samsung Galaxy Tab 4 10.1 LTE (SM-T535)
  (no commit info)
[4/4] ARM: dts: qcom: samsung-matisse-common: Add UART
  (no commit info)

Best regards,
-- 
Bjorn Andersson 



[RFC PATCH 11/14] csky/thread_info: Introduce TIF_NOTIFY_IPI flag

2024-02-20 Thread K Prateek Nayak
Add support for TIF_NOTIFY_IPI on C-SKY. With TIF_NOTIFY_IPI, a sender
sending an IPI to an idle CPU in TIF_POLLING mode will set the
TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
avoids spurious calls to schedule_idle() in cases where an IPI does not
necessarily wake up a task on the idle CPU.
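
For context only, a hedged sketch of the generic sender-side idea this flag
enables. It is not part of this arch-only patch; the helper name and exact
wiring are assumptions, and the real logic lives in the scheduler/IPI patches
of the series:

/* Returns true if setting TIF_NOTIFY_IPI was enough and no IPI is needed. */
static bool notify_polling_idle_cpu(struct task_struct *idle_task)
{
	struct thread_info *ti = task_thread_info(idle_task);

	if (test_ti_thread_flag(ti, TIF_POLLING_NRFLAG)) {
		/* The poll loop sees the flag and exits idle; no reschedule forced. */
		set_ti_thread_flag(ti, TIF_NOTIFY_IPI);
		return true;
	}

	return false;	/* target not polling: the caller sends a real IPI */
}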

Cc: Guo Ren 
Cc: "Rafael J. Wysocki" 
Cc: Daniel Lezcano 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
Cc: Valentin Schneider 
Cc: K Prateek Nayak 
Cc: linux-c...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak 
---
 arch/csky/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/csky/include/asm/thread_info.h 
b/arch/csky/include/asm/thread_info.h
index b5ed788f0c68..9bc7a037c476 100644
--- a/arch/csky/include/asm/thread_info.h
+++ b/arch/csky/include/asm/thread_info.h
@@ -61,6 +61,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT 5   /* syscall tracepoint instrumentation */
 #define TIF_SYSCALL_AUDIT  6   /* syscall auditing */
 #define TIF_NOTIFY_SIGNAL  7   /* signal notifications exist */
+#define TIF_NOTIFY_IPI 8   /* Pending IPI on TIF_POLLING idle CPU 
*/
 #define TIF_POLLING_NRFLAG 16  /* poll_idle() is TIF_NEED_RESCHED */
 #define TIF_MEMDIE 18  /* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK20  /* restore signal mask in do_signal() */
@@ -73,6 +74,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACEPOINT(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_SYSCALL_AUDIT (1 << TIF_SYSCALL_AUDIT)
 #define _TIF_NOTIFY_SIGNAL (1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_NOTIFY_IPI(1 << TIF_NOTIFY_IPI)
 #define _TIF_UPROBE(1 << TIF_UPROBE)
 #define _TIF_POLLING_NRFLAG(1 << TIF_POLLING_NRFLAG)
 #define _TIF_MEMDIE(1 << TIF_MEMDIE)
-- 
2.34.1




[RFC PATCH 07/14] openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag

2024-02-20 Thread K Prateek Nayak
Add support for TIF_NOTIFY_IPI on OpenRISC. With TIF_NOTIFY_IPI, a
sender sending an IPI to an idle CPU in TIF_POLLING mode will set the
TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the
CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This
avoids spurious calls to schedule_idle() in cases where an IPI does not
necessarily wake up a task on the idle CPU.

Cc: Jonas Bonn 
Cc: Stefan Kristiansson 
Cc: Stafford Horne 
Cc: "Rafael J. Wysocki" 
Cc: Daniel Lezcano 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Juri Lelli 
Cc: Vincent Guittot 
Cc: Dietmar Eggemann 
Cc: Steven Rostedt 
Cc: Ben Segall 
Cc: Mel Gorman 
Cc: Daniel Bristot de Oliveira 
Cc: Valentin Schneider 
Cc: K Prateek Nayak 
Cc: linux-openr...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak 
---
 arch/openrisc/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/openrisc/include/asm/thread_info.h 
b/arch/openrisc/include/asm/thread_info.h
index 4af3049c34c2..6a386703bc43 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ register struct thread_info *current_thread_info_reg 
asm("r10");
 * mode
 */
 #define TIF_NOTIFY_SIGNAL  5   /* signal notifications exist */
+#define TIF_NOTIFY_IPI 6   /* Pending IPI on TIF_POLLING idle CPU 
*/
 #define TIF_SYSCALL_TRACEPOINT  8   /* for ftrace syscall instrumentation 
*/
 #define TIF_RESTORE_SIGMASK 9
 #define TIF_POLLING_NRFLAG 16  /* true if poll_idle() is polling   
 * TIF_NEED_RESCHED
@@ -104,6 +105,7 @@ register struct thread_info *current_thread_info_reg 
asm("r10");
 #define _TIF_NEED_RESCHED  (1<

Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop

2024-02-20 Thread Steven Rostedt
On Tue, 20 Feb 2024 09:50:13 -0500
Mathieu Desnoyers  wrote:

> On 2024-02-20 09:19, Steven Rostedt wrote:
> > On Mon, 19 Feb 2024 18:20:32 -0500
> > Steven Rostedt  wrote:
> >   
> >> Instead of using local_add_return() to reserve the ring buffer data,
> >> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> >> reservation with the time keeping code.
> >>
> >> Although, it does not get rid of the double time stamps (before_stamp and
> >> write_stamp), using cmpxchg() does get rid of the more complex case when
> >> an interrupting event occurs between getting the timestamps and reserving
> >> the data, as when that happens, it just tries again instead of dealing
> >> with it.
> >>
> >> Before we had:
> >>
> >>w = local_read(_page->write);
> >>/* get time stamps */
> >>write = local_add_return(length, _page->write);
> >>if (write - length == w) {
> >>/* do simple case */
> >>} else {
> >>/* do complex case */
> >>}
> >>
> >> By switching the local_add_return() to a local_try_cmpxchg() it can now be:
> >>
> >> w = local_read(_page->write);
> >>   again:
> >>/* get time stamps */
> >>if (!local_try_cmpxchg(_page->write, , w + length))
> >>goto again;
> >>
> >> /* do simple case */  
> > 
> > Something about this logic is causing __rb_next_reserve() to sometimes
> > always return -EAGAIN and triggering the:
> > 
> >  RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)
> > 
> > Which disables the ring buffer.
> > 
> > I'm not sure what it is, but until I do, I'm removing the patch from my
> > queue.  
> 
> Try resetting the info->add_timestamp flags to add_ts_default on goto again
> within __rb_reserve_next().
>

I was looking at that too, but I don't know how it will make a difference.

Note, the test that fails is in my test suite, and takes about a half hour
to get there. Running that suite takes up resources (it's my main test
suite for all changes). I'm currently testing other patches so I either
need to figure it out through inspection, or this will need to wait a while.

-- Steve



Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop

2024-02-20 Thread Mathieu Desnoyers

On 2024-02-20 09:19, Steven Rostedt wrote:

On Mon, 19 Feb 2024 18:20:32 -0500
Steven Rostedt  wrote:


Instead of using local_add_return() to reserve the ring buffer data,
Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
reservation with the time keeping code.

Although, it does not get rid of the double time stamps (before_stamp and
write_stamp), using cmpxchg() does get rid of the more complex case when
an interrupting event occurs between getting the timestamps and reserving
the data, as when that happens, it just tries again instead of dealing
with it.

Before we had:

w = local_read(_page->write);
/* get time stamps */
write = local_add_return(length, _page->write);
if (write - length == w) {
/* do simple case */
} else {
/* do complex case */
}

By switching the local_add_return() to a local_try_cmpxchg() it can now be:

 w = local_read(&tail_page->write);
  again:
/* get time stamps */
if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
goto again;

 /* do simple case */


Something about this logic is causing __rb_reserve_next() to sometimes
get stuck returning -EAGAIN, triggering the:

 RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer.

I'm not sure what it is, but until I do, I'm removing the patch from my
queue.


Try resetting the info->add_timestamp flags to add_ts_default on goto again
within __rb_reserve_next().

Thanks,

Mathieu


--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com




[PATCH] ring-buffer: Do not let subbuf be bigger than write mask

2024-02-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The data on the subbuffer is measured by a write variable that also
contains status flags. The counter itself is just 20 bits in length. If the
subbuffer is bigger than the counter can represent, it will fail.

Make sure that the subbuffer cannot be set to a size greater than what the
counter that tracks the data on the subbuffer can describe.
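
For reference, with a 20-bit counter the bound works out as follows
(assuming RB_WRITE_MASK is the all-ones 20-bit mask implied by the
changelog above):

	RB_WRITE_MASK + 1 = 2^20 = 1048576 bytes = 1 MiB

so any sub-buffer order that yields a size above 1 MiB is rejected by the
new check.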

Fixes: 2808e31ec12e5 ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/ring_buffer.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa54266f5e1f..3852f3b001cc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5878,6 +5878,10 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
if (psize <= BUF_PAGE_HDR_SIZE)
return -EINVAL;
 
+   /* Size of a subbuf cannot be greater than the write counter */
+   if (psize > RB_WRITE_MASK + 1)
+   return -EINVAL;
+
old_order = buffer->subbuf_order;
old_size = buffer->subbuf_size;
 
-- 
2.43.0




Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop

2024-02-20 Thread Steven Rostedt
On Mon, 19 Feb 2024 18:20:32 -0500
Steven Rostedt  wrote:

> Instead of using local_add_return() to reserve the ring buffer data,
> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> reservation with the time keeping code.
> 
> Although, it does not get rid of the double time stamps (before_stamp and
> write_stamp), using cmpxchg() does get rid of the more complex case when
> an interrupting event occurs between getting the timestamps and reserving
> the data, as when that happens, it just tries again instead of dealing
> with it.
> 
> Before we had:
> 
>   w = local_read(&tail_page->write);
>   /* get time stamps */
>   write = local_add_return(length, &tail_page->write);
>   if (write - length == w) {
>   /* do simple case */
>   } else {
>   /* do complex case */
>   }
> 
> By switching the local_add_return() to a local_try_cmpxchg() it can now be:
> 
>    w = local_read(&tail_page->write);
>  again:
>   /* get time stamps */
>   if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
>   goto again;
> 
>/* do simple case */

Something about this logic is causing __rb_reserve_next() to sometimes
get stuck returning -EAGAIN, triggering the:

RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer.

I'm not sure what it is, but until I do, I'm removing the patch from my
queue.

-- Steve





[PATCH v4 3/3] tracing: Move saved_cmdline code into trace_sched_switch.c

2024-02-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The code that handles saved_cmdlines is split between the trace.c file and
the trace_sched_switch.c file. There's some history to this. The
trace_sched_switch.c file was originally created to handle the sched_switch
tracer, which was deprecated once the sched_switch trace event made it
obsolete. That file never got deleted, though, as it still had some code to
help with saved_cmdlines. Meanwhile, trace.c has grown tremendously. Just
move all the saved_cmdlines code into trace_sched_switch.c as that's the
only reason that file still exists, and trace.c has gotten too big.

No functional changes.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c  | 515 +-
 kernel/trace/trace.h  |  10 +
 kernel/trace/trace_sched_switch.c | 515 ++
 3 files changed, 528 insertions(+), 512 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 06c593fc93d0..50fab999e72e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -39,7 +39,6 @@
 #include 
 #include 
 #include 
-#include <linux/kmemleak.h>
 #include 
 #include 
 #include 
@@ -105,7 +104,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 
bit, int set)
  * tracing is active, only save the comm when a trace event
  * occurred.
  */
-static DEFINE_PER_CPU(bool, trace_taskinfo_save);
+DEFINE_PER_CPU(bool, trace_taskinfo_save);
 
 /*
  * Kill all tracing for good (never come back).
@@ -2299,96 +2298,6 @@ void tracing_reset_all_online_cpus(void)
 mutex_unlock(&trace_types_lock);
 }
 
-/*
- * The tgid_map array maps from pid to tgid; i.e. the value stored at index i
- * is the tgid last observed corresponding to pid=i.
- */
-static int *tgid_map;
-
-/* The maximum valid index into tgid_map. */
-static size_t tgid_map_max;
-
-#define SAVED_CMDLINES_DEFAULT 128
-#define NO_CMDLINE_MAP UINT_MAX
-/*
- * Preemption must be disabled before acquiring trace_cmdline_lock.
- * The various trace_arrays' max_lock must be acquired in a context
- * where interrupt is disabled.
- */
-static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
-struct saved_cmdlines_buffer {
-   unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
-   unsigned *map_cmdline_to_pid;
-   unsigned cmdline_num;
-   int cmdline_idx;
-   char saved_cmdlines[];
-};
-static struct saved_cmdlines_buffer *savedcmd;
-
-/* Holds the size of a cmdline and pid element */
-#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s)  \
-   (TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0]))
-
-static inline char *get_saved_cmdlines(int idx)
-{
-   return &savedcmd->saved_cmdlines[idx * TASK_COMM_LEN];
-}
-
-static inline void set_cmdline(int idx, const char *cmdline)
-{
-   strncpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
-}
-
-static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
-{
-   int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
-
-   kmemleak_free(s);
-   free_pages((unsigned long)s, order);
-}
-
-static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
-{
-   struct saved_cmdlines_buffer *s;
-   struct page *page;
-   int orig_size, size;
-   int order;
-
-   /* Figure out how much is needed to hold the given number of cmdlines */
-   orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
-   order = get_order(orig_size);
-   size = 1 << (order + PAGE_SHIFT);
-   page = alloc_pages(GFP_KERNEL, order);
-   if (!page)
-   return NULL;
-
-   s = page_address(page);
-   kmemleak_alloc(s, size, 1, GFP_KERNEL);
-   memset(s, 0, sizeof(*s));
-
-   /* Round up to actual allocation */
-   val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
-   s->cmdline_num = val;
-
-   /* Place map_cmdline_to_pid array right after saved_cmdlines */
-   s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN];
-
-   s->cmdline_idx = 0;
-   memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
-  sizeof(s->map_pid_to_cmdline));
-   memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
-  val * sizeof(*s->map_cmdline_to_pid));
-
-   return s;
-}
-
-static int trace_create_savedcmd(void)
-{
-   savedcmd = allocate_cmdlines_buffer(SAVED_CMDLINES_DEFAULT);
-
-   return savedcmd ? 0 : -ENOMEM;
-}
-
 int is_tracing_stopped(void)
 {
return global_trace.stop_count;
@@ -2481,201 +2390,6 @@ void tracing_stop(void)
 return tracing_stop_tr(&global_trace);
 }
 
-static int trace_save_cmdline(struct task_struct *tsk)
-{
-   unsigned tpid, idx;
-
-   /* treat recording of idle task as a success */
-   if (!tsk->pid)
-   return 1;
-
-   tpid = tsk->pid & (PID_MAX_DEFAULT - 1);
-
-   /*
-* It's not the end of the world if we don't get
-* the lock, but we also don't want to spin
-* nor do we want 

[PATCH v4 2/3] tracing: Move open coded processing of tgid_map into helper function

2024-02-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

In preparation of moving the saved_cmdlines logic out of trace.c and into
trace_sched_switch.c, replace the open coded manipulation of tgid_map in
set_tracer_flag() into a helper function trace_alloc_tgid_map() so that it
can be easily moved into trace_sched_switch.c without changing existing
functions in trace.c.

No functional changes.

Signed-off-by: Steven Rostedt (Google) 
---
 kernel/trace/trace.c | 38 +++---
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 52faa30e64ed..06c593fc93d0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5432,10 +5432,31 @@ int trace_keep_overwrite(struct tracer *tracer, u32 
mask, int set)
return 0;
 }
 
-int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
+static int trace_alloc_tgid_map(void)
 {
int *map;
 
+   if (tgid_map)
+   return 0;
+
+   tgid_map_max = pid_max;
+   map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
+  GFP_KERNEL);
+   if (!map)
+   return -ENOMEM;
+
+   /*
+* Pairs with smp_load_acquire() in
+* trace_find_tgid_ptr() to ensure that if it observes
+* the tgid_map we just allocated then it also observes
+* the corresponding tgid_map_max value.
+*/
+   smp_store_release(&tgid_map, map);
+   return 0;
+}
+
+int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
+{
if ((mask == TRACE_ITER_RECORD_TGID) ||
(mask == TRACE_ITER_RECORD_CMD))
lockdep_assert_held(_mutex);
@@ -5458,20 +5479,7 @@ int set_tracer_flag(struct trace_array *tr, unsigned int 
mask, int enabled)
trace_event_enable_cmd_record(enabled);
 
if (mask == TRACE_ITER_RECORD_TGID) {
-   if (!tgid_map) {
-   tgid_map_max = pid_max;
-   map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
-  GFP_KERNEL);
-
-   /*
-* Pairs with smp_load_acquire() in
-* trace_find_tgid_ptr() to ensure that if it observes
-* the tgid_map we just allocated then it also observes
-* the corresponding tgid_map_max value.
-*/
-   smp_store_release(&tgid_map, map);
-   }
-   if (!tgid_map) {
+   if (trace_alloc_tgid_map() < 0) {
tr->trace_flags &= ~TRACE_ITER_RECORD_TGID;
return -ENOMEM;
}
-- 
2.43.0





[PATCH v4 1/3] tracing: Have saved_cmdlines arrays all in one allocation

2024-02-20 Thread Steven Rostedt
From: "Steven Rostedt (Google)" 

The saved_cmdlines have three arrays for mapping PIDs to COMMs:

 - map_pid_to_cmdline[]
 - map_cmdline_to_pid[]
 - saved_cmdlines

The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index
into the other arrays. The map_cmdline_to_pid[] is a mapping back to the
full pid as it can be larger than PID_MAX_DEFAULT. And the
saved_cmdlines[] just holds the COMMs associated to the pids.

Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated
together (in reality saved_cmdlines simply lives in the memory left over
from rounding the structure's allocation up to a power of two). The
map_cmdline_to_pid[] array is allocated separately.

Since the rounding to a power of two is rather generous (it leaves room for
roughly 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[]
array in that space. (This drops the default to roughly 6000 elements, which is
still plenty for most use cases.) This saves even more memory as the
map_cmdline_to_pid[] array no longer needs to be allocated separately.
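
As a rough worked example of where those numbers come from (assuming the
usual defaults of TASK_COMM_LEN = 16, PID_MAX_DEFAULT = 32768, 4-byte
unsigned entries and 4 KiB pages -- not values stated in this patch):

	map_pid_to_cmdline[] alone is (32768 + 1) * 4 bytes ~= 128 KiB, so
	the structure plus the default 128 cmdlines rounds up to a 256 KiB
	allocation, leaving roughly 128 KiB for the flexible arrays:

	before: ~128 KiB / 16 bytes per entry        ~= 8000 entries
	after:  ~128 KiB / (16 + 4) bytes per entry  ~= 6500 entries

which is where the "8000" and "6000" figures above come from.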

Link: 
https://lore.kernel.org/linux-trace-kernel/20240212174011.06821...@gandalf.local.home/

Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic")
Acked-by: Masami Hiramatsu (Google) 
Signed-off-by: Steven Rostedt (Google) 
---
Changes since v3: 
https://lore.kernel.org/linux-trace-kernel/20240216210150.379809...@goodmis.org

- Do not kfree() map_cmdline_to_pid now that it is included in the
  page allocations.

 kernel/trace/trace.c | 18 --
 1 file changed, 8 insertions(+), 10 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..52faa30e64ed 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -2325,6 +2325,10 @@ struct saved_cmdlines_buffer {
 };
 static struct saved_cmdlines_buffer *savedcmd;
 
+/* Holds the size of a cmdline and pid element */
+#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s)  \
+   (TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0]))
+
 static inline char *get_saved_cmdlines(int idx)
 {
return >saved_cmdlines[idx * TASK_COMM_LEN];
@@ -2339,7 +2343,6 @@ static void free_saved_cmdlines_buffer(struct 
saved_cmdlines_buffer *s)
 {
int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
 
-   kfree(s->map_cmdline_to_pid);
kmemleak_free(s);
free_pages((unsigned long)s, order);
 }
@@ -2352,7 +2355,7 @@ static struct saved_cmdlines_buffer 
*allocate_cmdlines_buffer(unsigned int val)
int order;
 
/* Figure out how much is needed to hold the given number of cmdlines */
-   orig_size = sizeof(*s) + val * TASK_COMM_LEN;
+   orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
order = get_order(orig_size);
size = 1 << (order + PAGE_SHIFT);
page = alloc_pages(GFP_KERNEL, order);
@@ -2364,16 +2367,11 @@ static struct saved_cmdlines_buffer 
*allocate_cmdlines_buffer(unsigned int val)
memset(s, 0, sizeof(*s));
 
/* Round up to actual allocation */
-   val = (size - sizeof(*s)) / TASK_COMM_LEN;
+   val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
s->cmdline_num = val;
 
-   s->map_cmdline_to_pid = kmalloc_array(val,
- sizeof(*s->map_cmdline_to_pid),
- GFP_KERNEL);
-   if (!s->map_cmdline_to_pid) {
-   free_saved_cmdlines_buffer(s);
-   return NULL;
-   }
+   /* Place map_cmdline_to_pid array right after saved_cmdlines */
+   s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN];
 
s->cmdline_idx = 0;
 memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
-- 
2.43.0





[PATCH v4 0/3] tracing: Changes to saved_cmdlines

2024-02-20 Thread Steven Rostedt


The first patch consolidates the map_cmdline_to_pid[] array into the pages
allocated for the saved_cmdlines structure.

The second patch moves some open coded saved_cmdlines logic in trace.c
into a helper function to make the final move cleaner.

The third patch simply moves code around. The goal is to keep all the
saved_cmdlines logic in one place. Currently it's in two files (trace.c and
trace_sched_switch.c). Since trace.c is awfully large, move all the code to
trace_sched_switch.c, as that's its only purpose today anyway.

Changes since v3: 
https://lore.kernel.org/linux-trace-kernel/20240216210047.584712...@goodmis.org/

- The map_cmdline_to_pid field was moved into the pages allocated for the
  structure, replacing the kmalloc(). But kfree() was still being called on
  that field when the structure was freed, which caused memory corruption.

Steven Rostedt (Google) (3):
  tracing: Have saved_cmdlines arrays all in one allocation
  tracing: Move open coded processing of tgid_map into helper function
  tracing: Move saved_cmdline code into trace_sched_switch.c


 kernel/trace/trace.c  | 509 +
 kernel/trace/trace.h  |  10 +
 kernel/trace/trace_sched_switch.c | 515 ++
 3 files changed, 528 insertions(+), 506 deletions(-)



Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Michal Koutný
On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai"  
wrote:
> I am not sure, but is it possible or legal for an ancestor to have a lower
> limit than its children?

Why not?
It is desired for proper hierarchical delegation, and the tightest limit
among the ancestors applies (cf. memory.max).

Michal
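
The effect can be seen with a standalone min-over-ancestors walk, mirroring
the shape of the sgx_epc_cgroup_max_pages_to_root() helper quoted elsewhere
in this thread (toy types and made-up limits only, not the kernel's misc
cgroup code):

#include <stdio.h>
#include <stdint.h>

struct toy_cg {
	const char *name;
	uint64_t max;                   /* limit in bytes */
	struct toy_cg *parent;
};

static uint64_t effective_limit(const struct toy_cg *cg)
{
	uint64_t m = UINT64_MAX;

	for (; cg; cg = cg->parent)
		if (cg->max < m)
			m = cg->max;
	return m;
}

int main(void)
{
	struct toy_cg root   = { "root",   UINT64_MAX,   NULL    };
	struct toy_cg parent = { "parent", 128ull << 20, &root   }; /* 128 MiB */
	struct toy_cg child  = { "child",  512ull << 20, &parent }; /* 512 MiB */

	/* The child's own limit is larger, but the parent's still applies. */
	printf("effective limit of %s: %llu MiB\n", child.name,
	       (unsigned long long)(effective_limit(&child) >> 20));
	return 0;
}

So a child may well be configured with a larger value than its ancestor; the
ancestor's tighter limit is simply what ends up being enforced.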




[PATCH] arm64: dts: qcom: sdm632-fairphone-fp3: enable USB-C port handling

2024-02-20 Thread Luca Weiss
Add the definition for the USB-C connector found on this phone and hook
up the relevant bits. This enables USB role switching.

Signed-off-by: Luca Weiss 
---
 arch/arm64/boot/dts/qcom/msm8953.dtsi | 14 ++
 arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts | 31 +--
 2 files changed, 43 insertions(+), 2 deletions(-)

diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi 
b/arch/arm64/boot/dts/qcom/msm8953.dtsi
index 383657407c6f..6726d15c38c3 100644
--- a/arch/arm64/boot/dts/qcom/msm8953.dtsi
+++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi
@@ -1330,6 +1330,20 @@ usb3_dwc3: usb@7000000 {
snps,hird-threshold = /bits/ 8 <0x00>;
 
maximum-speed = "high-speed";
+
+   usb-role-switch;
+
+   ports {
+   #address-cells = <1>;
+   #size-cells = <0>;
+
+   port@0 {
+   reg = <0>;
+
+   usb_dwc3_hs: endpoint {
+   };
+   };
+   };
};
};
 
diff --git a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts 
b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
index c49a196189e3..2a65849f0da2 100644
--- a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
+++ b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts
@@ -162,6 +162,33 @@ led@3 {
};
 };
 
+&pmi632_typec {
+   status = "okay";
+
+   connector {
+   compatible = "usb-c-connector";
+
+   power-role = "dual";
+   data-role = "dual";
+   self-powered;
+
+   typec-power-opmode = "default";
+   pd-disable;
+
+   port {
+   pmi632_hs_in: endpoint {
+   remote-endpoint = <&usb_dwc3_hs>;
+   };
+   };
+   };
+};
+
+&pmi632_vbus {
+   regulator-min-microamp = <500000>;
+   regulator-max-microamp = <1000000>;
+   status = "okay";
+};
+
 &sdhc_1 {
status = "okay";
 vmmc-supply = <&pm8953_l8>;
@@ -286,8 +313,8 @@  {
status = "okay";
 };
 
-&usb3_dwc3 {
-   dr_mode = "peripheral";
+&usb_dwc3_hs {
+   remote-endpoint = <&pmi632_hs_in>;
 };
 
  {

---
base-commit: 103eb8e019aefd616735200ce46833bc74cfe132
change-id: 20240220-fp3-typec-25eb002db8b5

Best regards,
-- 
Luca Weiss 




[PATCH v19 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-20 Thread ankita
From: Ankit Agrawal 

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the chip-to-chip
interconnect that is exposed as a contiguous physically addressable
region on the host. This device memory aperture can be obtained from host
ACPI table using device_property_read_u64(), according to the FW
specification. Since the device memory is cache coherent with the CPU,
it can be mmapped into the user VMA with a cacheable mapping using
remap_pfn_range() and used like regular RAM. The device memory
is not added to the host kernel, but mapped directly, as this avoids
the memory wastage of allocating struct pages for it.
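
As a rough sense of that saving (assuming the common 64-byte struct page and
4 KiB base pages -- not figures from this patch): every 1 GiB of device
memory would otherwise cost

	(1 GiB / 4 KiB) * 64 bytes = 262144 * 64 bytes = 16 MiB

of struct page metadata in host RAM, i.e. about 1.6% of the mapped size.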

There is also a requirement of a minimum reserved 1G uncached region
(termed as resmem) to support the Multi-Instance GPU (MIG) feature [1].
This is to work around a HW defect. Based on [2], the requisite properties
(uncached, unaligned access) can be achieved through a VM mapping (S1)
of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
a different non-cached property to the reserved 1G region, it needs to
be carved out from the device memory and mapped as a separate region
in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the
Qemu VMA page properties (pgprot) as NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved
(resmem) and the cached rest of the device memory (termed as usemem) as
separate VFIO 64b BAR regions. This is divergent from the baremetal
approach, where the device memory is exposed as a device memory region.
The decision for a different approach was taken in view of the fact that
it would necessitate additional code in Qemu to discover and insert those
regions in the VM IPA, along with the additional VM ACPI DSDT changes to
communicate the device memory region IPA to the VM workloads. Moreover,
this behavior would have to be added to a variety of emulators (beyond
top of tree Qemu) out there desiring grace hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver
maps the uncached carved out region to the next available PCI BAR (i.e.
comprising of region 2 and 3). The cached device memory aperture is
assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
device in the VM with the uncached aperture reported as BAR2 region,
the cacheable as BAR4. The variant driver provides emulation for these
fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory
is accessed with the memory enable turned off. It synthesizes ~0 on reads
and drops writes on such accesses. So there is no need to support
disabling/enabling the BARs through the PCI_COMMAND config space register.

The memory layout on the host looks like the following:
                  devmem (memlength)
 |---------------------------------------------------|
 |--------------cached-----------------|------NC-----|
 |                                      |
 usemem.memphys                         resmem.memphys

PCI BARs need to be aligned to the power-of-2, but the actual memory on the
device may not. A read or write access to the physical address from the
last device PFN up to the next power-of-2 aligned physical address
results in reading ~0 and dropped writes. Note that the GPU device
driver [6] is capable of knowing the exact device memory size through
separate means. The device memory size is primarily kept in the system
ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5]
to the VM kernel as memblocks. Hence make the usable memory size memblock
(MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and
VFIO driver. The VM device driver makes use of the same value in its
calculation to determine the USEMEM size.

Currently there is no provision in KVM for a S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
As previously mentioned, resmem is mapped pgprot_writecombine(), that
sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
proposed changes in [3] and [4], KVM marks the region with
MemAttr[2:0]=0b101 in S2.

If the device memory properties are not present, the driver registers the
vfio-pci-core function pointers. Since there are no ACPI memory properties
generated for the VM, the variant driver inside the VM will only use
the vfio-pci-core ops 

[PATCH v19 2/3] vfio/pci: rename and export range_intersect_range

2024-02-20 Thread ankita
From: Ankit Agrawal 

range_intersect_range determines the overlap between two ranges. If there
is an overlap, the helper function returns the overlapping offset and size.

The VFIO PCI variant driver emulates the PCI config space BAR offset
registers. These offsets may be accessed for read/write with a variety
of lengths including sub-word sizes from sub-word offsets. The driver
makes use of this helper function to read/write the targeted part of
the emulated register.

Make this a vfio_pci_core function, rename and export as GPL. Also
update references in virtio driver.
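
A standalone sketch of the overlap rules being exported here, with the same
two branches as the helper in the diff below but plain userspace types; the
offsets in main() are made-up example values, not taken from the virtio or
nvgrace-gpu drivers:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

static bool range_intersect(unsigned long long buf_start, size_t buf_cnt,
			    unsigned long long reg_start, size_t reg_cnt,
			    unsigned long long *buf_offset,
			    size_t *count, size_t *reg_offset)
{
	/* buffer starts at or before the register and reaches into it */
	if (buf_start <= reg_start && buf_start + buf_cnt > reg_start) {
		*buf_offset = reg_start - buf_start;
		*count = min_sz(reg_cnt, buf_start + buf_cnt - reg_start);
		*reg_offset = 0;
		return true;
	}
	/* buffer starts inside the register */
	if (buf_start > reg_start && buf_start < reg_start + reg_cnt) {
		*buf_offset = 0;
		*count = min_sz(buf_cnt, reg_start + reg_cnt - buf_start);
		*reg_offset = buf_start - reg_start;
		return true;
	}
	return false;
}

int main(void)
{
	unsigned long long boff;
	size_t cnt, roff;

	/* An 8-byte read at 0x14 clips against a 4-byte register at 0x18:
	 * the last 4 bytes of the buffer cover the whole register. */
	if (range_intersect(0x14, 8, 0x18, 4, &boff, &cnt, &roff))
		printf("buf_off=%llu count=%zu reg_off=%zu\n", boff, cnt, roff);

	/* A 2-byte read at 0x1a lands in the middle of the same register. */
	if (range_intersect(0x1a, 2, 0x18, 4, &boff, &cnt, &roff))
		printf("buf_off=%llu count=%zu reg_off=%zu\n", boff, cnt, roff);
	return 0;
}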

Reviewed-by: Kevin Tian 
Reviewed-by: Yishai Hadas 
Signed-off-by: Ankit Agrawal 
---
 drivers/vfio/pci/vfio_pci_config.c | 42 +
 drivers/vfio/pci/virtio/main.c | 72 +++---
 include/linux/vfio_pci_core.h  |  5 +++
 3 files changed, 73 insertions(+), 46 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_config.c 
b/drivers/vfio/pci/vfio_pci_config.c
index 672a1804af6a..971e32bc0bb4 100644
--- a/drivers/vfio/pci/vfio_pci_config.c
+++ b/drivers/vfio/pci/vfio_pci_config.c
@@ -1966,3 +1966,45 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device 
*vdev, char __user *buf,
 
return done;
 }
+
+/**
+ * vfio_pci_core_range_intersect_range() - Determine overlap between a buffer
+ *and register offset ranges.
+ * @buf_start: start offset of the buffer
+ * @buf_cnt:   number of buffer bytes
+ * @reg_start: start register offset
+ * @reg_cnt:   number of register bytes
+ * @buf_offset:start offset of overlap in the buffer
+ * @intersect_count:   number of overlapping bytes
+ * @register_offset:   start offset of overlap in register
+ *
+ * Returns: true if there is overlap, false if not.
+ * The overlap start and size is returned through function args.
+ */
+bool vfio_pci_core_range_intersect_range(loff_t buf_start, size_t buf_cnt,
+loff_t reg_start, size_t reg_cnt,
+loff_t *buf_offset,
+size_t *intersect_count,
+size_t *register_offset)
+{
+   if (buf_start <= reg_start &&
+   buf_start + buf_cnt > reg_start) {
+   *buf_offset = reg_start - buf_start;
+   *intersect_count = min_t(size_t, reg_cnt,
+buf_start + buf_cnt - reg_start);
+   *register_offset = 0;
+   return true;
+   }
+
+   if (buf_start > reg_start &&
+   buf_start < reg_start + reg_cnt) {
+   *buf_offset = 0;
+   *intersect_count = min_t(size_t, buf_cnt,
+reg_start + reg_cnt - buf_start);
+   *register_offset = buf_start - reg_start;
+   return true;
+   }
+
+   return false;
+}
+EXPORT_SYMBOL_GPL(vfio_pci_core_range_intersect_range);
diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c
index d5af683837d3..b5d3a8c5bbc9 100644
--- a/drivers/vfio/pci/virtio/main.c
+++ b/drivers/vfio/pci/virtio/main.c
@@ -132,33 +132,6 @@ virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device 
*virtvdev,
return ret ? ret : count;
 }
 
-static bool range_intersect_range(loff_t range1_start, size_t count1,
- loff_t range2_start, size_t count2,
- loff_t *start_offset,
- size_t *intersect_count,
- size_t *register_offset)
-{
-   if (range1_start <= range2_start &&
-   range1_start + count1 > range2_start) {
-   *start_offset = range2_start - range1_start;
-   *intersect_count = min_t(size_t, count2,
-range1_start + count1 - range2_start);
-   *register_offset = 0;
-   return true;
-   }
-
-   if (range1_start > range2_start &&
-   range1_start < range2_start + count2) {
-   *start_offset = 0;
-   *intersect_count = min_t(size_t, count1,
-range2_start + count2 - range1_start);
-   *register_offset = range1_start - range2_start;
-   return true;
-   }
-
-   return false;
-}
-
 static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev,
char __user *buf, size_t count,
loff_t *ppos)
@@ -178,16 +151,18 @@ static ssize_t virtiovf_pci_read_config(struct 
vfio_device *core_vdev,
if (ret < 0)
return ret;
 
-   if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16),
-                             &copy_offset, &copy_count, &register_offset)) {
+   if (vfio_pci_core_range_intersect_range(pos, count, PCI_DEVICE_ID,
+

[PATCH v19 1/3] vfio/pci: rename and export do_io_rw()

2024-02-20 Thread ankita
From: Ankit Agrawal 

do_io_rw() is used to read/write to the device MMIO. The Grace Hopper
VFIO PCI variant driver requires this functionality to read/write to
its memory.

Rename this as a vfio_pci_core function and export it as GPL.

Reviewed-by: Kevin Tian 
Reviewed-by: Yishai Hadas 
Signed-off-by: Ankit Agrawal 
---
 drivers/vfio/pci/vfio_pci_rdwr.c | 16 +---
 include/linux/vfio_pci_core.h|  5 -
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c
index 07fea08ea8a2..03b8f7ada1ac 100644
--- a/drivers/vfio/pci/vfio_pci_rdwr.c
+++ b/drivers/vfio/pci/vfio_pci_rdwr.c
@@ -96,10 +96,10 @@ VFIO_IOREAD(32)
  * reads with -1.  This is intended for handling MSI-X vector tables and
  * leftover space for ROM BARs.
  */
-static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem,
-   void __iomem *io, char __user *buf,
-   loff_t off, size_t count, size_t x_start,
-   size_t x_end, bool iswrite)
+ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool 
test_mem,
+  void __iomem *io, char __user *buf,
+  loff_t off, size_t count, size_t x_start,
+  size_t x_end, bool iswrite)
 {
ssize_t done = 0;
int ret;
@@ -201,6 +201,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, 
bool test_mem,
 
return done;
 }
+EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw);
 
 int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar)
 {
@@ -279,8 +280,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, 
char __user *buf,
x_end = vdev->msix_offset + vdev->msix_size;
}
 
-   done = do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos,
-   count, x_start, x_end, iswrite);
+   done = vfio_pci_core_do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, 
buf, pos,
+ count, x_start, x_end, iswrite);
 
if (done >= 0)
*ppos += done;
@@ -348,7 +349,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, 
char __user *buf,
 * probing, so we don't currently worry about access in relation
 * to the memory enable bit in the command register.
 */
-   done = do_io_rw(vdev, false, iomem, buf, off, count, 0, 0, iswrite);
+   done = vfio_pci_core_do_io_rw(vdev, false, iomem, buf, off, count,
+ 0, 0, iswrite);
 
vga_put(vdev->pdev, rsrc);
 
diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h
index 85e84b92751b..cf9480a31f3e 100644
--- a/include/linux/vfio_pci_core.h
+++ b/include/linux/vfio_pci_core.h
@@ -130,7 +130,10 @@ void vfio_pci_core_finish_enable(struct 
vfio_pci_core_device *vdev);
 int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar);
 pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev,
pci_channel_state_t state);
-
+ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool 
test_mem,
+  void __iomem *io, char __user *buf,
+  loff_t off, size_t count, size_t x_start,
+  size_t x_end, bool iswrite);
 #define VFIO_IOWRITE_DECLATION(size) \
 int vfio_pci_core_iowrite##size(struct vfio_pci_core_device *vdev, \
bool test_mem, u##size val, void __iomem *io);
-- 
2.34.1




[PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper

2024-02-20 Thread ankita
From: Ankit Agrawal 

NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device
for the on-chip GPU that is the logical OS representation of the
internal proprietary chip-to-chip cache coherent interconnect.

The device is peculiar compared to a real PCI device in that whilst
there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the
device, it is not used to access device memory once the faster
chip-to-chip interconnect is initialized (occurs at the time of host
system boot). The device memory is accessed instead using the
chip-to-chip interconnect that is exposed as a contiguous physically
addressable region on the host. Since the device memory is cache
coherent with the CPU, it can be mmapped into the user VMA with a
cacheable mapping and used like regular RAM. The device memory is
not added to the host kernel, but mapped directly as this reduces
memory wastage due to struct pages.

There is also a requirement of a minimum reserved 1G uncached region
(termed as resmem) to support the Multi-Instance GPU (MIG) feature [1].
This is to work around a HW defect. Based on [2], the requisite properties
(uncached, unaligned access) can be achieved through a VM mapping (S1)
of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide
a different non-cached property to the reserved 1G region, it needs to
be carved out from the device memory and mapped as a separate region
in Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets
the Qemu VMA page properties (pgprot) as NORMAL_NC.

Provide a VFIO PCI variant driver that adapts the unique device memory
representation into a more standard PCI representation facing userspace.

The variant driver exposes these two regions - the non-cached reserved
(resmem) and the cached rest of the device memory (termed as usemem) as
separate VFIO 64b BAR regions. This is divergent from the baremetal
approach, where the device memory is exposed as a device memory region.
The decision for a different approach was taken in view of the fact that
it would necessitate additional code in Qemu to discover and insert those
regions in the VM IPA, along with the additional VM ACPI DSDT changes to
communicate the device memory region IPA to the VM workloads. Moreover,
this behavior would have to be added to a variety of emulators (beyond
top of tree Qemu) out there desiring grace hopper support.

Since the device implements 64-bit BAR0, the VFIO PCI variant driver
maps the uncached carved out region to the next available PCI BAR (i.e.
comprising of region 2 and 3). The cached device memory aperture is
assigned BAR region 4 and 5. Qemu will then naturally generate a PCI
device in the VM with the uncached aperture reported as BAR2 region,
the cacheable as BAR4. The variant driver provides emulation for these
fake BARs' PCI config space offset registers.

The hardware ensures that the system does not crash when the memory
is accessed with the memory enable turned off. It synthesizes ~0 on reads
and drops writes on such accesses. So there is no need to support
disabling/enabling the BARs through the PCI_COMMAND config space register.

The memory layout on the host looks like the following:
                  devmem (memlength)
 |---------------------------------------------------|
 |--------------cached-----------------|------NC-----|
 |                                      |
 usemem.memphys                         resmem.memphys

PCI BARs need to be aligned to the power-of-2, but the actual memory on the
device may not. A read or write access to the physical address from the
last device PFN up to the next power-of-2 aligned physical address
results in reading ~0 and dropped writes. Note that the GPU device
driver [6] is capable of knowing the exact device memory size through
separate means. The device memory size is primarily kept in the system
ACPI tables for use by the VFIO PCI variant module.

Note that the usemem memory is added by the VM Nvidia device driver [5]
to the VM kernel as memblocks. Hence make the usable memory size memblock
(MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and
VFIO driver. The VM device driver makes use of the same value in its
calculation to determine the USEMEM size.

Currently there is no provision in KVM for a S2 mapping with
MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3].
As previously mentioned, resmem is mapped pgprot_writecombine(), that
sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the
proposed changes in [3] and [4], KVM marks the region with
MemAttr[2:0]=0b101 in S2.

If the device memory properties are not present, the driver registers the
vfio-pci-core function pointers. Since there are no ACPI memory properties
generated for the VM, the variant driver inside the VM will only use
the vfio-pci-core ops and hence try to map the BARs as non cached. This
is not a problem as the CPUs have FWB enabled which blocks the VM
mapping's ability to override the 

Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup

2024-02-20 Thread Huang, Kai

> +/*
> + * Get the lower bound of limits of a cgroup and its ancestors.  Used in
> + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup 
> is
> + * over its limit or its ancestors' hence reclamation is needed.
> + */
> +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup 
> *epc_cg)
> +{
> + struct misc_cg *i = epc_cg->cg;
> + u64 m = U64_MAX;
> +
> + while (i) {
> + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max));
> + i = misc_cg_parent(i);
> + }
> +
> + return m / PAGE_SIZE;
> +}

I am not sure, but is it possible or legal for an ancestor to have a lower
limit than its children?

> +
>  /**
> - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page
> + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its 
> LRUs
> + * @root:Root of the tree to check
>   *
> + * Return: %true if all cgroups under the specified root have empty LRU 
> lists.
> + * Used to avoid livelocks due to a cgroup having a non-zero charge count but
> + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or
> + * because all pages in the cgroup are unreclaimable.
> + */
> +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root)
> +{
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + bool ret = true;
> +
> + /*
> +  * Caller ensure css_root ref acquired
> +  */
> + css_root = &root->css;
> +
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> +
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> +
> + spin_lock(&epc_cg->lru.lock);
> + ret = list_empty(&epc_cg->lru.reclaimable);
> + spin_unlock(&epc_cg->lru.lock);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!ret)
> + break;
> + }
> +
> + rcu_read_unlock();
> +
> + return ret;
> +}
> +
> +/**
> + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to 
> reclaim pages
> + * @root:Root of the tree to start walking from.
> + * Return:   Number of pages reclaimed.

Just wondering, do you need to return @cnt given this function is called w/o
checking the return value?

> + */
> +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> +{
> + /*
> +  * Attempting to reclaim only a few pages will often fail and is
> +  * inefficient, while reclaiming a huge number of pages can result in
> +  * soft lockups due to holding various locks for an extended duration.
> +  */

Not sure we need this comment, given it's already implied in
sgx_reclaim_pages().  You cannot pass a value > SGX_NR_TO_SCAN anyway.

> + unsigned int nr_to_scan = SGX_NR_TO_SCAN;
> + struct cgroup_subsys_state *css_root;
> + struct cgroup_subsys_state *pos;
> + struct sgx_epc_cgroup *epc_cg;
> + unsigned int cnt;
> +
> +  /* Caller ensure css_root ref acquired */
> + css_root = &root->css;
> +
> + cnt = 0;
> + rcu_read_lock();
> + css_for_each_descendant_pre(pos, css_root) {
> + if (!css_tryget(pos))
> + break;
> + rcu_read_unlock();
> +
> + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> + cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan);
> +
> + rcu_read_lock();
> + css_put(pos);
> + if (!nr_to_scan)
> + break;
> + }
> +
> + rcu_read_unlock();
> + return cnt;
> +}

Here the @nr_to_scan is reduced by the number of pages that are isolated, but
not actually reclaimed (which is reflected by @cnt).

IIUC, it looks like you want to make this function do "each cycle" as you
mentioned in the v8 [1]:

I tested with that approach and found we can only target number of
pages  
attempted to reclaim not pages actually reclaimed due to the
uncertainty  
of how long it takes to reclaim pages. Besides targeting number of
scanned pages for each cycle is also what the ksgxd does.

If we target actual number of pages, sometimes it just takes too long.
I
saw more timeouts with the default time limit when running parallel  
selftests.

I am not sure what "sometimes it just takes too long" means, but what I am
thinking is that you are trying to write perfect but rather complicated code here.

For instance, I don't think the selftests reflect real workloads, and I believe
adjusting the limit of a given EPC cgroup shouldn't be a frequent operation,
thus it is acceptable to use easier-to-maintain, even if less perfect, code.

Here I still think having @nr_to_scan as a pointer is over-complicated.  For
example, we can still let sgx_reclaim_pages() always scan SGX_NR_TO_SCAN
pages, but give up 

Re: [PATCH v9 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup

2024-02-20 Thread Huang, Kai
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote:
> From: Sean Christopherson 
> 
> Each EPC cgroup will have an LRU structure to track reclaimable EPC pages.
> When a cgroup usage reaches its limit, the cgroup needs to reclaim pages
> from its LRU or LRUs of its descendants to make room for any new
> allocations.
> 
> To prepare for reclamation per cgroup, expose the top level reclamation
> function, sgx_reclaim_pages(), in header file for reuse. Add a parameter
> to the function to pass in an LRU so cgroups can pass in different
> tracking LRUs later.  
> 

[...]

> Add another parameter for passing in the number of
> pages to scan and make the function return the number of pages reclaimed,
> as a cgroup reclaimer may need to track reclamation progress from its
> descendants and change the number of pages to scan in subsequent calls.

Firstly, sorry for the late reply as I was away.

From the changelog, it's understandable you want to make this function return
pages that are actually reclaimed, and perhaps it's also OK to pass the number
of pages to scan.

But this doesn't explain why you need to make @nr_to_scan a pointer when
you are already returning the number of pages that are actually reclaimed.

And ...

[...]

> -/*
> - * Take a fixed number of pages from the head of the active page pool and
> - * reclaim them to the enclave's private shmem files. Skip the pages, which 
> have
> - * been accessed since the last scan. Move those pages to the tail of active
> - * page pool so that the pages get scanned in LRU like fashion.
> +/**
> + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU
> + *
> + * Take a fixed number of pages from the head of a given LRU and reclaim 
> them to
> + * the enclave's private shmem files. Skip the pages, which have been 
> accessed
> + * since the last scan. Move those pages to the tail of the list so that the
> + * pages get scanned in LRU like fashion.
>   *
>   * Batch process a chunk of pages (at the moment 16) in order to degrade 
> amount

... there's no comment to explain such a design either (@nr_to_scan as a pointer).

Btw, with this change, it seems "Take a fixed number of pages ..." and "at the
moment 16" are no longer accurate.

>   * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a 
> bit
> @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page 
> *epc_page,
>   * + EWB) but not sufficiently. Reclaiming one page at a time would also be
>   * problematic as it would increase the lock contention too much, which would
>   * halt forward progress.
> + *
> + * @lru: The LRU from which pages are reclaimed.
> + * @nr_to_scan: Pointer to the target number of pages to scan, must be less 
> than
> + *   SGX_NR_TO_SCAN.
> + * Return:   Number of pages reclaimed.
>   */
> -static void sgx_reclaim_pages(void)
> +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int 
> *nr_to_scan)
>  {
>   struct sgx_epc_page *chunk[SGX_NR_TO_SCAN];
>   struct sgx_backing backing[SGX_NR_TO_SCAN];
> @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void)
>   int ret;
>   int i;
>  
> - spin_lock(&sgx_global_lru.lock);
> - for (i = 0; i < SGX_NR_TO_SCAN; i++) {
> - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable,
> - struct sgx_epc_page, list);
> + spin_lock(&lru->lock);
> +
> + for (; *nr_to_scan > 0; --(*nr_to_scan)) {
> + epc_page = list_first_entry_or_null(&lru->reclaimable, struct sgx_epc_page, list);
>   if (!epc_page)
>   break;
>  
> @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void)
>*/
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>   }
> - spin_unlock(&sgx_global_lru.lock);
> +
> + spin_unlock(&lru->lock);
>  
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
> @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void)
>   continue;
>  
>  skip:
> - spin_lock(&sgx_global_lru.lock);
> - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable);
> - spin_unlock(&sgx_global_lru.lock);
> + spin_lock(&lru->lock);
> + list_add_tail(&epc_page->list, &lru->reclaimable);
> + spin_unlock(&lru->lock);
>  
>   kref_put(&encl_page->encl->refcount, sgx_encl_release);
>  
> @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void)
>   sgx_reclaimer_block(epc_page);
>   }
>  
> + ret = 0;
>   for (i = 0; i < cnt; i++) {
>   epc_page = chunk[i];
>   if (!epc_page)
> @@ -378,7 +387,10 @@ static void sgx_reclaim_pages(void)
>   epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED;
>  
>   sgx_free_epc_page(epc_page);
> + ret++;
>   }
> +
> + return (unsigned int)ret;
>  }
>  

Here basically the @nr_to_scan is reduced by the number of pages that are

Re: [PATCH] vduse: implement DMA sync callbacks

2024-02-20 Thread Christoph Hellwig
On Mon, Feb 19, 2024 at 06:06:06PM +0100, Maxime Coquelin wrote:
> Since commit 295525e29a5b ("virtio_net: merge dma
> operations when filling mergeable buffers"), VDUSE devices
> require support for DMA's .sync_single_for_cpu() operation
> as the memory is non-coherent between the device and CPU
> because of the use of a bounce buffer.
> 
> This patch implements both .sync_single_for_cpu() and
> sync_single_for_device() callbacks, and also skip bounce
> buffer copies during DMA map and unmap operations if the
> DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra
> copies of the same buffer.

vduse really needs to get out of implementing fake DMA operations for
something that is not DMA.




Re: WARNING in shmem_release_dquot

2024-02-20 Thread Carlos Maiolino
On Mon, Feb 19, 2024 at 08:26:20PM -0800, Hugh Dickins wrote:
> On Mon, 29 Jan 2024, Ubisectech Sirius wrote:
> 
> > Hello.
> > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. 
> > Recently, our team has discovered an issue in Linux kernel
> > 6.8.0-rc1-gecb1b8288dc7. Attached to the email is a POC file for the issue.
> >
> > Stack dump:
> > [  246.195553][ T4096] [ cut here ]
> > [  246.196540][ T4096] quota id 16384 from dquot 888051bd3000, not in 
> > rb tree!
> > [ 246.198829][ T4096] WARNING: CPU: 1 PID: 4096 at mm/shmem_quota.c:290 
> > shmem_release_dquot (mm/shmem_quota.c:290 (discriminator 3))
> > [  246.199955][ T4096] Modules linked in:
> > [  246.200435][ T4096] CPU: 1 PID: 4096 Comm: kworker/u6:6 Not tainted 
> > 6.8.0-rc1-gecb1b8288dc7 #21
> > [  246.201566][ T4096] Hardware name: QEMU Standard PC (i440FX + PIIX, 
> > 1996), BIOS 1.15.0-1 04/01/2014
> > [  246.202667][ T4096] Workqueue: events_unbound quota_release_workfn
> > [ 246.203516][ T4096] RIP: 0010:shmem_release_dquot (mm/shmem_quota.c:290 
> > (discriminator 3))
> > [ 246.204276][ T4096] Code: e8 28 d9 18 00 e9 b3 f8 ff ff e8 6e e1 c2 ff c6 
> > 05 bf e8 1b 0d 01 90 48 c7 c7 80 f0 b8 8a 4c 89 ea 44 89 e6 e8 14 6d 89 ff 
> > 90 <0f> 0b 90 90 e9 18 fb ff ff e8 f5 d8 18 00 e9 a2 fa ff ff e8 0b d9
> > All code
> > 
> >0:   e8 28 d9 18 00  call   0x18d92d
> >5:   e9 b3 f8 ff ff  jmp0xf8bd
> >a:   e8 6e e1 c2 ff  call   0xffc2e17d
> >f:   c6 05 bf e8 1b 0d 01movb   $0x1,0xd1be8bf(%rip)# 
> > 0xd1be8d5
> >   16:   90  nop
> >   17:   48 c7 c7 80 f0 b8 8amov$0x8ab8f080,%rdi
> >   1e:   4c 89 eamov%r13,%rdx
> >   21:   44 89 e6mov%r12d,%esi
> >   24:   e8 14 6d 89 ff  call   0xff896d3d
> >   29:   90  nop
> >   2a:*  0f 0b   ud2 <-- trapping instruction
> >   2c:   90  nop
> >   2d:   90  nop
> >   2e:   e9 18 fb ff ff  jmp0xfb4b
> >   33:   e8 f5 d8 18 00  call   0x18d92d
> >   38:   e9 a2 fa ff ff  jmp0xfadf
> >   3d:   e8  .byte 0xe8
> >   3e:   0b d9   or %ecx,%ebx
> >
> > Code starting with the faulting instruction
> > ===
> >0:   0f 0b   ud2
> >2:   90  nop
> >3:   90  nop
> >4:   e9 18 fb ff ff  jmp0xfb21
> >9:   e8 f5 d8 18 00  call   0x18d903
> >e:   e9 a2 fa ff ff  jmp0xfab5
> >   13:   e8  .byte 0xe8
> >   14:   0b d9   or %ecx,%ebx
> > [  246.206640][ T4096] RSP: 0018:c9000604fbc0 EFLAGS: 00010286
> > [  246.207403][ T4096] RAX:  RBX:  RCX: 
> > 814c77da
> > [  246.208514][ T4096] RDX: 888049a58000 RSI: 814c77e7 RDI: 
> > 0001
> > [  246.209429][ T4096] RBP:  R08: 0001 R09: 
> > 
> > [  246.210362][ T4096] R10: 0001 R11: 0001 R12: 
> > 4000
> > [  246.211367][ T4096] R13: 888051bd3000 R14: dc00 R15: 
> > 888051bd3040
> > [  246.212327][ T4096] FS:  () 
> > GS:88807ec0() knlGS:
> > [  246.213387][ T4096] CS:  0010 DS:  ES:  CR0: 80050033
> > [  246.214232][ T4096] CR2: 7ffee748ec80 CR3: 0cb78000 CR4: 
> > 00750ef0
> > [  246.215216][ T4096] DR0:  DR1:  DR2: 
> > 
> > [  246.216187][ T4096] DR3:  DR6: fffe0ff0 DR7: 
> > 0400
> > [  246.217148][ T4096] PKRU: 5554
> > [  246.217615][ T4096] Call Trace:
> > [  246.218090][ T4096]  
> > [ 246.218467][ T4096] ? show_regs (arch/x86/kernel/dumpstack.c:479)
> > [ 246.218979][ T4096] ? __warn (kernel/panic.c:677)
> > [ 246.219505][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> > (discriminator 3))
> > [ 246.220197][ T4096] ? report_bug (lib/bug.c:201 lib/bug.c:219)
> > [ 246.220775][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> > (discriminator 3))
> > [ 246.221500][ T4096] ? handle_bug (arch/x86/kernel/traps.c:238)
> > [ 246.222081][ T4096] ? exc_invalid_op (arch/x86/kernel/traps.c:259 
> > (discriminator 1))
> > [ 246.222687][ T4096] ? asm_exc_invalid_op 
> > (./arch/x86/include/asm/idtentry.h:568)
> > [ 246.223296][ T4096] ? __warn_printk 
> > (./include/linux/context_tracking.h:155 kernel/panic.c:726)
> > [ 246.223878][ T4096] ? __warn_printk (kernel/panic.c:717)
> > [ 246.224460][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 
> > (discriminator 3))
> > [ 246.225125][ T4096] quota_release_workfn