Re: [PATCH v2] mm: Update mark_victim tracepoints fields
Hi, sorry I have missed this before.

On Thu 11-01-24 21:05:30, Carlos Galo wrote:
> The current implementation of the mark_victim tracepoint provides only
> the process ID (pid) of the victim process. This limitation poses
> challenges for userspace tools that need additional information
> about the OOM victim. The association between pid and the additional
> data may be lost after the kill, making it difficult for userspace to
> correlate the OOM event with the specific process.

You are correct that post OOM all per-process information is lost. On the
other hand we do dump all this information to the kernel log. Could you
explain why that is not suitable for your purpose?

> In order to mitigate this limitation, add the following fields:
>
> - UID
>   In Android each installed application has a unique UID. Including
>   the `uid` assists in correlating OOM events with specific apps.
>
> - Process Name (comm)
>   Enables identification of the affected process.
>
> - OOM Score
>   Allows userspace to get additional insights of the relative kill
>   priority of the OOM victim.

What is the oom score useful for? Is there any reason to provide
different information from the one reported to the kernel log?

__oom_kill_process:
	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, UID:%u pgtables:%lukB oom_score_adj:%hd\n",
		message, task_pid_nr(victim), victim->comm, K(mm->total_vm),
		K(get_mm_counter(mm, MM_ANONPAGES)),
		K(get_mm_counter(mm, MM_FILEPAGES)),
		K(get_mm_counter(mm, MM_SHMEMPAGES)),
		from_kuid(&init_user_ns, task_uid(victim)),
		mm_pgtables_bytes(mm) >> 10, victim->signal->oom_score_adj);

-- 
Michal Hocko
SUSE Labs
Re: [PATCH -next v6 0/2] Make memory reclamation measurable
On Wed 21-02-24 11:00:53, Bixuan Cui wrote:
> On 2024/2/21 10:22, Steven Rostedt wrote:
> > It's up to the memory management folks to decide on this. -- Steve
> Noted with thanks.

It would be really helpful to have more details on why we need those
trace points. It is my understanding that you would like to have more
fine-grained numbers for the time duration of different parts of the
reclaim process. I can imagine this could be useful in some cases, but is
it useful enough, and for a wide enough variety of workloads, to be worth
dedicated static tracepoints? Why is an ad-hoc dynamic tracepoint or BPF
for a very special situation not sufficient?

In other words, tell us more about the usecases and why this is generally
useful. Thanks!
-- 
Michal Hocko
SUSE Labs
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
[...]

> Here the @nr_to_scan is reduced by the number of pages that are
> isolated, but not actually reclaimed (which is reflected by @cnt).
>
> IIUC, looks you want to make this function do "each cycle" as what you
> mentioned in the v8 [1]:
>
> > I tested with that approach and found we can only target number of
> > pages attempted to reclaim, not pages actually reclaimed, due to the
> > uncertainty of how long it takes to reclaim pages. Besides, targeting
> > number of scanned pages for each cycle is also what the ksgxd does.
> >
> > If we target actual number of pages, sometimes it just takes too long.
> > I saw more timeouts with the default time limit when running parallel
> > selftests.
>
> I am not sure what "sometimes it just takes too long" means, but what I
> am thinking is you are trying to do some perfect yet complicated code
> here.

I think what I observed was that the try_charge() would block too long
before getting a chance to schedule() and yield, causing more timeouts
than necessary. I'll do some re-test to be sure.

> For instance, I don't think the selftests reflect the real workload, and
> I believe adjusting the limit of a given EPC cgroup shouldn't be a
> frequent operation, thus it is acceptable to use some easy-to-maintain
> but less perfect code.
>
> Here I still think having @nr_to_scan as a pointer is over-complicated.
> For example, we can still let sgx_reclaim_pages() always scan
> SGX_NR_TO_SCAN pages, but give up when there's enough pages reclaimed or
> when the EPC cgroup and its descendants have been looped:
>
> unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root)
> {
> 	unsigned int cnt = 0;
> 	...
>
> 	css_for_each_descendant_pre(pos, css_root) {
> 		...
> 		epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos));
> 		cnt += sgx_reclaim_pages(&epc_cg->lru);
>
> 		if (cnt >= SGX_NR_TO_SCAN)
> 			break;
> 	}
>
> 	...
> 	return cnt;
> }
>
> Yeah it may reclaim more than SGX_NR_TO_SCAN when the loop actually
> reaches any descendants, but that should be rare and we don't care that
> much, do we?
I assume you meant @cnt here to be the number of pages actually
reclaimed. This could cause sgx_epc_cgroup_reclaim_pages() to block too
long, as @cnt may always be zero (all pages are too young) and you have
to loop over all descendants.

Thanks
Haitao
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
Hi Kai,

On Tue, 20 Feb 2024 03:52:39 -0600, Huang, Kai wrote:

[...]

> So you introduced the work/workqueue here but there's no place which
> actually queues the work. IMHO you can either:
>
> 1) move relevant code change here; or
> 2) focus on introducing core functions to reclaim certain pages from a
> given EPC cgroup w/o workqueue and introduce the work/workqueue in a
> later patch.
>
> Makes sense?

Starting in v7, I was trying to split the big patch, #10 in v6, as you
and others suggested. My thought process was to put the infrastructure
needed for per-cgroup reclaim in the front, then turn on per-cgroup
reclaim in [v9 13/15] at the end.

Before that, all reclaimables are tracked in the global LRU, so really
there is no "reclaim certain pages from a given EPC cgroup w/o
workqueue" or reclaim through the workqueue before that point, as
suggested in #2. This patch puts down the implementation for both flows,
but neither is used yet, as stated in the commit message.

#1 would force me to go back and merge the patches again.

Sorry, I feel kind of lost on this whole thing by now. It seems so
random to me. Are there hard rules on this?

I was hoping these statements would help reviewers with the flow of the
patches.

At the end of [v9 04/15]:

	For now, the EPC cgroup simply blocks additional EPC allocation in
	sgx_alloc_epc_page() when the limit is reached. Reclaimable pages
	are still tracked in the global active list, only reclaimed by the
	global reclaimer when the total free page count is lower than a
	threshold.

	Later patches will reorganize the tracking and reclamation code in
	the global reclaimer and implement per-cgroup tracking and
	reclaiming.

At the end of [v9 06/15]:

	Next patches will first get the cgroup reclamation flow ready while
	keeping pages tracked in the global LRU and reclaimed by ksgxd
	before we make the switch in the end for sgx_lru_list() to return
	per-cgroup LRU.
At the end of [v9 08/15]:

	Both synchronous and asynchronous flows invoke the same top level
	reclaim function, and will be triggered later by
	sgx_epc_cgroup_try_charge() when usage of the cgroup is at or near
	its limit.

At the end of [v9 10/15]:

	Note at this point, all reclaimable EPC pages are still tracked in
	the global LRU and per-cgroup LRUs are empty. So no per-cgroup
	reclamation is activated yet.

Thanks
Haitao
Re: [PATCH] bus: mhi: host: Change the trace string for the userspace tools mapping
On Sun, Feb 18, 2024 at 02:13:39PM +0530, Krishna chaitanya chundru wrote:
> User space tools can't map strings if we use them directly, as the string
> address is internal to the kernel.
>
> So add trace point strings for the user space tools to map strings
> properly.
>
> Signed-off-by: Krishna chaitanya chundru
> Reported-by: Steven Rostedt

Reviewed-by: Manivannan Sadhasivam

- Mani

> ---
> drivers/bus/mhi/host/main.c  | 4 ++--
> drivers/bus/mhi/host/trace.h | 2 ++
> 2 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/bus/mhi/host/main.c b/drivers/bus/mhi/host/main.c
> index 2d38f6005da6..15d657af9b5b 100644
> --- a/drivers/bus/mhi/host/main.c
> +++ b/drivers/bus/mhi/host/main.c
> @@ -1340,7 +1340,7 @@ static int mhi_update_channel_state(struct mhi_controller *mhi_cntrl,
>  	enum mhi_cmd_type cmd = MHI_CMD_NOP;
>  	int ret;
>  
> -	trace_mhi_channel_command_start(mhi_cntrl, mhi_chan, to_state, "Updating");
> +	trace_mhi_channel_command_start(mhi_cntrl, mhi_chan, to_state, TPS("Updating"));
>  	switch (to_state) {
>  	case MHI_CH_STATE_TYPE_RESET:
>  		write_lock_irq(&mhi_chan->lock);
> @@ -1407,7 +1407,7 @@ static int mhi_update_channel_state(struct mhi_controller *mhi_cntrl,
>  		write_unlock_irq(&mhi_chan->lock);
>  	}
>  
> -	trace_mhi_channel_command_end(mhi_cntrl, mhi_chan, to_state, "Updated");
> +	trace_mhi_channel_command_end(mhi_cntrl, mhi_chan, to_state, TPS("Updated"));
>  exit_channel_update:
>  	mhi_cntrl->runtime_put(mhi_cntrl);
>  	mhi_device_put(mhi_cntrl->mhi_dev);
> diff --git a/drivers/bus/mhi/host/trace.h b/drivers/bus/mhi/host/trace.h
> index d12a98d44272..368515dcb22d 100644
> --- a/drivers/bus/mhi/host/trace.h
> +++ b/drivers/bus/mhi/host/trace.h
> @@ -84,6 +84,8 @@ DEV_ST_TRANSITION_LIST
>  #define dev_st_trans(a, b)	{ DEV_ST_TRANSITION_##a, b },
>  #define dev_st_trans_end(a, b)	{ DEV_ST_TRANSITION_##a, b }
>  
> +#define TPS(x)	tracepoint_string(x)
> +
>  TRACE_EVENT(mhi_gen_tre,
>  
>  	TP_PROTO(struct mhi_controller *mhi_cntrl, struct mhi_chan *mhi_chan,
> ---
> base-commit: ceeb64f41fe6a1eb9fc56d583983a81f8f3dd058
> change-id: 20240218-ftrace_string-7677762aa63c
>
> Best regards,
> --
> Krishna chaitanya chundru

-- 
மணிவண்ணன் சதாசிவம் (Manivannan Sadhasivam)
Re: [PATCH v3] modules: wait do_free_init correctly
On Sun, Feb 18, 2024 at 01:21:53PM -0800, Andrew Morton wrote:
> On Sat, 17 Feb 2024 16:18:10 +0800 Changbin Du wrote:
> > The synchronization here is just to ensure the module init's been freed
> > before doing W+X checking. But commit 1a7b7d922081 ("modules: Use
> > vmalloc special flag") moved do_free_init() into a global workqueue
> > instead of call_rcu(). So now rcu_barrier() cannot ensure that
> > do_free_init has completed. We should wait for it via flush_work().
> >
> > Without this fix, we could still encounter false positive reports in
> > W+X checking, and the rcu synchronization is unnecessary and can
> > introduce a significant delay.
> >
> > Eric Chanudet reports that the rcu_barrier introduces ~0.1s delay on a
> > PREEMPT_RT kernel.
> >   [    0.291444] Freeing unused kernel memory: 5568K
> >   [    0.402442] Run /sbin/init as init process
> >
> > With this fix, the above delay can be eliminated.
>
> Thanks, I'll queue this as a delta, to be folded into the base patch
> prior to upstreaming.
>
> I added a Tested-by: Eric, if that's OK by him?

Absolutely, I should have put it in my initial reply. Adding here as
confirmation:

Tested-by: Eric Chanudet

Thanks,
-- 
Eric Chanudet
Re: [PATCH -next v6 0/2] Make memory reclamation measurable
On 2024/2/21 10:22, Steven Rostedt wrote:
> It's up to the memory management folks to decide on this. -- Steve

Noted with thanks.

Bixuan Cui
Re: [PATCH -next v6 0/2] Make memory reclamation measurable
On Wed, 21 Feb 2024 09:44:32 +0800 Bixuan Cui wrote:

> ping~

It's up to the memory management folks to decide on this.

-- Steve

> On 2024/1/5 9:36, Bixuan Cui wrote:
> > When the system memory is low, kswapd reclaims the memory. The key steps
> > of memory reclamation include
> > 1. shrink_lruvec
> >    * shrink_active_list, moves folios from the active LRU to the
> >      inactive LRU
> >    * shrink_inactive_list, shrinks the lru from the inactive LRU list
> > 2. shrink_slab
> >    * shrinker->count_objects(), calculates the freeable memory
> >    * shrinker->scan_objects(), reclaims the slab memory
> >
> > The existing tracers in the vmscan are as follows:
> >
> > --do_try_to_free_pages
> >   --shrink_zones
> >     --trace_mm_vmscan_node_reclaim_begin (tracer)
> >     --shrink_node
> >       --shrink_node_memcgs
> >         --trace_mm_vmscan_memcg_shrink_begin (tracer)
> >         --shrink_lruvec
> >           --shrink_list
> >             --shrink_active_list
> >               --trace_mm_vmscan_lru_shrink_active (tracer)
> >             --shrink_inactive_list
> >               --trace_mm_vmscan_lru_shrink_inactive (tracer)
> >               --shrink_active_list
> >         --shrink_slab
> >           --do_shrink_slab
> >             --shrinker->count_objects()
> >             --trace_mm_shrink_slab_start (tracer)
> >             --shrinker->scan_objects()
> >             --trace_mm_shrink_slab_end (tracer)
> >         --trace_mm_vmscan_memcg_shrink_end (tracer)
> >     --trace_mm_vmscan_node_reclaim_end (tracer)
> >
> > If we get the duration and quantity of shrink lru and slab,
> > then we can measure the memory recycling, as follows
> >
> > Measuring memory reclamation with bpf:
> >   LRU FILE:
> >     CPU COMM    ShrinkActive(us) ShrinkInactive(us) Reclaim(page)
> >     7   kswapd0 26               51                 32
> >     7   kswapd0 52               47                 13
> >   SLAB:
> >     CPU COMM    OBJ_NAME                Count_Dur(us) Freeable(page) Scan_Dur(us) Reclaim(page)
> >     1   kswapd0 super_cache_scan.cfi_jt 2             341            3225         128
> >     7   kswapd0 super_cache_scan.cfi_jt 0             2247           8524         1024
> >     7   kswapd0 super_cache_scan.cfi_jt 2             3670           0            0
> >
> > For this, add the new tracer to shrink_active_list/shrink_inactive_list
> > and shrinker->count_objects().
> >
> > Changes:
> > v6: * Add Reviewed-by from Steven Rostedt.
> > v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
> >       replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
> >     * Add the explanation for adding new shrink lru events into 'mm:
> >       vmscan: add new event to trace shrink lru'
> > v4: Add Reviewed-by and Changelog to every patch.
> > v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in
> >     the trace event.
> > v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at
> >     the same time to fix the build error.
> >
> > cuibixuan (2):
> >   mm: shrinker: add new event to trace shrink count
> >   mm: vmscan: add new event to trace shrink lru
> >
> >  include/trace/events/vmscan.h | 80 ++-
> >  mm/shrinker.c                 |  4 ++
> >  mm/vmscan.c                   | 11 +++--
> >  3 files changed, 90 insertions(+), 5 deletions(-)
Re: [PATCH -next v6 0/2] Make memory reclamation measurable
ping~

On 2024/1/5 9:36, Bixuan Cui wrote:
> When the system memory is low, kswapd reclaims the memory. The key steps
> of memory reclamation include
> 1. shrink_lruvec
>    * shrink_active_list, moves folios from the active LRU to the
>      inactive LRU
>    * shrink_inactive_list, shrinks the lru from the inactive LRU list
> 2. shrink_slab
>    * shrinker->count_objects(), calculates the freeable memory
>    * shrinker->scan_objects(), reclaims the slab memory
>
> The existing tracers in the vmscan are as follows:
>
> --do_try_to_free_pages
>   --shrink_zones
>     --trace_mm_vmscan_node_reclaim_begin (tracer)
>     --shrink_node
>       --shrink_node_memcgs
>         --trace_mm_vmscan_memcg_shrink_begin (tracer)
>         --shrink_lruvec
>           --shrink_list
>             --shrink_active_list
>               --trace_mm_vmscan_lru_shrink_active (tracer)
>             --shrink_inactive_list
>               --trace_mm_vmscan_lru_shrink_inactive (tracer)
>               --shrink_active_list
>         --shrink_slab
>           --do_shrink_slab
>             --shrinker->count_objects()
>             --trace_mm_shrink_slab_start (tracer)
>             --shrinker->scan_objects()
>             --trace_mm_shrink_slab_end (tracer)
>         --trace_mm_vmscan_memcg_shrink_end (tracer)
>     --trace_mm_vmscan_node_reclaim_end (tracer)
>
> If we get the duration and quantity of shrink lru and slab,
> then we can measure the memory recycling, as follows
>
> Measuring memory reclamation with bpf:
>   LRU FILE:
>     CPU COMM    ShrinkActive(us) ShrinkInactive(us) Reclaim(page)
>     7   kswapd0 26               51                 32
>     7   kswapd0 52               47                 13
>   SLAB:
>     CPU COMM    OBJ_NAME                Count_Dur(us) Freeable(page) Scan_Dur(us) Reclaim(page)
>     1   kswapd0 super_cache_scan.cfi_jt 2             341            3225         128
>     7   kswapd0 super_cache_scan.cfi_jt 0             2247           8524         1024
>     7   kswapd0 super_cache_scan.cfi_jt 2             3670           0            0
>
> For this, add the new tracer to shrink_active_list/shrink_inactive_list
> and shrinker->count_objects().
>
> Changes:
> v6: * Add Reviewed-by from Steven Rostedt.
> v5: * Use 'DECLARE_EVENT_CLASS(mm_vmscan_lru_shrink_start_template' to
>       replace 'TRACE_EVENT(mm_vmscan_lru_shrink_inactive/active_start'
>     * Add the explanation for adding new shrink lru events into 'mm:
>       vmscan: add new event to trace shrink lru'
> v4: Add Reviewed-by and Changelog to every patch.
> v3: Swap the positions of 'nid' and 'freeable' to prevent the hole in
>     the trace event.
> v2: Modify trace_mm_vmscan_lru_shrink_inactive() in evict_folios() at
>     the same time to fix the build error.
>
> cuibixuan (2):
>   mm: shrinker: add new event to trace shrink count
>   mm: vmscan: add new event to trace shrink lru
>
>  include/trace/events/vmscan.h | 80 ++-
>  mm/shrinker.c                 |  4 ++
>  mm/vmscan.c                   | 11 +++--
>  3 files changed, 90 insertions(+), 5 deletions(-)
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On Tue, 20 Feb 2024 10:40:23 -0500 Steven Rostedt wrote:

> > Try resetting the info->add_timestamp flags to add_ts_default on goto
> > again within __rb_reserve_next().
>
> I was looking at that too, but I don't know how it will make a difference.
>
> Note, the test that fails is in my test suite, and takes about a half hour
> to get there. Running that suite takes up resources (it's my main test
> suite for all changes). I'm currently testing other patches so I either
> need to figure it out through inspection, or this will need to wait a while.

I did a bit of debugging and it's just getting weird.

It triggers when I run with the "perf" clock, which is the local_clock()
(and this has CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y, which does some strange
paths for local_clock()).

It appears that on function_graph tracing with the perf clock, it fails
the local_cmpxchg a lot! And for some reason, this is causing it to fail
the overflow more often. Well, if it pushes the header page forward too,
it will also cause that to go back and add to the counter.

I tried moving the "again:" to the beginning of the function (after
resetting the add_timestamp flags and length) so that it gets a new
tail_page each time, but that doesn't appear to make a difference either.

It looks like that retry loop in some configs causes it to fail allocation
on the subbuffer over 1000 times!

Looks like this changes expectations a bit, and will need more design
changes to make it work.

-- Steve
[PATCH v18 5/6] Documentation: tracing: Add ring-buffer mapping
It is now possible to mmap() a ring-buffer to stream its content. Add some
documentation and a code example.

Signed-off-by: Vincent Donnefort

diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 5092d6c13af5..0b300901fd75 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -29,6 +29,7 @@ Linux Tracing Technologies
    timerlat-tracer
    intel_th
    ring-buffer-design
+   ring-buffer-map
    stm
    sys-t
    coresight/index
diff --git a/Documentation/trace/ring-buffer-map.rst b/Documentation/trace/ring-buffer-map.rst
new file mode 100644
index ..0426ab4bcf3d
--- /dev/null
+++ b/Documentation/trace/ring-buffer-map.rst
@@ -0,0 +1,106 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+Tracefs ring-buffer memory mapping
+==================================
+
+:Author: Vincent Donnefort
+
+Overview
+========
+
+Tracefs ring-buffer memory map provides an efficient method to stream data
+as no memory copy is necessary. The application mapping the ring-buffer becomes
+then a consumer for that ring-buffer, in a similar fashion to trace_pipe.
+
+Memory mapping setup
+====================
+
+The mapping works with a mmap() of the trace_pipe_raw interface.
+
+The first system page of the mapping contains ring-buffer statistics and
+description. It is referred to as the meta-page. One of the most important
+fields of the meta-page is the reader. It contains the sub-buffer ID which can
+be safely read by the mapper (see ring-buffer-design.rst).
+
+The meta-page is followed by all the sub-buffers, ordered by ascending ID. It
+is therefore effortless to know where the reader starts in the mapping:
+
+.. code-block:: c
+
+        reader_id = meta->reader->id;
+        reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;
+
+When the application is done with the current reader, it can get a new one
+using the trace_pipe_raw ioctl() TRACE_MMAP_IOCTL_GET_READER. This ioctl also
+updates the meta-page fields.
+
+Limitations
+===========
+
+When a mapping is in place on a Tracefs ring-buffer, it is not possible to
+resize it (either by increasing the entire size of the ring-buffer or each
+subbuf). It is also not possible to use snapshot, and it causes splice to copy
+the ring buffer data instead of using the copyless swap from the ring buffer.
+
+Concurrent readers (either another application mapping that ring-buffer or the
+kernel with trace_pipe) are allowed but not recommended. They will compete for
+the ring-buffer and the output is unpredictable, just like concurrent readers
+on trace_pipe would be.
+
+Example
+=======
+
+.. code-block:: c
+
+        #include <fcntl.h>
+        #include <stdio.h>
+        #include <stdlib.h>
+        #include <unistd.h>
+
+        #include <linux/trace_mmap.h>
+
+        #include <sys/mman.h>
+        #include <sys/ioctl.h>
+
+        #define TRACE_PIPE_RAW "/sys/kernel/tracing/per_cpu/cpu0/trace_pipe_raw"
+
+        int main(void)
+        {
+                int page_size = getpagesize(), fd, reader_id;
+                unsigned long meta_len, data_len;
+                struct trace_buffer_meta *meta;
+                void *map, *reader, *data;
+
+                fd = open(TRACE_PIPE_RAW, O_RDONLY | O_NONBLOCK);
+                if (fd < 0)
+                        exit(EXIT_FAILURE);
+
+                map = mmap(NULL, page_size, PROT_READ, MAP_SHARED, fd, 0);
+                if (map == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                meta = (struct trace_buffer_meta *)map;
+                meta_len = meta->meta_page_size;
+
+                printf("entries:        %llu\n", meta->entries);
+                printf("overrun:        %llu\n", meta->overrun);
+                printf("read:           %llu\n", meta->read);
+                printf("nr_subbufs:     %u\n", meta->nr_subbufs);
+
+                data_len = meta->subbuf_size * meta->nr_subbufs;
+                data = mmap(NULL, data_len, PROT_READ, MAP_SHARED, fd, meta_len);
+                if (data == MAP_FAILED)
+                        exit(EXIT_FAILURE);
+
+                if (ioctl(fd, TRACE_MMAP_IOCTL_GET_READER) < 0)
+                        exit(EXIT_FAILURE);
+
+                reader_id = meta->reader.id;
+                reader = data + meta->subbuf_size * reader_id;
+
+                printf("Current reader address: %p\n", reader);
+
+                munmap(data, data_len);
+                munmap(meta, meta_len);
+                close(fd);
+
+                return 0;
+        }
-- 
2.44.0.rc0.258.g7320e95886-goog
[PATCH v18 3/6] tracing: Add snapshot refcount
When a ring-buffer is memory mapped by user-space, no trace or
ring-buffer swap is possible. This means the snapshot feature is
mutually exclusive with the memory mapping. Having a refcount on
snapshot users will help to know if a mapping is possible or not.

Instead of relying on the global trace_types_lock, a new spinlock is
introduced to serialize accesses to trace_array->snapshot. This intends
to allow access to that variable in a context where the mmap lock is
already held.

Signed-off-by: Vincent Donnefort

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8198bfc54b58..8ae7c2cb63a0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1301,6 +1301,50 @@ static void free_snapshot(struct trace_array *tr)
 	tr->allocated_snapshot = false;
 }
 
+static int tracing_arm_snapshot_locked(struct trace_array *tr)
+{
+	int ret;
+
+	lockdep_assert_held(&trace_types_lock);
+
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (tr->snapshot == UINT_MAX) {
+		spin_unlock(&tr->snapshot_trigger_lock);
+		return -EBUSY;
+	}
+
+	tr->snapshot++;
+	spin_unlock(&tr->snapshot_trigger_lock);
+
+	ret = tracing_alloc_snapshot_instance(tr);
+	if (ret) {
+		spin_lock(&tr->snapshot_trigger_lock);
+		tr->snapshot--;
+		spin_unlock(&tr->snapshot_trigger_lock);
+	}
+
+	return ret;
+}
+
+int tracing_arm_snapshot(struct trace_array *tr)
+{
+	int ret;
+
+	mutex_lock(&trace_types_lock);
+	ret = tracing_arm_snapshot_locked(tr);
+	mutex_unlock(&trace_types_lock);
+
+	return ret;
+}
+
+void tracing_disarm_snapshot(struct trace_array *tr)
+{
+	spin_lock(&tr->snapshot_trigger_lock);
+	if (!WARN_ON(!tr->snapshot))
+		tr->snapshot--;
+	spin_unlock(&tr->snapshot_trigger_lock);
+}
+
 /**
  * tracing_alloc_snapshot - allocate snapshot buffer.
 *
@@ -1374,10 +1418,6 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data,
 
 	mutex_lock(&trace_types_lock);
 
-	ret = tracing_alloc_snapshot_instance(tr);
-	if (ret)
-		goto fail_unlock;
-
 	if (tr->current_trace->use_max_tr) {
 		ret = -EBUSY;
 		goto fail_unlock;
@@ -1396,6 +1436,10 @@ int tracing_snapshot_cond_enable(struct trace_array *tr, void *cond_data,
 		goto fail_unlock;
 	}
 
+	ret = tracing_arm_snapshot_locked(tr);
+	if (ret)
+		goto fail_unlock;
+
 	local_irq_disable();
 	arch_spin_lock(&tr->max_lock);
 	tr->cond_snapshot = cond_snapshot;
@@ -1440,6 +1484,8 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
 	arch_spin_unlock(&tr->max_lock);
 	local_irq_enable();
 
+	tracing_disarm_snapshot(tr);
+
 	return ret;
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
@@ -1482,6 +1528,7 @@ int tracing_snapshot_cond_disable(struct trace_array *tr)
 }
 EXPORT_SYMBOL_GPL(tracing_snapshot_cond_disable);
 #define free_snapshot(tr)	do { } while (0)
+#define tracing_arm_snapshot_locked(tr) ({ -EBUSY; })
 #endif /* CONFIG_TRACER_SNAPSHOT */
 
 void tracer_tracing_off(struct trace_array *tr)
@@ -6595,11 +6642,12 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
 		 */
 		synchronize_rcu();
 		free_snapshot(tr);
+		tracing_disarm_snapshot(tr);
 	}
 
-	if (t->use_max_tr && !tr->allocated_snapshot) {
-		ret = tracing_alloc_snapshot_instance(tr);
-		if (ret < 0)
+	if (t->use_max_tr) {
+		ret = tracing_arm_snapshot_locked(tr);
+		if (ret)
 			goto out;
 	}
 #else
@@ -6608,8 +6656,13 @@ int tracing_set_tracer(struct trace_array *tr, const char *buf)
 
 	if (t->init) {
 		ret = tracer_init(t, tr);
-		if (ret)
+		if (ret) {
+#ifdef CONFIG_TRACER_MAX_TRACE
+			if (t->use_max_tr)
+				tracing_disarm_snapshot(tr);
+#endif
 			goto out;
+		}
 	}
 
 	tr->current_trace = t;
@@ -7711,10 +7764,11 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
 		if (tr->allocated_snapshot)
 			ret = resize_buffer_duplicate_size(&tr->max_buffer,
 					&tr->array_buffer, iter->cpu_file);
-		else
-			ret = tracing_alloc_snapshot_instance(tr);
-
-		if (ret < 0)
+
+		ret = tracing_arm_snapshot_locked(tr);
+		if (ret)
 			break;
+
 		/* Now, we're going to swap */
 		if (iter->cpu_file == RING_BUFFER_ALL_CPUS) {
 			local_irq_disable();
@@ -7724,6 +7778,7 @@ tracing_snapshot_write(struct file *filp, const char __user *ubuf, size_t cnt,
[PATCH v18 4/6] tracing: Allow user-space mapping of the ring-buffer
Currently, user-space extracts data from the ring-buffer via splice,
which is handy for storage or network sharing. However, due to splice
limitations, it is impossible to do real-time analysis without a copy.

A solution for that problem is to let the user-space map the ring-buffer
directly.

The mapping is exposed via the per-CPU file trace_pipe_raw. The first
element of the mapping is the meta-page. It is followed by each
subbuffer constituting the ring-buffer, ordered by their unique page ID:

  * Meta-page -- include/uapi/linux/trace_mmap.h for a description
  * Subbuf ID 0
  * Subbuf ID 1
  ...

It is therefore easy to translate a subbuf ID into an offset in the
mapping:

  reader_id = meta->reader->id;
  reader_offset = meta->meta_page_size + reader_id * meta->subbuf_size;

When new data is available, the mapper must call a newly introduced
ioctl: TRACE_MMAP_IOCTL_GET_READER. This will update the Meta-page
reader ID to point to the next reader containing unread data.

Mapping will prevent snapshot and buffer size modifications.
Signed-off-by: Vincent Donnefort

diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h
index ffcd8dfcaa4f..d25b9d504a7c 100644
--- a/include/uapi/linux/trace_mmap.h
+++ b/include/uapi/linux/trace_mmap.h
@@ -43,4 +43,6 @@ struct trace_buffer_meta {
 	__u64	Reserved2;
 };
 
+#define TRACE_MMAP_IOCTL_GET_READER		_IO('T', 0x1)
+
 #endif /* _TRACE_MMAP_H_ */
diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 8ae7c2cb63a0..67ce7b367edb 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -1176,6 +1176,12 @@ static void tracing_snapshot_instance_cond(struct trace_array *tr,
 		return;
 	}
 
+	if (tr->mapped) {
+		trace_array_puts(tr, "*** BUFFER MEMORY MAPPED ***\n");
+		trace_array_puts(tr, "*** Can not use snapshot (sorry) ***\n");
+		return;
+	}
+
 	local_irq_save(flags);
 	update_max_tr(tr, current, smp_processor_id(), cond_data);
 	local_irq_restore(flags);
@@ -1308,7 +1314,7 @@ static int tracing_arm_snapshot_locked(struct trace_array *tr)
 	lockdep_assert_held(&trace_types_lock);
 
 	spin_lock(&tr->snapshot_trigger_lock);
-	if (tr->snapshot == UINT_MAX) {
+	if (tr->snapshot == UINT_MAX || tr->mapped) {
 		spin_unlock(&tr->snapshot_trigger_lock);
 		return -EBUSY;
 	}
@@ -6535,7 +6541,7 @@ static void tracing_set_nop(struct trace_array *tr)
 {
 	if (tr->current_trace == &nop_trace)
 		return;
-	
+
 	tr->current_trace->enabled--;
 
 	if (tr->current_trace->reset)
@@ -8654,15 +8660,31 @@ tracing_buffers_splice_read(struct file *file, loff_t *ppos,
 	return ret;
 }
 
-/* An ioctl call with cmd 0 to the ring buffer file will wake up all waiters */
 static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
 {
 	struct ftrace_buffer_info *info = file->private_data;
 	struct trace_iterator *iter = &info->iter;
+	int err;
 
-	if (cmd)
-		return -ENOIOCTLCMD;
+	if (cmd == TRACE_MMAP_IOCTL_GET_READER) {
+		if (!(file->f_flags & O_NONBLOCK)) {
+			err = ring_buffer_wait(iter->array_buffer->buffer,
+					       iter->cpu_file,
+					       iter->tr->buffer_percent);
+			if (err)
+				return err;
+		}
+
+		return ring_buffer_map_get_reader(iter->array_buffer->buffer,
+						  iter->cpu_file);
+	} else if (cmd) {
+		return -ENOTTY;
+	}
+
+	/*
+	 * An ioctl call with cmd 0 to the ring buffer file will wake up all
+	 * waiters
+	 */
 
 	mutex_lock(&trace_types_lock);
 
 	iter->wait_index++;
@@ -8675,6 +8697,110 @@ static long tracing_buffers_ioctl(struct file *file, unsigned int cmd, unsigned
 	return 0;
 }
 
+static vm_fault_t tracing_buffers_mmap_fault(struct vm_fault *vmf)
+{
+	struct ftrace_buffer_info *info = vmf->vma->vm_file->private_data;
+	struct trace_iterator *iter = &info->iter;
+	vm_fault_t ret = VM_FAULT_SIGBUS;
+	struct page *page;
+
+	page = ring_buffer_map_fault(iter->array_buffer->buffer, iter->cpu_file,
+				     vmf->pgoff);
+	if (!page)
+		return ret;
+
+	get_page(page);
+	vmf->page = page;
+	vmf->page->mapping = vmf->vma->vm_file->f_mapping;
+	vmf->page->index = vmf->pgoff;
+
+	return 0;
+}
+
+#ifdef CONFIG_TRACER_MAX_TRACE
+static int get_snapshot_map(struct trace_array *tr)
+{
+	int err = 0;
+
+	/*
+	 * Called with mmap_lock held. lockdep would be unhappy if we would now
+	 * take trace_types_lock. Instead use the specific
+	 * snapshot_trigger_lock.
+	 */
+
[PATCH v18 2/6] ring-buffer: Introducing ring-buffer mapping functions
In preparation for allowing the user-space to map a ring-buffer, add a set of mapping functions: ring_buffer_{map,unmap}() ring_buffer_map_fault() And controls on the ring-buffer: ring_buffer_map_get_reader() /* swap reader and head */ Mapping the ring-buffer also involves: A unique ID for each subbuf of the ring-buffer, currently they are only identified through their in-kernel VA. A meta-page, where are stored ring-buffer statistics and a description for the current reader The linear mapping exposes the meta-page, and each subbuf of the ring-buffer, ordered following their unique ID, assigned during the first mapping. Once mapped, no subbuf can get in or out of the ring-buffer: the buffer size will remain unmodified and the splice enabling functions will in reality simply memcpy the data instead of swapping subbufs. Signed-off-by: Vincent Donnefort diff --git a/include/linux/ring_buffer.h b/include/linux/ring_buffer.h index fa802db216f9..0841ba8bab14 100644 --- a/include/linux/ring_buffer.h +++ b/include/linux/ring_buffer.h @@ -6,6 +6,8 @@ #include #include +#include + struct trace_buffer; struct ring_buffer_iter; @@ -221,4 +223,9 @@ int trace_rb_cpu_prepare(unsigned int cpu, struct hlist_node *node); #define trace_rb_cpu_prepare NULL #endif +int ring_buffer_map(struct trace_buffer *buffer, int cpu); +int ring_buffer_unmap(struct trace_buffer *buffer, int cpu); +struct page *ring_buffer_map_fault(struct trace_buffer *buffer, int cpu, + unsigned long pgoff); +int ring_buffer_map_get_reader(struct trace_buffer *buffer, int cpu); #endif /* _LINUX_RING_BUFFER_H */ diff --git a/include/uapi/linux/trace_mmap.h b/include/uapi/linux/trace_mmap.h new file mode 100644 index ..ffcd8dfcaa4f --- /dev/null +++ b/include/uapi/linux/trace_mmap.h @@ -0,0 +1,46 @@ +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */ +#ifndef _TRACE_MMAP_H_ +#define _TRACE_MMAP_H_ + +#include + +/** + * struct trace_buffer_meta - Ring-buffer Meta-page description + * @meta_page_size:Size 
of this meta-page. + * @meta_struct_len: Size of this structure. + * @subbuf_size: Size of each sub-buffer. + * @nr_subbufs:Number of subbfs in the ring-buffer, including the reader. + * @reader.lost_events:Number of events lost at the time of the reader swap. + * @reader.id: subbuf ID of the current reader. ID range [0 : @nr_subbufs - 1] + * @reader.read: Number of bytes read on the reader subbuf. + * @flags: Placeholder for now, 0 until new features are supported. + * @entries: Number of entries in the ring-buffer. + * @overrun: Number of entries lost in the ring-buffer. + * @read: Number of entries that have been read. + * @Reserved1: Reserved for future use. + * @Reserved2: Reserved for future use. + */ +struct trace_buffer_meta { + __u32 meta_page_size; + __u32 meta_struct_len; + + __u32 subbuf_size; + __u32 nr_subbufs; + + struct { + __u64 lost_events; + __u32 id; + __u32 read; + } reader; + + __u64 flags; + + __u64 entries; + __u64 overrun; + __u64 read; + + __u64 Reserved1; + __u64 Reserved2; +}; + +#endif /* _TRACE_MMAP_H_ */ diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c index ca796675c0a1..1d7d7a701867 100644 --- a/kernel/trace/ring_buffer.c +++ b/kernel/trace/ring_buffer.c @@ -9,6 +9,7 @@ #include #include #include +#include #include #include #include @@ -338,6 +339,7 @@ struct buffer_page { local_t entries; /* entries on this page */ unsigned longreal_end; /* real end of data */ unsigned order; /* order of the page */ + u32 id;/* ID for external mapping */ struct buffer_data_page *page; /* Actual data page */ }; @@ -484,6 +486,12 @@ struct ring_buffer_per_cpu { u64 read_stamp; /* pages removed since last reset */ unsigned long pages_removed; + + unsigned intmapped; + struct mutexmapping_lock; + unsigned long *subbuf_ids;/* ID to subbuf addr */ + struct trace_buffer_meta*meta_page; + /* ring buffer pages to update, > 0 to add, < 0 to remove */ longnr_pages_to_update; struct list_headnew_pages; /* new pages to add */ @@ -1548,6 
+1556,7 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.waiters);
 	init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+
[PATCH v18 1/6] ring-buffer: Zero ring-buffer sub-buffers
In preparation for the ring-buffer memory mapping where each subbuf will be accessible to user-space, zero all the page allocations.

Signed-off-by: Vincent Donnefort
Reviewed-by: Masami Hiramatsu (Google)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index fd4bfe3ecf01..ca796675c0a1 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1472,7 +1472,8 @@ static int __rb_allocate_pages(struct ring_buffer_per_cpu *cpu_buffer,
 		list_add(&bpage->list, pages);
-		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu), mflags,
+		page = alloc_pages_node(cpu_to_node(cpu_buffer->cpu),
+					mflags | __GFP_ZERO,
 					cpu_buffer->buffer->subbuf_order);
 		if (!page)
 			goto free_pages;
@@ -1557,7 +1558,8 @@ rb_allocate_cpu_buffer(struct trace_buffer *buffer, long nr_pages, int cpu)
 	cpu_buffer->reader_page = bpage;
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL, cpu_buffer->buffer->subbuf_order);
+	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_ZERO,
+				cpu_buffer->buffer->subbuf_order);
 	if (!page)
 		goto fail_free_reader;
 	bpage->page = page_address(page);
@@ -5525,7 +5527,8 @@ ring_buffer_alloc_read_page(struct trace_buffer *buffer, int cpu)
 	if (bpage->data)
 		goto out;
-	page = alloc_pages_node(cpu_to_node(cpu), GFP_KERNEL | __GFP_NORETRY,
+	page = alloc_pages_node(cpu_to_node(cpu),
+				GFP_KERNEL | __GFP_NORETRY | __GFP_ZERO,
 				cpu_buffer->buffer->subbuf_order);
 	if (!page) {
 		kfree(bpage);
--
2.44.0.rc0.258.g7320e95886-goog
[PATCH v18 0/6] Introducing trace buffer mapping by user-space
The tracing ring-buffers can be stored on disk or sent over the network without any copy via splice. However, the latter doesn't allow real-time processing of the traces. A solution is to give userspace direct access to the ring-buffer pages via a mapping. An application can then become a consumer of the ring-buffer, in a similar fashion to what trace_pipe offers.

Support for this new feature can already be found in libtracefs from version 1.8, when built with EXTRA_CFLAGS=-DFORCE_MMAP_ENABLE.

Vincent

v17 -> v18:
  * Fix lockdep_assert_held
  * Fix spin_lock_init typo
  * Fix CONFIG_TRACER_MAX_TRACE typo

v16 -> v17:
  * Documentation and comments improvements.
  * Create get/put_snapshot_map() for clearer code.
  * Replace kzalloc with kcalloc.
  * Fix -ENOMEM handling in rb_alloc_meta_page().
  * Move flush(cpu_buffer->reader_page) behind the reader lock.
  * Move all inc/dec of cpu_buffer->mapped behind reader lock and buffer
    mutex. (removes READ_ONCE/WRITE_ONCE accesses).

v15 -> v16:
  * Add comment for the dcache flush.
  * Remove now unnecessary WRITE_ONCE for the meta-page.

v14 -> v15:
  * Add meta-page and reader-page flush. Intends to fix the mapping
    for VIVT and aliasing-VIPT data caches.
  * -EPERM on VM_EXEC.
  * Fix build warning !CONFIG_TRACER_MAX_TRACE.

v13 -> v14:
  * All cpu_buffer->mapped readers use READ_ONCE (except for swap_cpu)
  * on unmap, sync meta-page teardown with the reader_lock instead of
    the synchronize_rcu.
  * Add a dedicated spinlock for trace_array ->snapshot and ->mapped.
    (intends to fix a lockdep issue)
  * Add kerneldoc for flags and Reserved fields.
  * Add kselftest for snapshot/map mutual exclusion.

v12 -> v13:
  * Swap subbufs_{touched,lost} for Reserved fields.
  * Add a flag field in the meta-page.
  * Fix CONFIG_TRACER_MAX_TRACE.
  * Rebase on top of trace/urgent.
  * Add a comment for try_unregister_trigger()

v11 -> v12:
  * Fix code sample mmap bug.
  * Add logging in sample code.
  * Reset tracer in selftest.
  * Add a refcount for the snapshot users.
* Prevent mapping when there are snapshot users and vice versa. * Refine the meta-page. * Fix types in the meta-page. * Collect Reviewed-by. v10 -> v11: * Add Documentation and code sample. * Add a selftest. * Move all the update to the meta-page into a single rb_update_meta_page(). * rb_update_meta_page() is now called from ring_buffer_map_get_reader() to fix NOBLOCK callers. * kerneldoc for struct trace_meta_page. * Add a patch to zero all the ring-buffer allocations. v9 -> v10: * Refactor rb_update_meta_page() * In-loop declaration for foreach_subbuf_page() * Check for cpu_buffer->mapped overflow v8 -> v9: * Fix the unlock path in ring_buffer_map() * Fix cpu_buffer cast with rb_work_rq->is_cpu_buffer * Rebase on linux-trace/for-next (3cb3091138ca0921c4569bcf7ffa062519639b6a) v7 -> v8: * Drop the subbufs renaming into bpages * Use subbuf as a name when relevant v6 -> v7: * Rebase onto lore.kernel.org/lkml/20231215175502.106587...@goodmis.org/ * Support for subbufs * Rename subbufs into bpages v5 -> v6: * Rebase on next-20230802. * (unsigned long) -> (void *) cast for virt_to_page(). * Add a wait for the GET_READER_PAGE ioctl. * Move writer fields update (overrun/pages_lost/entries/pages_touched) in the irq_work. * Rearrange id in struct buffer_page. * Rearrange the meta-page. * ring_buffer_meta_page -> trace_buffer_meta_page. * Add meta_struct_len into the meta-page. v4 -> v5: * Trivial rebase onto 6.5-rc3 (previously 6.4-rc3) v3 -> v4: * Add to the meta-page: - pages_lost / pages_read (allow to compute how full is the ring-buffer) - read (allow to compute how many entries can be read) - A reader_page struct. * Rename ring_buffer_meta_header -> ring_buffer_meta * Rename ring_buffer_get_reader_page -> ring_buffer_map_get_reader_page * Properly consume events on ring_buffer_map_get_reader_page() with rb_advance_reader(). 
v2 -> v3: * Remove data page list (for non-consuming read) ** Implies removing order > 0 meta-page * Add a new meta page field ->read * Rename ring_buffer_meta_page_header into ring_buffer_meta_header v1 -> v2: * Hide data_pages from the userspace struct * Fix META_PAGE_MAX_PAGES * Support for order > 0 meta-page * Add missing page->mapping. Vincent Donnefort (6): ring-buffer: Zero ring-buffer sub-buffers ring-buffer: Introducing ring-buffer mapping functions tracing: Add snapshot refcount tracing: Allow user-space mapping of the ring-buffer Documentation: tracing: Add ring-buffer mapping ring-buffer/selftest: Add ring-buffer mapping test Documentation/trace/index.rst | 1 + Documentation/trace/ring-buffer-map.rst | 106 + include/linux/ring_buffer.h | 7 + include/uapi/linux/trace_mmap.h | 48 +++ kernel/trace/ring_buffer.c| 385
Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Tue, 2024-02-20 at 14:18 +0100, Michal Koutný wrote:
> On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai" wrote:
> > I am not sure, but is it possible or legal for an ancestor to have
> > less limit than children?
>
> Why not?
> It is desired for proper hierarchical delegation and the tightest limit
> of ancestors applies (cf. memory.max).

OK. Thanks for the info.
Re: [PATCH v3 1/7] remoteproc: Add TEE support
Good morning, On Wed, Feb 14, 2024 at 06:21:21PM +0100, Arnaud Pouliquen wrote: > From: Arnaud Pouliquen > > Add a remoteproc TEE (Trusted Execution Environment) driver > that will be probed by the TEE bus. If the associated Trusted > application is supported on secure part this device offers a client > interface to load a firmware in the secure part. > This firmware could be authenticated and decrypted by the secure > trusted application. > > Signed-off-by: Arnaud Pouliquen > --- > update from V2 > - Use 'tee_rproc' prefix for all functions > - rename tee_rproc_get_loaded_rsc_table to tee_rproc_find_loaded_rsc_table > - redefine fonction to better match with the rproc_ops structure format > - replace 'struct tee_rproc' parameter by 'struct rproc' parameter > - rename 'rproc_tee_get_rsc_table()' to tee_rproc_get_loaded_rsc_table() > and rework it to remove the cached_table management. > - introduce tee_rproc_get_context() to get the tee_rproc struct from the > rproc struct > - rename tee_rproc_get_loaded_rsc_table() to > tee_rproc_find_loaded_rsc_table() > - remove useless check on tee_rproc_ctx structure in tee_rproc_register() > and tee_rproc_unregister() > - fix test on the return of tee_rproc_ctx = devm_kzalloc() > - remove useless includes and unused tee_rproc_mem structure. > --- > drivers/remoteproc/Kconfig | 9 + > drivers/remoteproc/Makefile | 1 + > drivers/remoteproc/tee_remoteproc.c | 397 > include/linux/tee_remoteproc.h | 102 +++ > 4 files changed, 509 insertions(+) > create mode 100644 drivers/remoteproc/tee_remoteproc.c > create mode 100644 include/linux/tee_remoteproc.h > > diff --git a/drivers/remoteproc/Kconfig b/drivers/remoteproc/Kconfig > index 48845dc8fa85..85299606806c 100644 > --- a/drivers/remoteproc/Kconfig > +++ b/drivers/remoteproc/Kconfig > @@ -365,6 +365,15 @@ config XLNX_R5_REMOTEPROC > > It's safe to say N if not interested in using RPU r5f cores. 
> > + > +config TEE_REMOTEPROC > + tristate "trusted firmware support by a TEE application" s/trusted/Trusted And the wording will have to change a little, I will advise on this later. > + depends on OPTEE > + help > + Support for trusted remote processors firmware. The firmware > + authentication and/or decryption are managed by a trusted application. > + This can be either built-in or a loadable module. > + > endif # REMOTEPROC > > endmenu > diff --git a/drivers/remoteproc/Makefile b/drivers/remoteproc/Makefile > index 91314a9b43ce..fa8daebce277 100644 > --- a/drivers/remoteproc/Makefile > +++ b/drivers/remoteproc/Makefile > @@ -36,6 +36,7 @@ obj-$(CONFIG_RCAR_REMOTEPROC) += rcar_rproc.o > obj-$(CONFIG_ST_REMOTEPROC) += st_remoteproc.o > obj-$(CONFIG_ST_SLIM_REMOTEPROC) += st_slim_rproc.o > obj-$(CONFIG_STM32_RPROC)+= stm32_rproc.o > +obj-$(CONFIG_TEE_REMOTEPROC) += tee_remoteproc.o > obj-$(CONFIG_TI_K3_DSP_REMOTEPROC) += ti_k3_dsp_remoteproc.o > obj-$(CONFIG_TI_K3_R5_REMOTEPROC)+= ti_k3_r5_remoteproc.o > obj-$(CONFIG_XLNX_R5_REMOTEPROC) += xlnx_r5_remoteproc.o > diff --git a/drivers/remoteproc/tee_remoteproc.c > b/drivers/remoteproc/tee_remoteproc.c > new file mode 100644 > index ..ac727e062d00 > --- /dev/null > +++ b/drivers/remoteproc/tee_remoteproc.c > @@ -0,0 +1,397 @@ > +// SPDX-License-Identifier: GPL-2.0-or-later > +/* > + * Copyright (C) STMicroelectronics 2023 - All Rights Reserved > + * Author: Arnaud Pouliquen > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "remoteproc_internal.h" > + > +#define MAX_TEE_PARAM_ARRY_MEMBER4 > + > +/* > + * Authentication of the firmware and load in the remote processor memory > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [in] params[1].memref: buffer containing the image of the > buffer > + */ > +#define TA_RPROC_FW_CMD_LOAD_FW 1 > + > +/* > + * Start the remote processor > + * > + * [in] params[0].value.a: unique 
32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_START_FW 2 > + > +/* > + * Stop the remote processor > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + */ > +#define TA_RPROC_FW_CMD_STOP_FW 3 > + > +/* > + * Return the address of the resource table, or 0 if not found > + * No check is done to verify that the address returned is accessible by > + * the non secure context. If the resource table is loaded in a protected > + * memory the access by the non secure context will lead to a data abort. > + * > + * [in] params[0].value.a: unique 32bit identifier of the remote processor > + * [out] params[1].value.a: 32bit LSB resource table memory address > + * [out] params[1].value.b: 32bit MSB resource table memory
Re: [PATCH] MAINTAINERS: add memory mapping entry with reviewers
On 2/20/24 07:44, Lorenzo Stoakes wrote:
> Recently there have been a number of patches which have affected various
> aspects of the memory mapping logic as implemented in mm/mmap.c where it
> would have been useful for regular contributors to have been notified.
>
> Add an entry for this part of mm in particular with regular contributors
> tagged as reviewers.
>
> Signed-off-by: Lorenzo Stoakes

Acked-by: Vlastimil Babka

Would be nice if this targeted sub-reviewing could be managed in a simpler way as part of the MM section instead of having to define a new one which duplicates half of it, but until we have such support...

> ---
> MAINTAINERS | 11 +++
> 1 file changed, 11 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index b6c5c92e49db..129a237b7880 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -14121,6 +14121,17 @@ F: tools/mm/
>  F: tools/testing/selftests/mm/
>  N: include/linux/page[-_]*
>
> +MEMORY MAPPING
> +M: Andrew Morton
> +R: Liam R. Howlett
> +R: Vlastimil Babka
> +R: Lorenzo Stoakes
> +L: linux...@kvack.org
> +S: Maintained
> +W: http://www.linux-mm.org
> +T: git git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
> +F: mm/mmap.c
> +
>  MEMORY TECHNOLOGY DEVICES (MTD)
>  M: Miquel Raynal
>  M: Richard Weinberger
Re: [PATCH v5 0/4] Add samsung-matisselte and common matisse dtsi
On Tue, 31 Oct 2023 15:00:54 +0100, Stefan Hansson wrote: > This series adds a common samsung-matisse dtsi and reworks > samsung-matisse-wifi to use it, and introduces samsung-matisselte. I > choose matisselte over matisse-lte as this is how most other devices > (klte, s3ve3g) do it and it is the codename that Samsung gave the > device. See individual commits for more information. > Applied, thanks! [1/4] ARM: dts: qcom: samsung-matisse-common: Add initial common device tree (no commit info) [2/4] dt-bindings: arm: qcom: Add Samsung Galaxy Tab 4 10.1 LTE commit: 2a478a521876d9daea016be3e9513a67871169b5 [3/4] ARM: dts: qcom: Add support for Samsung Galaxy Tab 4 10.1 LTE (SM-T535) (no commit info) [4/4] ARM: dts: qcom: samsung-matisse-common: Add UART (no commit info) Best regards, -- Bjorn Andersson
[RFC PATCH 11/14] csky/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on C-SKY. With TIF_NOTIFY_IPI, a sender sending an IPI to an idle CPU in TIF_POLLING mode will set the TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This avoids spurious calls to schedule_idle() in cases where an IPI does not necessarily wake up a task on the idle CPU.

Cc: Guo Ren
Cc: "Rafael J. Wysocki"
Cc: Daniel Lezcano
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: K Prateek Nayak
Cc: linux-c...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak
---
 arch/csky/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/csky/include/asm/thread_info.h b/arch/csky/include/asm/thread_info.h
index b5ed788f0c68..9bc7a037c476 100644
--- a/arch/csky/include/asm/thread_info.h
+++ b/arch/csky/include/asm/thread_info.h
@@ -61,6 +61,7 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_SYSCALL_TRACEPOINT	5	/* syscall tracepoint instrumentation */
 #define TIF_SYSCALL_AUDIT	6	/* syscall auditing */
 #define TIF_NOTIFY_SIGNAL	7	/* signal notifications exist */
+#define TIF_NOTIFY_IPI		8	/* Pending IPI on TIF_POLLING idle CPU */
 #define TIF_POLLING_NRFLAG	16	/* poll_idle() is TIF_NEED_RESCHED */
 #define TIF_MEMDIE		18	/* is terminating due to OOM killer */
 #define TIF_RESTORE_SIGMASK	20	/* restore signal mask in do_signal() */
@@ -73,6 +74,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_SYSCALL_TRACEPOINT	(1 << TIF_SYSCALL_TRACEPOINT)
 #define _TIF_SYSCALL_AUDIT	(1 << TIF_SYSCALL_AUDIT)
 #define _TIF_NOTIFY_SIGNAL	(1 << TIF_NOTIFY_SIGNAL)
+#define _TIF_NOTIFY_IPI		(1 << TIF_NOTIFY_IPI)
 #define _TIF_UPROBE		(1 << TIF_UPROBE)
 #define _TIF_POLLING_NRFLAG	(1 << TIF_POLLING_NRFLAG)
 #define _TIF_MEMDIE		(1 << TIF_MEMDIE)
--
2.34.1
[RFC PATCH 07/14] openrisc/thread_info: Introduce TIF_NOTIFY_IPI flag
Add support for TIF_NOTIFY_IPI on OpenRISC. With TIF_NOTIFY_IPI, a sender sending an IPI to an idle CPU in TIF_POLLING mode will set the TIF_NOTIFY_IPI flag in the target's idle task's thread_info to pull the CPU out of idle, as opposed to setting TIF_NEED_RESCHED previously. This avoids spurious calls to schedule_idle() in cases where an IPI does not necessarily wake up a task on the idle CPU.

Cc: Jonas Bonn
Cc: Stefan Kristiansson
Cc: Stafford Horne
Cc: "Rafael J. Wysocki"
Cc: Daniel Lezcano
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: Juri Lelli
Cc: Vincent Guittot
Cc: Dietmar Eggemann
Cc: Steven Rostedt
Cc: Ben Segall
Cc: Mel Gorman
Cc: Daniel Bristot de Oliveira
Cc: Valentin Schneider
Cc: K Prateek Nayak
Cc: linux-openr...@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux...@vger.kernel.org
Signed-off-by: K Prateek Nayak
---
 arch/openrisc/include/asm/thread_info.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/openrisc/include/asm/thread_info.h b/arch/openrisc/include/asm/thread_info.h
index 4af3049c34c2..6a386703bc43 100644
--- a/arch/openrisc/include/asm/thread_info.h
+++ b/arch/openrisc/include/asm/thread_info.h
@@ -92,6 +92,7 @@ register struct thread_info *current_thread_info_reg asm("r10");
					 * mode
					 */
 #define TIF_NOTIFY_SIGNAL	5	/* signal notifications exist */
+#define TIF_NOTIFY_IPI		6	/* Pending IPI on TIF_POLLING idle CPU */
 #define TIF_SYSCALL_TRACEPOINT	8	/* for ftrace syscall instrumentation */
 #define TIF_RESTORE_SIGMASK	9
 #define TIF_POLLING_NRFLAG	16	/* true if poll_idle() is polling
					 * TIF_NEED_RESCHED
@@ -104,6 +105,7 @@ register struct thread_info *current_thread_info_reg asm("r10");
 #define _TIF_NEED_RESCHED	(1<
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On Tue, 20 Feb 2024 09:50:13 -0500 Mathieu Desnoyers wrote:

> On 2024-02-20 09:19, Steven Rostedt wrote:
> > On Mon, 19 Feb 2024 18:20:32 -0500
> > Steven Rostedt wrote:
> >
> >> Instead of using local_add_return() to reserve the ring buffer data,
> >> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> >> reservation with the time keeping code.
> >>
> >> Although, it does not get rid of the double time stamps (before_stamp and
> >> write_stamp), using cmpxchg() does get rid of the more complex case when
> >> an interrupting event occurs between getting the timestamps and reserving
> >> the data, as when that happens, it just tries again instead of dealing
> >> with it.
> >>
> >> Before we had:
> >>
> >>	w = local_read(&tail_page->write);
> >>	/* get time stamps */
> >>	write = local_add_return(length, &tail_page->write);
> >>	if (write - length == w) {
> >>		/* do simple case */
> >>	} else {
> >>		/* do complex case */
> >>	}
> >>
> >> By switching the local_add_return() to a local_try_cmpxchg() it can now be:
> >>
> >>	w = local_read(&tail_page->write);
> >> again:
> >>	/* get time stamps */
> >>	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
> >>		goto again;
> >>
> >>	/* do simple case */
> >
> > Something about this logic is causing __rb_next_reserve() to sometimes
> > always return -EAGAIN and triggering the:
> >
> > RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)
> >
> > Which disables the ring buffer.
> >
> > I'm not sure what it is, but until I do, I'm removing the patch from my
> > queue.
>
> Try resetting the info->add_timestamp flags to add_ts_default on goto again
> within __rb_reserve_next().

I was looking at that too, but I don't know how it will make a difference. Note, the test that fails is in my test suite, and takes about a half hour to get there. Running that suite takes up resources (it's my main test suite for all changes).
I'm currently testing other patches so I either need to figure it out through inspection, or this will need to wait a while. -- Steve
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On 2024-02-20 09:19, Steven Rostedt wrote:
> On Mon, 19 Feb 2024 18:20:32 -0500
> Steven Rostedt wrote:
>
>> Instead of using local_add_return() to reserve the ring buffer data,
>> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
>> reservation with the time keeping code.
>>
>> Although, it does not get rid of the double time stamps (before_stamp and
>> write_stamp), using cmpxchg() does get rid of the more complex case when
>> an interrupting event occurs between getting the timestamps and reserving
>> the data, as when that happens, it just tries again instead of dealing
>> with it.
>>
>> Before we had:
>>
>>	w = local_read(&tail_page->write);
>>	/* get time stamps */
>>	write = local_add_return(length, &tail_page->write);
>>	if (write - length == w) {
>>		/* do simple case */
>>	} else {
>>		/* do complex case */
>>	}
>>
>> By switching the local_add_return() to a local_try_cmpxchg() it can now be:
>>
>>	w = local_read(&tail_page->write);
>> again:
>>	/* get time stamps */
>>	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
>>		goto again;
>>
>>	/* do simple case */
>
> Something about this logic is causing __rb_next_reserve() to sometimes
> always return -EAGAIN and triggering the:
>
> RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)
>
> Which disables the ring buffer.
>
> I'm not sure what it is, but until I do, I'm removing the patch from my
> queue.

Try resetting the info->add_timestamp flags to add_ts_default on goto again within __rb_reserve_next().

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com
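Mathieu's suggestion amounts to a general rule for retry loops built around try_cmpxchg(): any state derived before the compare-exchange must be recomputed from scratch on every attempt, or a failed attempt leaks its flags into the next one. A hedged sketch of the pattern (the flag names and `need_extend()` predicate are illustrative stand-ins for the ring-buffer's `info->add_timestamp` logic, not the actual kernel code):

```c
#include <assert.h>
#include <stdatomic.h>

#define ADD_TS_DEFAULT	0
#define ADD_TS_EXTEND	(1 << 0)	/* illustrative stand-in flags */

static atomic_uint write_pos;		/* stand-in for tail_page->write */

static int first_attempt_only(int attempt)
{
	return attempt == 0;
}

/*
 * Reserve `len` bytes. `need_extend()` stands in for the timestamp
 * logic that may set extra flags on any given attempt.
 */
static unsigned int reserve(unsigned int len, int (*need_extend)(int attempt))
{
	unsigned int w = atomic_load(&write_pos);
	unsigned int flags;
	int attempt = 0;

	do {
		/* Reset per attempt -- this is the point of the suggestion;
		 * hoisting this line above the loop would carry a failed
		 * attempt's flags into the retry. */
		flags = ADD_TS_DEFAULT;
		if (need_extend(attempt++))
			flags |= ADD_TS_EXTEND;
	} while (!atomic_compare_exchange_strong(&write_pos, &w, w + len));

	return flags;
}
```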
[PATCH] ring-buffer: Do not let subbuf be bigger than write mask
From: "Steven Rostedt (Google)"

The data on the subbuffer is measured by a write variable that also contains status flags. The counter itself is just 20 bits in length. If the subbuffer is bigger than the counter, it will fail. Make sure the subbuffer cannot be set to a size greater than the counter that keeps track of the data on the subbuffer.

Fixes: 2808e31ec12e5 ("ring-buffer: Add interface for configuring trace sub buffer size")
Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/ring_buffer.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index aa54266f5e1f..3852f3b001cc 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -5878,6 +5878,10 @@ int ring_buffer_subbuf_order_set(struct trace_buffer *buffer, int order)
 	if (psize <= BUF_PAGE_HDR_SIZE)
 		return -EINVAL;

+	/* Size of a subbuf cannot be greater than the write counter */
+	if (psize > RB_WRITE_MASK + 1)
+		return -EINVAL;
+
 	old_order = buffer->subbuf_order;
 	old_size = buffer->subbuf_size;
--
2.43.0
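The arithmetic behind the check can be sketched in user space. Assuming RB_WRITE_MASK covers the 20 bits the commit message describes (an assumption for illustration; the kernel defines the real mask), the largest representable subbuf is exactly `RB_WRITE_MASK + 1` bytes:

```c
#include <assert.h>

/* 20-bit write counter, as described in the commit message (assumed value) */
#define RB_WRITE_MASK	((1UL << 20) - 1)

/*
 * Mirrors the new sanity check: a subbuf of `page_size << order` bytes
 * can only be accounted if the write counter can hold its byte count.
 */
static int subbuf_order_valid(unsigned long page_size, unsigned int order)
{
	unsigned long psize = page_size << order;
	return psize <= RB_WRITE_MASK + 1;
}
```

With 4 KiB pages, order 8 gives a 1 MiB subbuf (exactly the counter range) and is the largest that passes; order 9 would overflow the counter and is rejected.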
Re: [PATCH v3] ring-buffer: Simplify reservation with try_cmpxchg() loop
On Mon, 19 Feb 2024 18:20:32 -0500 Steven Rostedt wrote:

> Instead of using local_add_return() to reserve the ring buffer data,
> Mathieu Desnoyers suggested using local_cmpxchg(). This would simplify the
> reservation with the time keeping code.
>
> Although, it does not get rid of the double time stamps (before_stamp and
> write_stamp), using cmpxchg() does get rid of the more complex case when
> an interrupting event occurs between getting the timestamps and reserving
> the data, as when that happens, it just tries again instead of dealing
> with it.
>
> Before we had:
>
>	w = local_read(&tail_page->write);
>	/* get time stamps */
>	write = local_add_return(length, &tail_page->write);
>	if (write - length == w) {
>		/* do simple case */
>	} else {
>		/* do complex case */
>	}
>
> By switching the local_add_return() to a local_try_cmpxchg() it can now be:
>
>	w = local_read(&tail_page->write);
> again:
>	/* get time stamps */
>	if (!local_try_cmpxchg(&tail_page->write, &w, w + length))
>		goto again;
>
>	/* do simple case */

Something about this logic is causing __rb_next_reserve() to sometimes always return -EAGAIN and triggering the:

  RB_WARN_ON(cpu_buffer, ++nr_loops > 1000)

Which disables the ring buffer.

I'm not sure what it is, but until I do, I'm removing the patch from my queue.

-- Steve
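The reservation pattern being discussed can be sketched with C11 atomics in place of the kernel's local_t operations — a simplified stand-in, not the ring-buffer code itself (which also folds timestamps and status flags into `write`):

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong tail_write;	/* stand-in for tail_page->write */

/*
 * Retry-from-the-top reservation: whenever another writer moved `write`
 * underneath us, the compare-exchange fails, `w` is refreshed with the
 * interfering writer's value, and we try again -- re-reading the time
 * stamps on every attempt, as in the patch.
 */
static unsigned long rb_reserve(unsigned long length)
{
	unsigned long w = atomic_load(&tail_write);

	for (;;) {
		/* get time stamps here: must be (re)read on every attempt */
		if (atomic_compare_exchange_weak(&tail_write, &w, w + length))
			return w;	/* offset where our event starts */
		/* CAS failed: w now holds the current value; retry */
	}
}
```

The attraction over `local_add_return()` is that a failed attempt simply restarts, instead of having to repair a reservation made with stale timestamps.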
[PATCH v4 3/3] tracing: Move saved_cmdline code into trace_sched_switch.c
From: "Steven Rostedt (Google)"

The code that handles saved_cmdlines is split between the trace.c file and the trace_sched_switch.c. There's some history to this. The trace_sched_switch.c was originally created to handle the sched_switch tracer that was deprecated due to the sched_switch trace event making it obsolete. But that file did not get deleted as it had some code to help with saved_cmdlines. But trace.c has grown tremendously since then. Just move all the saved_cmdlines code into trace_sched_switch.c as that's the only reason that file still exists, and trace.c has gotten too big.

No functional changes.

Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c              | 515 +-
 kernel/trace/trace.h              |  10 +
 kernel/trace/trace_sched_switch.c | 515 ++
 3 files changed, 528 insertions(+), 512 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 06c593fc93d0..50fab999e72e 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -39,7 +39,6 @@
 #include
 #include
 #include
-#include
 #include
 #include
 #include
@@ -105,7 +104,7 @@ dummy_set_flag(struct trace_array *tr, u32 old_flags, u32 bit, int set)
  * tracing is active, only save the comm when a trace event
  * occurred.
  */
-static DEFINE_PER_CPU(bool, trace_taskinfo_save);
+DEFINE_PER_CPU(bool, trace_taskinfo_save);

 /*
  * Kill all tracing for good (never come back).
@@ -2299,96 +2298,6 @@ void tracing_reset_all_online_cpus(void)
 	mutex_unlock(&trace_types_lock);
 }

-/*
- * The tgid_map array maps from pid to tgid; i.e. the value stored at index i
- * is the tgid last observed corresponding to pid=i.
- */
-static int *tgid_map;
-
-/* The maximum valid index into tgid_map. */
-static size_t tgid_map_max;
-
-#define SAVED_CMDLINES_DEFAULT 128
-#define NO_CMDLINE_MAP UINT_MAX
-/*
- * Preemption must be disabled before acquiring trace_cmdline_lock.
- * The various trace_arrays' max_lock must be acquired in a context
- * where interrupt is disabled.
- */
-static arch_spinlock_t trace_cmdline_lock = __ARCH_SPIN_LOCK_UNLOCKED;
-struct saved_cmdlines_buffer {
-	unsigned map_pid_to_cmdline[PID_MAX_DEFAULT+1];
-	unsigned *map_cmdline_to_pid;
-	unsigned cmdline_num;
-	int cmdline_idx;
-	char saved_cmdlines[];
-};
-static struct saved_cmdlines_buffer *savedcmd;
-
-/* Holds the size of a cmdline and pid element */
-#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s)			\
-	(TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0]))
-
-static inline char *get_saved_cmdlines(int idx)
-{
-	return &s->saved_cmdlines[idx * TASK_COMM_LEN];
-}
-
-static inline void set_cmdline(int idx, const char *cmdline)
-{
-	strncpy(get_saved_cmdlines(idx), cmdline, TASK_COMM_LEN);
-}
-
-static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s)
-{
-	int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN);
-
-	kmemleak_free(s);
-	free_pages((unsigned long)s, order);
-}
-
-static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val)
-{
-	struct saved_cmdlines_buffer *s;
-	struct page *page;
-	int orig_size, size;
-	int order;
-
-	/* Figure out how much is needed to hold the given number of cmdlines */
-	orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
-	order = get_order(orig_size);
-	size = 1 << (order + PAGE_SHIFT);
-	page = alloc_pages(GFP_KERNEL, order);
-	if (!page)
-		return NULL;
-
-	s = page_address(page);
-	kmemleak_alloc(s, size, 1, GFP_KERNEL);
-	memset(s, 0, sizeof(*s));
-
-	/* Round up to actual allocation */
-	val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s);
-	s->cmdline_num = val;
-
-	/* Place map_cmdline_to_pid array right after saved_cmdlines */
-	s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN];
-
-	s->cmdline_idx = 0;
-	memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP,
-	       sizeof(s->map_pid_to_cmdline));
-	memset(s->map_cmdline_to_pid, NO_CMDLINE_MAP,
-	       val * sizeof(*s->map_cmdline_to_pid));
-
-	return s;
-}
-
-static int trace_create_savedcmd(void)
-{
-	savedcmd = allocate_cmdlines_buffer(SAVED_CMDLINES_DEFAULT);
-
-	return savedcmd ? 0 : -ENOMEM;
-}
-
 int is_tracing_stopped(void)
 {
 	return global_trace.stop_count;
@@ -2481,201 +2390,6 @@ void tracing_stop(void)
 	return tracing_stop_tr(&global_trace);
 }

-static int trace_save_cmdline(struct task_struct *tsk)
-{
-	unsigned tpid, idx;
-
-	/* treat recording of idle task as a success */
-	if (!tsk->pid)
-		return 1;
-
-	tpid = tsk->pid & (PID_MAX_DEFAULT - 1);
-
-	/*
-	 * It's not the end of the world if we don't get
-	 * the lock, but we also don't want to spin
-	 * nor do we want
[PATCH v4 2/3] tracing: Move open coded processing of tgid_map into helper function
From: "Steven Rostedt (Google)"

In preparation of moving the saved_cmdlines logic out of trace.c and into trace_sched_switch.c, replace the open coded manipulation of tgid_map in set_tracer_flag() with a helper function trace_alloc_tgid_map() so that it can be easily moved into trace_sched_switch.c without changing existing functions in trace.c.

No functional changes.

Signed-off-by: Steven Rostedt (Google)
---
 kernel/trace/trace.c | 38 +++---
 1 file changed, 23 insertions(+), 15 deletions(-)

diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c
index 52faa30e64ed..06c593fc93d0 100644
--- a/kernel/trace/trace.c
+++ b/kernel/trace/trace.c
@@ -5432,10 +5432,31 @@ int trace_keep_overwrite(struct tracer *tracer, u32 mask, int set)
 	return 0;
 }

-int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
+static int trace_alloc_tgid_map(void)
 {
 	int *map;

+	if (tgid_map)
+		return 0;
+
+	tgid_map_max = pid_max;
+	map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
+		       GFP_KERNEL);
+	if (!map)
+		return -ENOMEM;
+
+	/*
+	 * Pairs with smp_load_acquire() in
+	 * trace_find_tgid_ptr() to ensure that if it observes
+	 * the tgid_map we just allocated then it also observes
+	 * the corresponding tgid_map_max value.
+	 */
+	smp_store_release(&tgid_map, map);
+	return 0;
+}
+
+int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
+{
 	if ((mask == TRACE_ITER_RECORD_TGID) ||
 	    (mask == TRACE_ITER_RECORD_CMD))
 		lockdep_assert_held(&event_mutex);
@@ -5458,20 +5479,7 @@ int set_tracer_flag(struct trace_array *tr, unsigned int mask, int enabled)
 		trace_event_enable_cmd_record(enabled);

 	if (mask == TRACE_ITER_RECORD_TGID) {
-		if (!tgid_map) {
-			tgid_map_max = pid_max;
-			map = kvcalloc(tgid_map_max + 1, sizeof(*tgid_map),
-				       GFP_KERNEL);
-
-			/*
-			 * Pairs with smp_load_acquire() in
-			 * trace_find_tgid_ptr() to ensure that if it observes
-			 * the tgid_map we just allocated then it also observes
-			 * the corresponding tgid_map_max value.
-			 */
-			smp_store_release(&tgid_map, map);
-		}
-		if (!tgid_map) {
+		if (trace_alloc_tgid_map() < 0) {
 			tr->trace_flags &= ~TRACE_ITER_RECORD_TGID;
 			return -ENOMEM;
 		}
--
2.43.0
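The smp_store_release()/smp_load_acquire() pairing in the helper is the classic pointer-publication pattern: the bound (`tgid_map_max`) is written before the pointer is published with release semantics, so any reader that observes the pointer via an acquire load also observes the bound. A user-space sketch with C11 atomics (function names are illustrative stand-ins for the kernel's trace_alloc_tgid_map()/trace_find_tgid_ptr()):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

static size_t tgid_map_max;
static _Atomic(int *) tgid_map;

static int alloc_tgid_map(size_t pid_max)
{
	int *map;

	if (atomic_load_explicit(&tgid_map, memory_order_relaxed))
		return 0;

	map = calloc(pid_max + 1, sizeof(*map));
	if (!map)
		return -1;

	tgid_map_max = pid_max;
	/* Pairs with the acquire load in find_tgid_ptr(): the bound above
	 * is guaranteed visible before the pointer is. */
	atomic_store_explicit(&tgid_map, map, memory_order_release);
	return 0;
}

static int *find_tgid_ptr(size_t pid)
{
	int *map = atomic_load_explicit(&tgid_map, memory_order_acquire);

	/* A non-NULL map implies tgid_map_max is the published value */
	return (map && pid <= tgid_map_max) ? &map[pid] : NULL;
}
```

Without the release/acquire pair, a concurrent reader could see the new pointer but a stale (zero) bound and reject every lookup.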
[PATCH v4 1/3] tracing: Have saved_cmdlines arrays all in one allocation
From: "Steven Rostedt (Google)" The saved_cmdlines have three arrays for mapping PIDs to COMMs: - map_pid_to_cmdline[] - map_cmdline_to_pid[] - saved_cmdlines The map_pid_to_cmdline[] is PID_MAX_DEFAULT in size and holds the index into the other arrays. The map_cmdline_to_pid[] is a mapping back to the full pid as it can be larger than PID_MAX_DEFAULT. And the saved_cmdlines[] just holds the COMMs associated to the pids. Currently the map_pid_to_cmdline[] and saved_cmdlines[] are allocated together (in reality the saved_cmdlines is just in the memory of the rounding of the allocation of the structure as it is always allocated in powers of two). The map_cmdline_to_pid[] array is allocated separately. Since the rounding to a power of two is rather large (it allows for 8000 elements in saved_cmdlines), also include the map_cmdline_to_pid[] array. (This drops it to 6000 by default, which is still plenty for most use cases). This saves even more memory as the map_cmdline_to_pid[] array doesn't need to be allocated. Link: https://lore.kernel.org/linux-trace-kernel/20240212174011.06821...@gandalf.local.home/ Fixes: 44dc5c41b5b1 ("tracing: Fix wasted memory in saved_cmdlines logic") Acked-by: Masami Hiramatsu (Google) Signed-off-by: Steven Rostedt (Google) --- Changes since v4: https://lore.kernel.org/linux-trace-kernel/20240216210150.379809...@goodmis.org - Do not kfree() map_cmdline_to_pid now that it is included in the page allocations. 
kernel/trace/trace.c | 18 ++++++++---------- 1 file changed, 8 insertions(+), 10 deletions(-) diff --git a/kernel/trace/trace.c b/kernel/trace/trace.c index 8198bfc54b58..52faa30e64ed 100644 --- a/kernel/trace/trace.c +++ b/kernel/trace/trace.c @@ -2325,6 +2325,10 @@ struct saved_cmdlines_buffer { }; static struct saved_cmdlines_buffer *savedcmd; +/* Holds the size of a cmdline and pid element */ +#define SAVED_CMDLINE_MAP_ELEMENT_SIZE(s) \ + (TASK_COMM_LEN + sizeof((s)->map_cmdline_to_pid[0])) + static inline char *get_saved_cmdlines(int idx) { return &savedcmd->saved_cmdlines[idx * TASK_COMM_LEN]; @@ -2339,7 +2343,6 @@ static void free_saved_cmdlines_buffer(struct saved_cmdlines_buffer *s) { int order = get_order(sizeof(*s) + s->cmdline_num * TASK_COMM_LEN); - kfree(s->map_cmdline_to_pid); kmemleak_free(s); free_pages((unsigned long)s, order); } @@ -2352,7 +2355,7 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val) int order; /* Figure out how much is needed to hold the given number of cmdlines */ - orig_size = sizeof(*s) + val * TASK_COMM_LEN; + orig_size = sizeof(*s) + val * SAVED_CMDLINE_MAP_ELEMENT_SIZE(s); order = get_order(orig_size); size = 1 << (order + PAGE_SHIFT); page = alloc_pages(GFP_KERNEL, order); @@ -2364,16 +2367,11 @@ static struct saved_cmdlines_buffer *allocate_cmdlines_buffer(unsigned int val) memset(s, 0, sizeof(*s)); /* Round up to actual allocation */ - val = (size - sizeof(*s)) / TASK_COMM_LEN; + val = (size - sizeof(*s)) / SAVED_CMDLINE_MAP_ELEMENT_SIZE(s); s->cmdline_num = val; - s->map_cmdline_to_pid = kmalloc_array(val, - sizeof(*s->map_cmdline_to_pid), - GFP_KERNEL); - if (!s->map_cmdline_to_pid) { - free_saved_cmdlines_buffer(s); - return NULL; - } + /* Place map_cmdline_to_pid array right after saved_cmdlines */ + s->map_cmdline_to_pid = (unsigned *)&s->saved_cmdlines[val * TASK_COMM_LEN]; s->cmdline_idx = 0; memset(&s->map_pid_to_cmdline, NO_CMDLINE_MAP, -- 2.43.0
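The sizing logic in the hunk above, round the request up to a power-of-two page allocation and then recompute how many elements actually fit, can be sketched stand-alone (illustrative constants; the kernel uses get_order() and PAGE_SHIFT):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_SKETCH 4096u
#define TASK_COMM_LEN_SKETCH 16u

/* One combined element: the COMM string plus its map_cmdline_to_pid
 * slot, as SAVED_CMDLINE_MAP_ELEMENT_SIZE() computes. */
#define ELEMENT_SIZE (TASK_COMM_LEN_SKETCH + sizeof(unsigned int))

/* Round a byte count up to a power-of-two number of pages, the way a
 * page allocator of order-N blocks behaves. */
static size_t round_to_pages(size_t bytes)
{
    size_t size = PAGE_SIZE_SKETCH;

    while (size < bytes)
        size <<= 1;
    return size;
}

/* How many elements fit after rounding, given a header of 'hdr' bytes
 * (stands in for sizeof(*s)).  This is the "Round up to actual
 * allocation" recomputation of val. */
static size_t usable_elements(size_t hdr, size_t requested)
{
    size_t size = round_to_pages(hdr + requested * ELEMENT_SIZE);

    return (size - hdr) / ELEMENT_SIZE;
}
```

The recomputation matters because the power-of-two rounding can leave a large tail; folding map_cmdline_to_pid into the same allocation consumes that tail instead of requiring a second kmalloc.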
[PATCH v4 0/3] tracing: Changes to saved_cmdlines
The first patch is to consolidate the map_cmdline_to_pid into the allocate page for the saved_cmdlines structure. The second patch removes some open coded saved_cmdlines logic in trace.c into a helper function to make it a cleaner move. The third patch simply moves code around. The goal is to keep all the saved_cmdlines logic in one place. Currently it's in two files (trace.c and trace_sched_switch.c). Since trace.c is awfully large, move all the code to trace_sched_switch.c, as that's its only purpose today anyway. Changes since v3: https://lore.kernel.org/linux-trace-kernel/20240216210047.584712...@goodmis.org/ - The map_cmdline_to_pid field was moved into the pages allocated of the structure and that replaced the kmalloc. But that field still had kfree() called on it in the freeing of the structure which caused a memory corruption. Steven Rostedt (Google) (3): tracing: Have saved_cmdlines arrays all in one allocation tracing: Move open coded processing of tgid_map into helper function tracing: Move saved_cmdline code into trace_sched_switch.c kernel/trace/trace.c | 509 + kernel/trace/trace.h | 10 + kernel/trace/trace_sched_switch.c | 515 ++ 3 files changed, 528 insertions(+), 506 deletions(-)
Re: Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
On Tue, Feb 20, 2024 at 09:52:39AM +, "Huang, Kai" wrote: > I am not sure, but is it possible or legal for an ancestor to have less limit > than children? Why not? It is desired for proper hierarchical delegation and the tightest limit of ancestors applies (cf. memory.max). Michal
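The semantics described here, that a node's effective limit is the minimum of its own limit and every ancestor's, can be sketched outside the kernel (hypothetical struct names, not the misc cgroup code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cgroup node with a per-node limit. */
struct node {
    struct node *parent;
    uint64_t max;   /* this node's own limit */
};

/* Walk toward the root keeping the tightest (smallest) limit seen,
 * mirroring how sgx_epc_cgroup_max_pages_to_root() folds ancestor
 * limits with min(). */
static uint64_t effective_limit(const struct node *n)
{
    uint64_t m = UINT64_MAX;

    for (; n; n = n->parent)
        m = m < n->max ? m : n->max;
    return m;
}
```

So a child configured with a larger limit than its parent is legal but inert: the parent's tighter limit is what actually applies, which is the memory.max behavior referenced above.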
[PATCH] arm64: dts: qcom: sdm632-fairphone-fp3: enable USB-C port handling
Add the definition for the USB-C connector found on this phone and hook up the relevant bits. This enables USB role switching. Signed-off-by: Luca Weiss --- arch/arm64/boot/dts/qcom/msm8953.dtsi | 14 ++ arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts | 31 +-- 2 files changed, 43 insertions(+), 2 deletions(-) diff --git a/arch/arm64/boot/dts/qcom/msm8953.dtsi b/arch/arm64/boot/dts/qcom/msm8953.dtsi index 383657407c6f..6726d15c38c3 100644 --- a/arch/arm64/boot/dts/qcom/msm8953.dtsi +++ b/arch/arm64/boot/dts/qcom/msm8953.dtsi @@ -1330,6 +1330,20 @@ usb3_dwc3: usb@700 { snps,hird-threshold = /bits/ 8 <0x00>; maximum-speed = "high-speed"; + + usb-role-switch; + + ports { + #address-cells = <1>; + #size-cells = <0>; + + port@0 { + reg = <0>; + + usb_dwc3_hs: endpoint { + }; + }; + }; }; }; diff --git a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts index c49a196189e3..2a65849f0da2 100644 --- a/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts +++ b/arch/arm64/boot/dts/qcom/sdm632-fairphone-fp3.dts @@ -162,6 +162,33 @@ led@3 { }; }; +&pmi632_typec { + status = "okay"; + + connector { + compatible = "usb-c-connector"; + + power-role = "dual"; + data-role = "dual"; + self-powered; + + typec-power-opmode = "default"; + pd-disable; + + port { + pmi632_hs_in: endpoint { + remote-endpoint = <&usb_dwc3_hs>; + }; + }; + }; +}; + +&pmi632_vbus { + regulator-min-microamp = <50>; + regulator-max-microamp = <100>; + status = "okay"; +}; + &sdhc_1 { status = "okay"; vmmc-supply = <&pm8953_l8>; @@ -286,8 +313,8 @@ &usb3 { status = "okay"; }; -&usb3_dwc3 { - dr_mode = "peripheral"; +&usb_dwc3_hs { + remote-endpoint = <&pmi632_hs_in>; }; --- base-commit: 103eb8e019aefd616735200ce46833bc74cfe132 change-id: 20240220-fp3-typec-25eb002db8b5 Best regards, -- Luca Weiss
[PATCH v19 3/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. This device memory aperture can be obtained from the host ACPI table using device_property_read_u64(), according to the FW specification. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping using remap_pfn_range() and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a minimum reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace.
The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region. The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising of region 2 and 3). The cached device memory aperture is assigned BAR region 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as BAR2 region, the cacheable as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and dropped writes on such access. So there is no need to support the disablement/enablement of BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

                  devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
|                                           |
usemem.memphys                       resmem.memphys

PCI BARs need to be aligned to a power-of-2, but the actual memory on the device may not. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means.
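The BAR-sizing note above, that PCI BARs must be a power of two while the real device memory need not be, amounts to rounding the reported size up and tolerating accesses in the padding. A minimal sketch of that rounding (hypothetical helper, not the driver code):

```c
#include <assert.h>
#include <stdint.h>

/* Round a memory length up to the next power of two, as a variant
 * driver must when advertising a fake BAR size for a region whose real
 * length is not power-of-two aligned. */
static uint64_t roundup_pow_of_two_u64(uint64_t len)
{
    uint64_t size = 1;

    while (size < len)
        size <<= 1;
    return size;
}

/* True if an offset falls in the padding between the end of the real
 * device memory and the advertised BAR end; per the cover letter, the
 * hardware returns ~0 on reads there and drops writes. */
static int in_bar_padding(uint64_t offset, uint64_t memlength)
{
    return offset >= memlength &&
           offset < roundup_pow_of_two_u64(memlength);
}
```

A 1.5G usemem region, for example, would be advertised as a 2G BAR, with the last 0.5G behaving as read-~0/write-dropped padding.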
The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock (MEMBLK_SIZE) aligned. This is a hardwired ABI value between the GPU FW and VFIO driver. The VM device driver make use of the same value for its calculation to determine USEMEM size. Currently there is no provision in KVM for a S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped pgprot_writecombine(), that sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [3] and [4], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the device memory properties are not present, the driver registers the vfio-pci-core function pointers. Since there are no ACPI memory properties generated for the VM, the variant driver inside the VM will only use the vfio-pci-core ops
[PATCH v19 2/3] vfio/pci: rename and export range_intersect_range
From: Ankit Agrawal range_intersect_range determines an overlap between two ranges. If there is an overlap, the helper function returns the overlapping offset and size. The VFIO PCI variant driver emulates the PCI config space BAR offset registers. These offsets may be accessed for read/write with a variety of lengths, including sub-word sizes from sub-word offsets. The driver makes use of this helper function to read/write the targeted part of the emulated register. Make this a vfio_pci_core function, rename and export as GPL. Also update references in the virtio driver. Reviewed-by: Kevin Tian Reviewed-by: Yishai Hadas Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_config.c | 42 + drivers/vfio/pci/virtio/main.c | 72 +++--- include/linux/vfio_pci_core.h | 5 +++ 3 files changed, 73 insertions(+), 46 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_config.c b/drivers/vfio/pci/vfio_pci_config.c index 672a1804af6a..971e32bc0bb4 100644 --- a/drivers/vfio/pci/vfio_pci_config.c +++ b/drivers/vfio/pci/vfio_pci_config.c @@ -1966,3 +1966,45 @@ ssize_t vfio_pci_config_rw(struct vfio_pci_core_device *vdev, char __user *buf, return done; } + +/** + * vfio_pci_core_range_intersect_range() - Determine overlap between a buffer + *and register offset ranges. + * @buf_start: start offset of the buffer + * @buf_cnt: number of buffer bytes + * @reg_start: start register offset + * @reg_cnt: number of register bytes + * @buf_offset:start offset of overlap in the buffer + * @intersect_count: number of overlapping bytes + * @register_offset: start offset of overlap in register + * + * Returns: true if there is overlap, false if not. + * The overlap start and size are returned through function args.
+ */ +bool vfio_pci_core_range_intersect_range(loff_t buf_start, size_t buf_cnt, +loff_t reg_start, size_t reg_cnt, +loff_t *buf_offset, +size_t *intersect_count, +size_t *register_offset) +{ + if (buf_start <= reg_start && + buf_start + buf_cnt > reg_start) { + *buf_offset = reg_start - buf_start; + *intersect_count = min_t(size_t, reg_cnt, +buf_start + buf_cnt - reg_start); + *register_offset = 0; + return true; + } + + if (buf_start > reg_start && + buf_start < reg_start + reg_cnt) { + *buf_offset = 0; + *intersect_count = min_t(size_t, buf_cnt, +reg_start + reg_cnt - buf_start); + *register_offset = buf_start - reg_start; + return true; + } + + return false; +} +EXPORT_SYMBOL_GPL(vfio_pci_core_range_intersect_range); diff --git a/drivers/vfio/pci/virtio/main.c b/drivers/vfio/pci/virtio/main.c index d5af683837d3..b5d3a8c5bbc9 100644 --- a/drivers/vfio/pci/virtio/main.c +++ b/drivers/vfio/pci/virtio/main.c @@ -132,33 +132,6 @@ virtiovf_pci_bar0_rw(struct virtiovf_pci_core_device *virtvdev, return ret ? 
ret : count; } -static bool range_intersect_range(loff_t range1_start, size_t count1, - loff_t range2_start, size_t count2, - loff_t *start_offset, - size_t *intersect_count, - size_t *register_offset) -{ - if (range1_start <= range2_start && - range1_start + count1 > range2_start) { - *start_offset = range2_start - range1_start; - *intersect_count = min_t(size_t, count2, -range1_start + count1 - range2_start); - *register_offset = 0; - return true; - } - - if (range1_start > range2_start && - range1_start < range2_start + count2) { - *start_offset = 0; - *intersect_count = min_t(size_t, count1, -range2_start + count2 - range1_start); - *register_offset = range1_start - range2_start; - return true; - } - - return false; -} - static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, char __user *buf, size_t count, loff_t *ppos) @@ -178,16 +151,18 @@ static ssize_t virtiovf_pci_read_config(struct vfio_device *core_vdev, if (ret < 0) return ret; - if (range_intersect_range(pos, count, PCI_DEVICE_ID, sizeof(val16), - _offset, _count, _offset)) { + if (vfio_pci_core_range_intersect_range(pos, count, PCI_DEVICE_ID, +
[PATCH v19 1/3] vfio/pci: rename and export do_io_rw()
From: Ankit Agrawal do_io_rw() is used to read/write to the device MMIO. The grace hopper VFIO PCI variant driver requires this functionality to read/write to its memory. Rename this as a vfio_pci_core function and export as GPL. Reviewed-by: Kevin Tian Reviewed-by: Yishai Hadas Signed-off-by: Ankit Agrawal --- drivers/vfio/pci/vfio_pci_rdwr.c | 16 +--- include/linux/vfio_pci_core.h| 5 - 2 files changed, 13 insertions(+), 8 deletions(-) diff --git a/drivers/vfio/pci/vfio_pci_rdwr.c b/drivers/vfio/pci/vfio_pci_rdwr.c index 07fea08ea8a2..03b8f7ada1ac 100644 --- a/drivers/vfio/pci/vfio_pci_rdwr.c +++ b/drivers/vfio/pci/vfio_pci_rdwr.c @@ -96,10 +96,10 @@ VFIO_IOREAD(32) * reads with -1. This is intended for handling MSI-X vector tables and * leftover space for ROM BARs. */ -static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, - void __iomem *io, char __user *buf, - loff_t off, size_t count, size_t x_start, - size_t x_end, bool iswrite) +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite) { ssize_t done = 0; int ret; @@ -201,6 +201,7 @@ static ssize_t do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, return done; } +EXPORT_SYMBOL_GPL(vfio_pci_core_do_io_rw); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar) { @@ -279,8 +280,8 @@ ssize_t vfio_pci_bar_rw(struct vfio_pci_core_device *vdev, char __user *buf, x_end = vdev->msix_offset + vdev->msix_size; } - done = do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, - count, x_start, x_end, iswrite); + done = vfio_pci_core_do_io_rw(vdev, res->flags & IORESOURCE_MEM, io, buf, pos, + count, x_start, x_end, iswrite); if (done >= 0) *ppos += done; @@ -348,7 +349,8 @@ ssize_t vfio_pci_vga_rw(struct vfio_pci_core_device *vdev, char __user *buf, * probing, so we don't currently worry about access in relation * to the memory enable bit
in the command register. */ - done = do_io_rw(vdev, false, iomem, buf, off, count, 0, 0, iswrite); + done = vfio_pci_core_do_io_rw(vdev, false, iomem, buf, off, count, + 0, 0, iswrite); vga_put(vdev->pdev, rsrc); diff --git a/include/linux/vfio_pci_core.h b/include/linux/vfio_pci_core.h index 85e84b92751b..cf9480a31f3e 100644 --- a/include/linux/vfio_pci_core.h +++ b/include/linux/vfio_pci_core.h @@ -130,7 +130,10 @@ void vfio_pci_core_finish_enable(struct vfio_pci_core_device *vdev); int vfio_pci_core_setup_barmap(struct vfio_pci_core_device *vdev, int bar); pci_ers_result_t vfio_pci_core_aer_err_detected(struct pci_dev *pdev, pci_channel_state_t state); - +ssize_t vfio_pci_core_do_io_rw(struct vfio_pci_core_device *vdev, bool test_mem, + void __iomem *io, char __user *buf, + loff_t off, size_t count, size_t x_start, + size_t x_end, bool iswrite); #define VFIO_IOWRITE_DECLATION(size) \ int vfio_pci_core_iowrite##size(struct vfio_pci_core_device *vdev, \ bool test_mem, u##size val, void __iomem *io); -- 2.34.1
[PATCH v19 0/3] vfio/nvgrace-gpu: Add vfio pci variant module for grace hopper
From: Ankit Agrawal NVIDIA's upcoming Grace Hopper Superchip provides a PCI-like device for the on-chip GPU that is the logical OS representation of the internal proprietary chip-to-chip cache coherent interconnect. The device is peculiar compared to a real PCI device in that whilst there is a real 64b PCI BAR1 (comprising region 2 & region 3) on the device, it is not used to access device memory once the faster chip-to-chip interconnect is initialized (occurs at the time of host system boot). The device memory is accessed instead using the chip-to-chip interconnect that is exposed as a contiguous physically addressable region on the host. Since the device memory is cache coherent with the CPU, it can be mmapped into the user VMA with a cacheable mapping and used like regular RAM. The device memory is not added to the host kernel, but mapped directly as this reduces memory wastage due to struct pages. There is also a requirement of a minimum reserved 1G uncached region (termed as resmem) to support the Multi-Instance GPU (MIG) feature [1]. This is to work around a HW defect. Based on [2], the requisite properties (uncached, unaligned access) can be achieved through a VM mapping (S1) of NORMAL_NC and host (S2) mapping with MemAttr[2:0]=0b101. To provide a different non-cached property to the reserved 1G region, it needs to be carved out from the device memory and mapped as a separate region in the Qemu VMA with pgprot_writecombine(). pgprot_writecombine() sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Provide a VFIO PCI variant driver that adapts the unique device memory representation into a more standard PCI representation facing userspace. The variant driver exposes these two regions - the non-cached reserved (resmem) and the cached rest of the device memory (termed as usemem) - as separate VFIO 64b BAR regions. This is divergent from the baremetal approach, where the device memory is exposed as a device memory region.
The decision for a different approach was taken in view of the fact that it would necessitate additional code in Qemu to discover and insert those regions in the VM IPA, along with the additional VM ACPI DSDT changes to communicate the device memory region IPA to the VM workloads. Moreover, this behavior would have to be added to a variety of emulators (beyond top of tree Qemu) out there desiring grace hopper support. Since the device implements 64-bit BAR0, the VFIO PCI variant driver maps the uncached carved out region to the next available PCI BAR (i.e. comprising of region 2 and 3). The cached device memory aperture is assigned BAR region 4 and 5. Qemu will then naturally generate a PCI device in the VM with the uncached aperture reported as BAR2 region, the cacheable as BAR4. The variant driver provides emulation for these fake BARs' PCI config space offset registers. The hardware ensures that the system does not crash when the memory is accessed with the memory enable turned off. It synthesizes ~0 reads and dropped writes on such access. So there is no need to support the disablement/enablement of BAR through the PCI_COMMAND config space register. The memory layout on the host looks like the following:

                  devmem (memlength)
|--------------------------------------------------|
|-------------cached------------------------|--NC--|
|                                           |
usemem.memphys                       resmem.memphys

PCI BARs need to be aligned to a power-of-2, but the actual memory on the device may not. A read or write access to the physical address from the last device PFN up to the next power-of-2 aligned physical address results in reading ~0 and dropped writes. Note that the GPU device driver [6] is capable of knowing the exact device memory size through separate means. The device memory size is primarily kept in the system ACPI tables for use by the VFIO PCI variant module. Note that the usemem memory is added by the VM Nvidia device driver [5] to the VM kernel as memblocks. Hence make the usable memory size memblock (MEMBLK_SIZE) aligned.
This is a hardwired ABI value between the GPU FW and VFIO driver. The VM device driver make use of the same value for its calculation to determine USEMEM size. Currently there is no provision in KVM for a S2 mapping with MemAttr[2:0]=0b101, but there is an ongoing effort to provide the same [3]. As previously mentioned, resmem is mapped pgprot_writecombine(), that sets the Qemu VMA page properties (pgprot) as NORMAL_NC. Using the proposed changes in [3] and [4], KVM marks the region with MemAttr[2:0]=0b101 in S2. If the device memory properties are not present, the driver registers the vfio-pci-core function pointers. Since there are no ACPI memory properties generated for the VM, the variant driver inside the VM will only use the vfio-pci-core ops and hence try to map the BARs as non cached. This is not a problem as the CPUs have FWB enabled which blocks the VM mapping's ability to override the
Re: [PATCH v9 08/15] x86/sgx: Implement EPC reclamation flows for cgroup
> +/* > + * Get the lower bound of limits of a cgroup and its ancestors. Used in > + * sgx_epc_cgroup_reclaim_work_func() to determine if EPC usage of a cgroup > is > + * over its limit or its ancestors' hence reclamation is needed. > + */ > +static inline u64 sgx_epc_cgroup_max_pages_to_root(struct sgx_epc_cgroup > *epc_cg) > +{ > + struct misc_cg *i = epc_cg->cg; > + u64 m = U64_MAX; > + > + while (i) { > + m = min(m, READ_ONCE(i->res[MISC_CG_RES_SGX_EPC].max)); > + i = misc_cg_parent(i); > + } > + > + return m / PAGE_SIZE; > +} I am not sure, but is it possible or legal for an ancestor to have less limit than children? > + > /** > - * sgx_epc_cgroup_try_charge() - try to charge cgroup for a single EPC page > + * sgx_epc_cgroup_lru_empty() - check if a cgroup tree has no pages on its > LRUs > + * @root:Root of the tree to check > * > + * Return: %true if all cgroups under the specified root have empty LRU > lists. > + * Used to avoid livelocks due to a cgroup having a non-zero charge count but > + * no pages on its LRUs, e.g. due to a dead enclave waiting to be released or > + * because all pages in the cgroup are unreclaimable. 
> + */ > +bool sgx_epc_cgroup_lru_empty(struct misc_cg *root) > +{ > + struct cgroup_subsys_state *css_root; > + struct cgroup_subsys_state *pos; > + struct sgx_epc_cgroup *epc_cg; > + bool ret = true; > + > + /* > + * Caller ensure css_root ref acquired > + */ > + css_root = &root->css; > + > + rcu_read_lock(); > + css_for_each_descendant_pre(pos, css_root) { > + if (!css_tryget(pos)) > + break; > + > + rcu_read_unlock(); > + > + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos)); > + > + spin_lock(&epc_cg->lru.lock); > + ret = list_empty(&epc_cg->lru.reclaimable); > + spin_unlock(&epc_cg->lru.lock); > + > + rcu_read_lock(); > + css_put(pos); > + if (!ret) > + break; > + } > + > + rcu_read_unlock(); > + > + return ret; > +} > + > +/** > + * sgx_epc_cgroup_reclaim_pages() - walk a cgroup tree and scan LRUs to > reclaim pages > + * @root:Root of the tree to start walking from. > + * Return: Number of pages reclaimed. Just wondering, do you need to return @cnt given this function is called w/o checking the return value? > + */ > +unsigned int sgx_epc_cgroup_reclaim_pages(struct misc_cg *root) > +{ > + /* > + * Attempting to reclaim only a few pages will often fail and is > + * inefficient, while reclaiming a huge number of pages can result in > + * soft lockups due to holding various locks for an extended duration. > + */ Not sure we need this comment, given it's already implied in sgx_reclaim_pages(). You cannot pass a value > SGX_NR_TO_SCAN anyway.
> + unsigned int nr_to_scan = SGX_NR_TO_SCAN; > + struct cgroup_subsys_state *css_root; > + struct cgroup_subsys_state *pos; > + struct sgx_epc_cgroup *epc_cg; > + unsigned int cnt; > + > + /* Caller ensure css_root ref acquired */ > + css_root = &root->css; > + > + cnt = 0; > + rcu_read_lock(); > + css_for_each_descendant_pre(pos, css_root) { > + if (!css_tryget(pos)) > + break; > + rcu_read_unlock(); > + > + epc_cg = sgx_epc_cgroup_from_misc_cg(css_misc(pos)); > + cnt += sgx_reclaim_pages(&epc_cg->lru, &nr_to_scan); > + > + rcu_read_lock(); > + css_put(pos); > + if (!nr_to_scan) > + break; > + } > + > + rcu_read_unlock(); > + return cnt; > +} Here the @nr_to_scan is reduced by the number of pages that are isolated, but not actually reclaimed (which is reflected by @cnt). IIUC, looks you want to make this function do "each cycle" as what you mentioned in the v8 [1]: I tested with that approach and found we can only target the number of pages attempted to reclaim, not the pages actually reclaimed, due to the uncertainty of how long it takes to reclaim pages. Besides, targeting the number of scanned pages for each cycle is also what the ksgxd does. If we target the actual number of pages, sometimes it just takes too long. I saw more timeouts with the default time limit when running parallel selftests. I am not sure what "sometimes it just takes too long" means, but what I am thinking is you are trying to do some perfect but yet complicated code here. For instance, I don't think the selftests reflect the real workload, and I believe adjusting the limit of a given EPC cgroup shouldn't be a frequent operation, thus it is acceptable to use easy-to-maintain but less perfect code. Here I still think having @nr_to_scan as a pointer is over-complicated. For example, we can still let sgx_reclaim_pages() always scan SGX_NR_TO_SCAN pages, but give up
Re: [PATCH v9 07/15] x86/sgx: Expose sgx_reclaim_pages() for cgroup
On Mon, 2024-02-05 at 13:06 -0800, Haitao Huang wrote: > From: Sean Christopherson > > Each EPC cgroup will have an LRU structure to track reclaimable EPC pages. > When a cgroup usage reaches its limit, the cgroup needs to reclaim pages > from its LRU or LRUs of its descendants to make room for any new > allocations. > > To prepare for reclamation per cgroup, expose the top level reclamation > function, sgx_reclaim_pages(), in header file for reuse. Add a parameter > to the function to pass in an LRU so cgroups can pass in different > tracking LRUs later. > [...] > Add another parameter for passing in the number of > pages to scan and make the function return the number of pages reclaimed > as a cgroup reclaimer may need to track reclamation progress from its > descendants, change number of pages to scan in subsequent calls. Firstly, sorry for late reply as I was away. From the changelog, it's understandable you want to make this function return pages that are actually reclaimed, and perhaps it's also OK to pass the number of pages to scan. But this doesn't explain why you need to make @nr_to_scan as a pointer, while you are returning the number of pages that are actually reclaimed? And ... [...] > -/* > - * Take a fixed number of pages from the head of the active page pool and > - * reclaim them to the enclave's private shmem files. Skip the pages, which > have > - * been accessed since the last scan. Move those pages to the tail of active > - * page pool so that the pages get scanned in LRU like fashion. > +/** > + * sgx_reclaim_pages() - Reclaim a fixed number of pages from an LRU > + * > + * Take a fixed number of pages from the head of a given LRU and reclaim > them to > + * the enclave's private shmem files. Skip the pages, which have been > accessed > + * since the last scan. Move those pages to the tail of the list so that the > + * pages get scanned in LRU like fashion. 
> * > * Batch process a chunk of pages (at the moment 16) in order to degrade amount ... there's no comment to explain such design either (@nr_to_scan as a pointer). Btw, with this change, seems "Take a fixed number of pages ..." and "at the moment 16" are not accurate any more. > * of IPI's and ETRACK's potentially required. sgx_encl_ewb() does degrade a > bit > @@ -298,8 +300,13 @@ static void sgx_reclaimer_write(struct sgx_epc_page > *epc_page, > * + EWB) but not sufficiently. Reclaiming one page at a time would also be > * problematic as it would increase the lock contention too much, which would > * halt forward progress. > + * > + * @lru: The LRU from which pages are reclaimed. > + * @nr_to_scan: Pointer to the target number of pages to scan, must be less > than > + * SGX_NR_TO_SCAN. > + * Return: Number of pages reclaimed. > */ > -static void sgx_reclaim_pages(void) > +unsigned int sgx_reclaim_pages(struct sgx_epc_lru_list *lru, unsigned int > *nr_to_scan) > { > struct sgx_epc_page *chunk[SGX_NR_TO_SCAN]; > struct sgx_backing backing[SGX_NR_TO_SCAN]; > @@ -310,10 +317,10 @@ static void sgx_reclaim_pages(void) > int ret; > int i; > > - spin_lock(&sgx_global_lru.lock); > - for (i = 0; i < SGX_NR_TO_SCAN; i++) { > - epc_page = list_first_entry_or_null(&sgx_global_lru.reclaimable, > - struct sgx_epc_page, list); > + spin_lock(&lru->lock); > + > + for (; *nr_to_scan > 0; --(*nr_to_scan)) { > + epc_page = list_first_entry_or_null(&lru->reclaimable, struct > sgx_epc_page, list); > if (!epc_page) > break; > > @@ -328,7 +335,8 @@ static void sgx_reclaim_pages(void) >*/ > epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; > } > - spin_unlock(&sgx_global_lru.lock); > + > + spin_unlock(&lru->lock); > > for (i = 0; i < cnt; i++) { > epc_page = chunk[i]; > @@ -351,9 +359,9 @@ static void sgx_reclaim_pages(void) > continue; > > skip: > - spin_lock(&sgx_global_lru.lock); > - list_add_tail(&epc_page->list, &sgx_global_lru.reclaimable); > - spin_unlock(&sgx_global_lru.lock); > + spin_lock(&lru->lock); > +
list_add_tail(&epc_page->list, &lru->reclaimable); > + spin_unlock(&lru->lock); > > kref_put(&epc_page->encl->refcount, sgx_encl_release); > > @@ -366,6 +374,7 @@ static void sgx_reclaim_pages(void) > sgx_reclaimer_block(epc_page); > } > > + ret = 0; > for (i = 0; i < cnt; i++) { > epc_page = chunk[i]; > if (!epc_page) > @@ -378,7 +387,10 @@ static void sgx_reclaim_pages(void) > epc_page->flags &= ~SGX_EPC_PAGE_RECLAIMER_TRACKED; > > sgx_free_epc_page(epc_page); > + ret++; > } > + > + return (unsigned int)ret; > } > Here basically the @nr_to_scan is reduced by the number of pages that are
Re: [PATCH] vduse: implement DMA sync callbacks
On Mon, Feb 19, 2024 at 06:06:06PM +0100, Maxime Coquelin wrote: > Since commit 295525e29a5b ("virtio_net: merge dma > operations when filling mergeable buffers"), VDUSE device > require support for DMA's .sync_single_for_cpu() operation > as the memory is non-coherent between the device and CPU > because of the use of a bounce buffer. > > This patch implements both .sync_single_for_cpu() and > sync_single_for_device() callbacks, and also skip bounce > buffer copies during DMA map and unmap operations if the > DMA_ATTR_SKIP_CPU_SYNC attribute is set to avoid extra > copies of the same buffer. vduse really needs to get out of implementing fake DMA operations for something that is not DMA.
Re: WARNING in shmem_release_dquot
On Mon, Feb 19, 2024 at 08:26:20PM -0800, Hugh Dickins wrote: > On Mon, 29 Jan 2024, Ubisectech Sirius wrote: > > > Hello. > > We are Ubisectech Sirius Team, the vulnerability lab of China ValiantSec. > > Recently, our team has discovered a issue in Linux kernel > > 6.8.0-rc1-gecb1b8288dc7. Attached to the email were a POC file of the issue. > > > > Stack dump: > > [ 246.195553][ T4096] [ cut here ] > > [ 246.196540][ T4096] quota id 16384 from dquot 888051bd3000, not in > > rb tree! > > [ 246.198829][ T4096] WARNING: CPU: 1 PID: 4096 at mm/shmem_quota.c:290 > > shmem_release_dquot (mm/shmem_quota.c:290 (discriminator 3)) > > [ 246.199955][ T4096] Modules linked in: > > [ 246.200435][ T4096] CPU: 1 PID: 4096 Comm: kworker/u6:6 Not tainted > > 6.8.0-rc1-gecb1b8288dc7 #21 > > [ 246.201566][ T4096] Hardware name: QEMU Standard PC (i440FX + PIIX, > > 1996), BIOS 1.15.0-1 04/01/2014 > > [ 246.202667][ T4096] Workqueue: events_unbound quota_release_workfn > > [ 246.203516][ T4096] RIP: 0010:shmem_release_dquot (mm/shmem_quota.c:290 > > (discriminator 3)) > > [ 246.204276][ T4096] Code: e8 28 d9 18 00 e9 b3 f8 ff ff e8 6e e1 c2 ff c6 > > 05 bf e8 1b 0d 01 90 48 c7 c7 80 f0 b8 8a 4c 89 ea 44 89 e6 e8 14 6d 89 ff > > 90 <0f> 0b 90 90 e9 18 fb ff ff e8 f5 d8 18 00 e9 a2 fa ff ff e8 0b d9 > > All code > > > >0: e8 28 d9 18 00 call 0x18d92d > >5: e9 b3 f8 ff ff jmp0xf8bd > >a: e8 6e e1 c2 ff call 0xffc2e17d > >f: c6 05 bf e8 1b 0d 01movb $0x1,0xd1be8bf(%rip)# > > 0xd1be8d5 > > 16: 90 nop > > 17: 48 c7 c7 80 f0 b8 8amov$0x8ab8f080,%rdi > > 1e: 4c 89 eamov%r13,%rdx > > 21: 44 89 e6mov%r12d,%esi > > 24: e8 14 6d 89 ff call 0xff896d3d > > 29: 90 nop > > 2a:* 0f 0b ud2 <-- trapping instruction > > 2c: 90 nop > > 2d: 90 nop > > 2e: e9 18 fb ff ff jmp0xfb4b > > 33: e8 f5 d8 18 00 call 0x18d92d > > 38: e9 a2 fa ff ff jmp0xfadf > > 3d: e8 .byte 0xe8 > > 3e: 0b d9 or %ecx,%ebx > > > > Code starting with the faulting instruction > > === > >0: 0f 0b ud2 > >2: 90 nop > >3: 90 nop > >4: e9 
18 fb ff ff jmp0xfb21 > >9: e8 f5 d8 18 00 call 0x18d903 > >e: e9 a2 fa ff ff jmp0xfab5 > > 13: e8 .byte 0xe8 > > 14: 0b d9 or %ecx,%ebx > > [ 246.206640][ T4096] RSP: 0018:c9000604fbc0 EFLAGS: 00010286 > > [ 246.207403][ T4096] RAX: RBX: RCX: > > 814c77da > > [ 246.208514][ T4096] RDX: 888049a58000 RSI: 814c77e7 RDI: > > 0001 > > [ 246.209429][ T4096] RBP: R08: 0001 R09: > > > > [ 246.210362][ T4096] R10: 0001 R11: 0001 R12: > > 4000 > > [ 246.211367][ T4096] R13: 888051bd3000 R14: dc00 R15: > > 888051bd3040 > > [ 246.212327][ T4096] FS: () > > GS:88807ec0() knlGS: > > [ 246.213387][ T4096] CS: 0010 DS: ES: CR0: 80050033 > > [ 246.214232][ T4096] CR2: 7ffee748ec80 CR3: 0cb78000 CR4: > > 00750ef0 > > [ 246.215216][ T4096] DR0: DR1: DR2: > > > > [ 246.216187][ T4096] DR3: DR6: fffe0ff0 DR7: > > 0400 > > [ 246.217148][ T4096] PKRU: 5554 > > [ 246.217615][ T4096] Call Trace: > > [ 246.218090][ T4096] > > [ 246.218467][ T4096] ? show_regs (arch/x86/kernel/dumpstack.c:479) > > [ 246.218979][ T4096] ? __warn (kernel/panic.c:677) > > [ 246.219505][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 > > (discriminator 3)) > > [ 246.220197][ T4096] ? report_bug (lib/bug.c:201 lib/bug.c:219) > > [ 246.220775][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 > > (discriminator 3)) > > [ 246.221500][ T4096] ? handle_bug (arch/x86/kernel/traps.c:238) > > [ 246.222081][ T4096] ? exc_invalid_op (arch/x86/kernel/traps.c:259 > > (discriminator 1)) > > [ 246.222687][ T4096] ? asm_exc_invalid_op > > (./arch/x86/include/asm/idtentry.h:568) > > [ 246.223296][ T4096] ? __warn_printk > > (./include/linux/context_tracking.h:155 kernel/panic.c:726) > > [ 246.223878][ T4096] ? __warn_printk (kernel/panic.c:717) > > [ 246.224460][ T4096] ? shmem_release_dquot (mm/shmem_quota.c:290 > > (discriminator 3)) > > [ 246.225125][ T4096] quota_release_workfn