Re: [PATCH v2 00/16] Multigenerational LRU Framework
> We fall back to the rmap when it's obviously not smart to do so. There
> is still a lot of room for improvement in this function though, i.e.,
> it should be per VMA and NUMA aware.

Okay, so it's more a question of tuning the crossover heuristic. That sounds much easier than replacing everything.

Of course long term it might be a problem to maintain too many different ways to do things, but I suppose short term it's a reasonable strategy.

-Andi
Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
> result, I think a module parameter will be no different from a kernel
> command line parameter in this regard.

You can still change it in /sys at runtime, even if it's in the vmlinux.

-Andi
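As an illustration of the point, here is a hedged kernel-style fragment (not buildable standalone; the parameter name is borrowed from the CNA discussion, and the sysfs path below is an assumption based on how built-in parameters are usually exposed):

```c
#include <linux/module.h>
#include <linux/moduleparam.h>

/*
 * Sketch: a module_param with 0644 permissions is writable at runtime
 * under /sys/module/<name>/parameters/numa_spinlock_threshold even when
 * the code is built into vmlinux, and it can still be set at boot via
 * the usual <name>.numa_spinlock_threshold=<n> command-line syntax.
 */
static int numa_spinlock_threshold = 10;
module_param(numa_spinlock_threshold, int, 0644);
```

So a separate kernel command-line parameter would be redundant with this.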
Re: [PATCH v2 00/16] Multigenerational LRU Framework
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.

The question is how much we still care about DMA32. If there are problems they can probably just turn on the IOMMU for these IO mappings.

-Andi
Re: [PATCH v2 00/16] Multigenerational LRU Framework
> 2) It will not scan PTE tables under non-leaf PMD entries that do not
>    have the accessed bit set, when CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.

This assumes that workloads have reasonable locality. Could there be a worst case where only one or two pages in each PTE table are used, so this PTE-skipping trick doesn't work?

-Andi
Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
> > ms granularity seems very coarse grained for this. Surely
> > at some point of spinning you can afford a ktime_get? But ok.
>
> We are reading time when we are at the head of the (main) queue, but
> don't have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.

Actually cpu_clock() / sched_clock() (see my other email). These should be fast without corner cases and also monotonic.

> > Could you turn that into a module_param which can be changed at runtime?
> > Would be strange to have to reboot just to play with this parameter
>
> Yes, good suggestion, thanks.
>
> > This would also make the code a lot shorter I guess.
>
> So you don't think we need the command-line parameter, just the module_param?

module_params can be changed on the command line too, so yes.

-Andi
Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
Andi Kleen writes:
> Alex Kogan writes:
> >
> > +numa_spinlock_threshold=[NUMA, PV_OPS]
> > +	Set the time threshold in milliseconds for the
> > +	number of intra-node lock hand-offs before the
> > +	NUMA-aware spinlock is forced to be passed to
> > +	a thread on another NUMA node. Valid values
> > +	are in the [1..100] range. Smaller values result
> > +	in a more fair, but less performant spinlock,
> > +	and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.

Actually, thinking about it more, using jiffies is likely broken anyway: if interrupts are disabled on the CPU that runs the main timer interrupt, jiffies won't increase. cpu_clock() (better than ktime_get()) or sched_clock() would work.

-Andi
Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA
Alex Kogan writes:
> + numa_spinlock_threshold=[NUMA, PV_OPS]
> +	Set the time threshold in milliseconds for the
> +	number of intra-node lock hand-offs before the
> +	NUMA-aware spinlock is forced to be passed to
> +	a thread on another NUMA node. Valid values
> +	are in the [1..100] range. Smaller values result
> +	in a more fair, but less performant spinlock,
> +	and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a module_param which can be changed at runtime? It would be strange to have to reboot just to play with this parameter.

This would also make the code a lot shorter, I guess.

-Andi
Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
> The reason why soft lockup happens may be the unmapped EPT pages. So, do we
> have a way to map all gpa before we use pebs on Skylake?

You could configure a VT-d device; that will implicitly pin all pages for the IOMMU. I *think* that should be enough for testing.

-Andi
Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally
> I have actually seen real user programs poke MSR_SYSCALL_MASK.

Hmm, what was the use case?

-Andi
Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally
Borislav Petkov writes:
> From: Borislav Petkov
> Date: Sat, 10 Apr 2021 14:08:13 +0200
>
> There are a bunch of MSRs which luserspace has no business poking at,
> whatsoever. Add a ban list and put the TSC-related MSRs in there. Issue
> a big juicy splat to catch offenders.

Have you ever seen any user programs actually write those MSRs? I don't see why they ever would; it's not that they have any motivation to do it (unlike SMM), and I don't know of any examples.

The whole MSR blocking seems more like a tilting-at-windmills type of effort. Root kits typically write from the kernel anyway. And the only results we have so far are various legitimate debug and benchmark utilities running much slower due to them flooding the kernel log with warnings.

I can see that there are security reasons to lock down MSRs, but that is already handled fine with the existing sandbox and lockdown mechanisms. But on a non-locked-down system, fully accessible MSRs are really useful for all kinds of debugging and tuning, and anything that prevents that is bad.

-Andi
Re: [PATCH v4 12/12] perf session: use reader functions to load perf data file
Except where I commented, for the series:

Acked-by: Andi Kleen

-Andi
Re: [PATCH v4 09/12] perf record: document parallel data streaming mode
> +--threads=::
> +Write collected trace data into several data files using parallel threads.
> + value can be user defined list of masks. Masks separated by colon
> +define cpus to be monitored by a thread and affinity mask of that thread
> +is separated by slash. For example user specification like the following:
> +/:/ specifies

You need to be more clear on the exact syntax of a mask. Ideally some full examples too.

> +parallel threads layout that consists of two threads with corresponding
> +assigned cpus to be monitored. value can also be a string meaning
> +predefined parallel threads layout:
> +cpu - create new data streaming thread for every monitored cpu
> +core - create new thread to monitor cpus grouped by a core
> +socket - create new thread to monitor cpus grouped by a socket
> +numa - create new thread to monitor cpus grouped by a numa domain
> +Predefined layouts can be used on systems with large number of cpus in
> +order not to spawn multiple per-cpu streaming threads but still avoid LOST
> +events in data directory files. Option specified with no or empty value
> +defaults to cpu layout. Masks defined or provided by the option value are
> +filtered through the mask provided by -C option.
Re: [PATCH v4 05/12] perf record: start threads in the beginning of trace streaming
> + err = write(thread->pipes.ack[1], , sizeof(msg));
> + if (err == -1)
> +	pr_err("threads[%d]: failed to notify on start. Error %m", thread->tid);

It might be safer to not use %m. I'm not sure if all the non-glibc libcs that people use support it.
Re: [PATCH v4 02/12] perf record: introduce thread specific data array
> + } else {
> +	thread_data[t].tid = syscall(SYS_gettid);

That always fills in the tid of the setup thread instead of the target threads?
Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
> Hmm, I forgot about that quirk. I would expect the TDX Module to inject a #GP
> for that case. I can't find anything in the spec that confirms or denies that,
> but injecting #VE would be weird and pointless.
>
> Andi/Sathya, the TDX Module spec should be updated to state that XSETBV will
> #GP at CPL!=0. If that's not already the behavior, the module should probably
> be changed...

I asked about this and the answer was that XSETBV behaves architecturally inside a TD (no #VE), thus there is nothing to document.

-Andi
Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
David Hildenbrand writes:
> I have no idea how expensive would be bouncing writes (and reads?)
> through the kernel. Did you ever experiment with that/evaluate that?

I would expect it to be quite expensive, as in virtio IO performance tanking.

-Andi
Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
On Wed, Apr 07, 2021 at 11:05:20AM +0800, Liuxiangdong (Aven, Cloud Infrastructure Service Product Dept.) wrote:
> On 2021/4/6 20:47, Andi Kleen wrote:
> > > AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> > > doesn't.
> > > But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> > > counter in Skylake.
> > > Is there anything else that only Icelake supports in this patches set?
> >
> > Only Icelake server has the support for recovering from a EPT violation
> > on the PEBS data structures. To use it on Skylake server you would
> > need to pin the whole guest, but that is currently not done.
>
> Sorry. Some questions about "Pin the whole guest". Do you mean VmPin equals
> VmSize in "/proc/$(pidof qemu-kvm)/status"? Or just VmLck equals VmSize? Or
> something else?

Either would be sufficient. All that matters is that the EPT pages don't ever get unmapped while PEBS is active.

-Andi
Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages
Christophe de Dinechin writes:
> Is there even a theoretical way to restore an encrypted page e.g. from (host)
> swap without breaking the integrity check? Or will that only be possible with
> assistance from within the encrypted enclave?

Only the latter. You would need ballooning. It's in principle possible, but currently not implemented.

In general, host swap without ballooning is usually a bad idea anyway because it often just swaps a lot of cache data that could easily be thrown away instead.

-andi
Re: [PATCH] perf report: Fix wrong LBR block sorting
> Now the hottest block is reported at the top of output.
>
> Fixes: b65a7d372b1a ("perf hist: Support block formats with compare/sort/display")
> Signed-off-by: Jin Yao

Reviewed-by: Andi Kleen

-Andi
Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers
> AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> doesn't.
> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> counter in Skylake.
> Is there anything else that only Icelake supports in this patches set?

Only Icelake server has the support for recovering from an EPT violation on the PEBS data structures. To use it on Skylake server you would need to pin the whole guest, but that is currently not done.

> Besides, we have tried this patches set in Icelake. We can use pebs (eg:
> "perf record -e cycles:pp") when guest is kernel-5.11, but can't when
> kernel-4.18. Is there a minimum guest kernel version requirement?

You would need a guest kernel that supports Icelake server PEBS. 4.18 would need backports for that.

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
On Sat, Apr 03, 2021 at 09:26:24AM -0700, Dave Hansen wrote:
> On 4/2/21 2:32 PM, Andi Kleen wrote:
> > > If we go this route, what are the rules and restrictions? Do we have to
> > > say "no MMIO in #VE"?
> >
> > All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments"
> > After that it can nest without problems.
>
> Well, not exactly. You still can't do things that could cause an
> unbounded recursive #VE.
>
> It doesn't seem *that* far fetched to think that someone might try to
> defer some work or dump data to the console.

I believe the main console code has reentry protection. I'm not sure about early_printk (with keep), but if that's the case it probably should be fixed anyway. I can take a look at that.

Not sure why deferring something would cause another #VE?

> > If you nest before that the TDX will cause a triple fault.
> >
> > The code that cannot do it is a few lines in the early handler which
> > runs with interrupts off.
>
> > > Which brings up another related point: How do you debug TD guests? Does
> > > earlyprintk work?
> >
> > Today it works actually because serial ports are allowed. But I expect it to
> > be closed eventually because serial code is a lot of code to audit.
> > But you can always disable the filtering with a command line option and
> > then it will always work for debugging.
>
> Do we need a TDX-specific earlyprintk? I would imagine it's pretty easy
> to implement.

Don't see a need at this point, the existing mechanisms work. Maybe if we ever have a problem that only happens in lockdown *and* happens early, but that's not very likely since lockdown primarily changes code behavior later.

There are also other debug mechanisms for such cases: in TDX, a TD configured for debug mode supports using the gdb stub on the hypervisor.

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
> If we go this route, what are the rules and restrictions? Do we have to
> say "no MMIO in #VE"?

All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments". After that it can nest without problems.

If you nest before that, the TDX module will cause a triple fault.

The code that cannot do it is a few lines in the early handler which runs with interrupts off. The TDX module also makes sure to not inject NMIs while we're in that region, so NMIs are of no concern. That was the whole point of avoiding the system call gap problem. We don't need to make it IST, so it can nest.

I'm not aware of any other special rules.

> Which brings up another related point: How do you debug TD guests? Does
> earlyprintk work?

Today it works actually because serial ports are allowed. But I expect it to be closed eventually because serial code is a lot of code to audit. But you can always disable the filtering with a command line option and then it will always work for debugging.

-Andi
Re: [PATCH v2] kconfig: config script: add a little user help
On Sat, Dec 19, 2020 at 09:08:05AM -0800, Randy Dunlap wrote:
> Give the user a clue about the problem along with the 35 lines of
> usage/help text.

Reviewed-by: Andi Kleen

-Andi
Re: [RFC v1 00/26] Add TDX Guest Support
> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allow-listing patches are already in the repository, but not currently posted or complete:

https://github.com/intel/tdx/commits/guest

> What are we talking about specifically? Which drivers? How many
> approximately? Just virtio?

Right now just virtio, later other drivers that hypervisors need.

> Are there any "real" hardware drivers
> involved like how QEMU emulates an e1000 or rtl8139 device?

Not currently (but some later hypervisor might rely on one of those).

> What about the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says, the concern about MMIO is less the drivers (which should be generally ok if they work on other architectures that require MMIO magic), but other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers into some complicated paravirt construct, blowing up code size everywhere and adding complicated self-modifying code, when it's only needed for very few drivers. But we also don't want to patch every MMIO to be special-cased, even in those few drivers.

#VE-based MMIO avoids all that cleanly while being nicely non-intrusive.

-Andi
Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On Wed, Mar 31, 2021 at 08:46:18PM -0700, Dave Hansen wrote:
> On 3/31/21 8:28 PM, Andi Kleen wrote:
> > > The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> > > what is supported: CPUID. If it screws up, and the guest gets an
> > > unexpected #VE, so be it.
> >
> > The main reason for disabling stuff is actually that we don't need
> > to harden it. All these things are potential attack paths.
>
> Wait, MWAIT is an attack path? If it were an attack path, wouldn't it

No, MWAIT is not, but lots of other things that can be controlled by the host are. And that will be a motivation to disable things.

> > > We don't have all kinds of crazy handling in the kernel's #UD handler
> > > just in case a CPU mis-enumerates a feature and we get a #UD. We have
> > > to trust the underlying hardware to be sane. If it isn't, we die a
> > > horrible death as fast as possible. Why should TDX be any different?
> >
> > That's what the original patch did -- no unnecessary checks -- but reviewers
> > keep asking for the extra checks, so Sathya added more. We have the not
> > unusual problem here that reviewers don't agree among themselves.
>
> Getting consensus is a pain in the neck, eh?

It seems more like a circular argument currently.

> It's too bad all the reviewers in the community aren't like all of the
> engineers at big companies where everyone always agrees. :)

I would propose to go back to the original patch without all the extra checks. I think that's what you're arguing for too. IIRC the person who originally requested extra checks was Andy; if he's ok with that too, we can do it, so that you guys can finally move on to the other patches that actually do more than just trivial things.

-Andi
Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> what is supported: CPUID. If it screws up, and the guest gets an
> unexpected #VE, so be it.

The main reason for disabling stuff is actually that we don't need to harden it. All these things are potential attack paths.

> We don't have all kinds of crazy handling in the kernel's #UD handler
> just in case a CPU mis-enumerates a feature and we get a #UD. We have
> to trust the underlying hardware to be sane. If it isn't, we die a
> horrible death as fast as possible. Why should TDX be any different?

That's what the original patch did -- no unnecessary checks -- but reviewers keep asking for the extra checks, so Sathya added more. We have the not unusual problem here that reviewers don't agree among themselves.

-Andi
Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
On Tue, Mar 30, 2021 at 12:56:41PM +0800, Xiaoyao Li wrote:
> On 3/27/2021 8:18 AM, Kuppuswamy Sathyanarayanan wrote:
> > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > are not supported. So handle #VE due to these instructions as no ops.
> >
> > Signed-off-by: Kuppuswamy Sathyanarayanan
> > Reviewed-by: Andi Kleen
> > ---
> > Changes since previous series:
> >  * Suppressed MWAIT feature as per Andi's comment.
> >  * Added warning debug log for MWAIT #VE exception.
> >
> >  arch/x86/kernel/tdx.c | 23 +++
> >  1 file changed, 23 insertions(+)
> >
> > diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> > index e936b2f88bf6..fb7d22b846fc 100644
> > --- a/arch/x86/kernel/tdx.c
> > +++ b/arch/x86/kernel/tdx.c
> > @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
> >  	setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > +	/* MWAIT is not supported in TDX platform, so suppress it */
> > +	setup_clear_cpu_cap(X86_FEATURE_MWAIT);
>
> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
> is enforced by SEAM module.

Good point.

> Do we still need to safeguard it by setup_clear_cpu_cap() here?

I guess it doesn't hurt to do it explicitly.

-Andi
Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD
> > No, if these instructions take a #VE then they were executed at CPL=0.
> > MONITOR and MWAIT will #UD without VM-Exit->#VE. Same for WBINVD,
> > s/#UD/#GP.
>
> Dare I ask about XSETBV?

XGETBV does not cause a #VE, it just works normally. The guest has full AVX capabilities.

-Andi
Re: [RESEND PATCH v2] perf stat: improve readability of shadow stats
> > But is this a real problem?
>
> perhaps not, Andi, any idea about this?

It's not a problem for my tools, which don't use the unit, but I could imagine one for other parsers. I would recommend not changing it for CSV, which is expected to be parsed by tools.

-Andi
Re: [RFC PATCH v3 00/18] dynamic debug diet plan
Jim Cromie writes:
> CONFIG_DYNAMIC_DEBUG creates a struct _ddebug (56 bytes) for each
> callsite, which includes 3 pointers to: module, filename, function.
> These are repetitive, and compressible; this patch series goes about
> doing that, it:

So how much memory does it actually save?

-Andi
Re: [PATCH v2 11/27] perf parse-events: Support hardware events inside PMU
> While we're discussing, do we really want to use the "core" and "atom"
> terms here? I thought cpu/cycles/ would be ok for the main (Big) CPU and

Yes, absolutely.

> that we should come up with some short name for the "little" CPUs.

There actually isn't a main CPU. There's nothing "better" about the big cores vs the Atoms anyway. They're all important CPUs. And the system might have no "big" CPUs, but we won't know until we have finished onlining all CPUs. Or on Lakefield there are four Atoms and only a single big core. So with a non-hybrid-aware profiling tool you would miss most of the system if we used cpu// for the big core.

Also I think renaming is a good idea because it forces the software or configuration to handle hybrid. Otherwise you just get subtle breakage all the time, with some CPUs not getting profiled. It's a similar strategy to what we do in the source code when semantics change. ARM did this right.

-Andi
Re: [PATCH v2] perf stat: Align CSV output for summary mode
> If you care about not breaking existing scripts, then the output they
> get with what they use as command line options must continue to produce
> the same output.

It's not clear there are any useful ones (except for tools that handle both). It's really hard to parse the previous mess. It's simply not valid CSV. That's why I'm arguing that keeping compatibility is not useful here. We would be stuck with the broken mess as default forever.

-Andi
Re: [PATCH] perf stat: Align CSV output for summary mode
> Is it serious or just a joke? :)

I would prefer to not be compatible (at least not until someone complains), but if compatibility is required then yes, opting in to the broken format would be better. Perhaps not with that name. And the option could be hidden in the perf config file instead of being on the command line.

-Andi
Re: [PATCH] perf stat: Align CSV output for summary mode
On Tue, Mar 16, 2021 at 04:05:13PM -0300, Arnaldo Carvalho de Melo wrote:
> Em Tue, Mar 16, 2021 at 09:34:21AM -0700, Andi Kleen escreveu:
> > > looks ok, but maybe make the option more related to CSV, like:
> > >
> > > --x-summary, --csv-summary ...?
> >
> > Actually I don't think it should be a new option. I doubt
> > anyone could parse the previous mess. So just make it default
> > with -x
>
> In these cases I always fear that people are already parsing that mess
> by considering the summary lines to be the ones not starting with
> spaces, and now we go on and change it to be "better" by prefixing it
> with "summary" and... break existing scripts.

I think it was just one version or so? FWIW, perf has broken CSV output several times; I added workarounds to toplev every time. Having a broken version for a short time shouldn't be too bad. I actually had a workaround for this one, but it can parse either way.

> Can we do this with a new option?
> I.e. like --csv-summary?

If you do it, I would add an option for the old broken format: --i-want-broken-csv. But don't require the option forever just to get sane output. Or maybe only a perf config option.

-Andi
Re: [PATCH] perf stat: Align CSV output for summary mode
> looks ok, but maybe make the option more related to CSV, like:
>
> --x-summary, --csv-summary ...?

Actually I don't think it should be a new option. I doubt anyone could parse the previous mess. So just make it the default with -x.

-Andi
Re: [PATCH v4 00/25] Page folios
Michal Hocko writes:
> bikeshedding) because it hasn't really resonated with the underlying
> concept. Maybe just me as a non native speaker... page_head would have
> been so much more straightforward but not something I really care
> about.

Yes. page_head explains exactly what it is. But Folio? Huh?

-Andi
Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF
> I still think that there is value in taking those measurements after we
> enable the counters, as, for instance, for interval mode we want
> measurements with the counters enabled, whatever happens before the
> counters are enabled is just startup time, etc. Jiri, Andi?

Yes, I agree. Only the time with counters enabled matters.

-Andi
Re: [PATCH v1 10/14] mm: multigenerational lru: core
Yu Zhao writes:
> +
> +#ifdef CONFIG_MEMCG
> +	if (memcg && atomic_read(&memcg->moving_account))
> +		goto contended;
> +#endif
> +	if (!mmap_read_trylock(mm))
> +		goto contended;

These are essentially spinloops. Surely you need a cpu_relax() somewhere?

In general, for all spinloop-like constructs it would be useful to consider how to teach lockdep about them.

> +	do {
> +		old_flags = READ_ONCE(page->flags);
> +		new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> +		VM_BUG_ON_PAGE(new_gen < 0, page);
> +		if (new_gen >= 0 && new_gen != old_gen)
> +			goto sort;
> +
> +		new_gen = (old_gen + 1) % MAX_NR_GENS;
> +		new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
> +		/* mark page for reclaim if pending writeback */
> +		if (front)
> +			new_flags |= BIT(PG_reclaim);
> +	} while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);

I see this cmpxchg-on-flags pattern a lot. Could there be some common code factoring?

-Andi
Re: [PATCH] perf stat: Create '--add-default' option to append default list
> A more concise syntax would be -e +event
>
> The + would apply to the whole event list if there are multiple.
>
> and maybe -event too to remove something from the default list.

Sorry, that was an old email. Please ignore.

-Andi
Re: [PATCH] perf stat: Create '--add-default' option to append default list
On Tue, Dec 22, 2020 at 09:11:31AM +0800, Jin Yao wrote:
> The event default list includes the most common events which are widely
> used by users. But with -e option, the current perf only counts the events
> assigned by -e option. Users may want to collect some extra events with
> the default list. For this case, users have to manually add all the events
> from the default list. It's inconvenient. Also, users may not know how to
> get the default list.
>
> It's better to add a new option to append the default list to the -e events.
> The new option is '--add-default'.

A more concise syntax would be -e +event

The + would apply to the whole event list if there are multiple.

And maybe -event too, to remove something from the default list.

-Andi
Re: [PATCH 0/3] perf tools: Minor improvements in event synthesis
On Mon, Dec 21, 2020 at 04:00:26PM +0900, Namhyung Kim wrote:
> Hello,
>
> This is to optimize the event synthesis during perf record.
>
> The first patch is to reduce memory usage when many threads are used.
> The second is to avoid unnecessary syscalls for kernel threads. And
> the last one is to reduce the number of threads to iterate when new
> threads are being created at the same time.
>
> Unfortunately there's no dramatic improvement here but I can see ~5%
> gain in the 'perf bench internals synthesize' on a big machine.
> (The numbers are not stable though)

Looks all good to me. The VmPeak assumption might be slightly fragile, but I guess there's nothing better currently.

Reviewed-by: Andi Kleen

-Andi
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
> If you want a uarch name, that might be the name of the core
> ( something-cove ... I can't keep track of the names of all the

GoldenCove for Sapphire Rapids/AlderLake. But keep in mind that they're still different. Kind of like a different distro with different patches and configuration from the same base kernel release.

-Andi
Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support
On Thu, Mar 11, 2021 at 08:58:32PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 11:53:35AM -0500, Liang, Kan wrote:
> > > > The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
> > > > PMEM.
> > > > The "cpu_atom" PMU is similar to Tremont, but with different
> > > > event_constraints, extra_regs and number of counters.
> > >
> > > So do these things use the same event lists as SPR and TNT?
> >
> > No, there will be two new event lists on ADL. One is for Atom core, and the
> > other is for big core. They are different to SPR and TNT.
>
> *sigh* how different?

The Atom and big core event lists have significant differences. Often event lists of the same core in different SOCs are different too, because some events indicate stuff outside the core (e.g. Offcore Response and others).

> > > Is there any
> > > way to discover that, because AFAICT /proc/cpuinfo will say every CPU
> > > is 'Alderlake', and the above also doesn't give any clue.
> >
> > Ricardo once submitted a patch to expose the CPU type under
> > /sys/devices/system/cpu, but I don't know the latest status.
> > https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/
>
> Yeah, but that was useless, it doesn't list the Cores as
> FAM6_SAPPHIRERAPIDS nor the Atom as FAM6_ATOM_TREMONT.

It's Gracemont, not Tremont.

But what would you do with the information that the core is related to some other core? The event lists are tied to the Alderlake model number. You cannot just use event lists from some other part because there are differences to other Golden Cove or Gracemont implementations.

For non-event-list usages, the model numbers also indicate a lot of things in the SOC, so even if you knew it was a somewhat similar core to Sapphire Rapids, it wouldn't tell the complete story. Even on the cores there are differences.

In the end you need to know that Alderlake is Alderlake.

-Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
> When a brute force attack is detected through the fork or execve system call,
> all the tasks involved in the attack will be killed with the exception of the
> init task (task with pid equal to zero). Now, and only if the init task is
> involved in the attack, block the fork system call from the init process
> during a user defined time (using a sysctl attribute). This way the brute
> force attack is mitigated and the system does not panic.

That means nobody can log in and fix the system during that time.

It would be better to have that policy in init. Perhaps add some way that someone doing wait*() can know the exit was due to this mitigation (and not something else). Then they could disable respawning of that daemon.

-Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
Thanks.

Okay, but that means that the brute force attack can just continue, because the attacked daemon will be respawned? You need some way to stop the respawning, otherwise the mitigation doesn't work for daemons.

-Andi
Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs
> AFAICT we could register them all here. That instantly fixes that
> CPU_STARTING / CPU_DEAD fail elsewhere in this patch.

This would mean a system that only has Atoms or only has big cores would still show the other CPU's PMU. We expect those to exist.

-Andi
[PATCH] perf/x86/kvm: Fix inverted pebs_no_isolation check
The pebs_no_isolation optimization check is inverted. We want to disable
PEBS isolation when the microcode is at least the revision in the table,
not for older microcode. So remove the extra !.

It meant that the new microcodes did unnecessary work, and the old
microcodes could crash the PEBS. Presumably most people are already
running newer microcode, so the functional impact is small. But it
should speed up the newer systems by the 2-4% claimed in the original
patch.

Cc: jmatt...@google.com
Fixes: 9b545c04abd4 ("perf/x86/kvm: Avoid unnecessary work ...")
Signed-off-by: Andi Kleen
---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5bac48d5c18e..d74deadd3a6d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4532,7 +4532,7 @@ static const struct x86_cpu_desc isolation_ucodes[] = {
 static void intel_check_pebs_isolation(void)
 {
-	x86_pmu.pebs_no_isolation = !x86_cpu_has_min_microcode_rev(isolation_ucodes);
+	x86_pmu.pebs_no_isolation = x86_cpu_has_min_microcode_rev(isolation_ucodes);
 }
 
 static __init void intel_pebs_isolation_quirk(void)
-- 
2.25.4
Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing
> The disk encryption is just one example and there might be others which > we might not be aware of yet and we are not suspecting there is something > wrong with the crypto code that needs to be fixed. Then you don't have any leaks relating to branch tracing. > restrict an external(in the sense that its not related to crypto or any > other security related component) entity such as hardware assisted tracing > like ARM coresight and so on. I don't see why or how the crypto code needs > to be fixed for something that is not related to it although it is affected. It's just a general property that if some code that is handling secrets is data dependent it already leaks. > The analogy would be like of the victims and a perpetrator. Lets take > coresight > as an example for perpetrator and crypto as the victim here. Now we can try There's no victim with branch tracing, unless it is already leaky. > If we just know one victim (lets say crypto code here), what happens to the > others which we haven't identified yet? Do we just wait for someone to write > an exploit based on this and then scramble to fix it? For a useful security mitigation you need a threat model first I would say. So you need to have at least some idea how an attack with branch tracing would work. > Initial change was to restrict this only to HW assisted instruction tracing > [1] I don't think it's needed for instruction tracing. -Andi
Re: [PATCH] perf auxtrace: Fix auxtrace queue conflict
On Mon, Mar 08, 2021 at 05:11:43PM +0200, Adrian Hunter wrote: > The only requirement of an auxtrace queue is that the buffers are in > time order. That is achieved by making separate queues for separate > perf buffer or AUX area buffer mmaps. > > That generally means a separate queue per cpu for per-cpu contexts, > and a separate queue per thread for per-task contexts. > > When buffers are added to a queue, perf checks that the buffer cpu > and thread id (tid) match the queue cpu and thread id. > > However, generally, that need not be true, and perf will queue > buffers correctly anyway, so the check is not needed. > > In addition, the check gets erroneously hit when using sample mode > to trace multiple threads. > > Consequently, fix that case by removing the check. Thanks! Reviewed-by: Andi Kleen -Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
On Sun, Mar 07, 2021 at 07:05:41PM +0100, John Wood wrote: > On Sun, Mar 07, 2021 at 09:25:40AM -0800, Andi Kleen wrote: > > > processes created from it will be killed. If the systemd restart the > > > network > > > daemon and it will crash again, then the systemd will be killed. I think > > > this > > > way the attack is fully mitigated. > > > > Wouldn't that panic the system? Killing init is usually a panic. > > The mitigation acts only over the process that crashes (network daemon) and > the > process that exec() it (systemd). This mitigation don't go up in the processes > tree until reach the init process.

Most daemons have some supervisor that respawns them when they crash (maybe read up on "supervisor trees" if you haven't, it's a standard concept). That's usually (but not always) init, as in systemd. There might be something in between it and init, but likely init would respawn that something in between. One of the main tasks of init is to respawn things under it.

If you have a supervisor tree starting from init the kill should eventually travel up to init. At least that's the theory. Do you have some experiments that show this doesn't happen?

> > Note: I am a kernel newbie and I don't know if the systemd is init. Sorry if > it > is a stupid question. AFAIK systemd is not the init process (the first process > that is executed) but I am not sure.

At least the part of systemd that respawns is often (but not always) init.

> > > > > > > Or if it's a interactive login you log in again. > > > > > > First the login will be killed (if it fails with a fatal signal) and if > > > it is > > > restarted, the process that exec() it again will be killed. In this case > > > I think > > > that the threat is also completely mitigated. > > > > Okay so sshd will be killed. And if it gets restarted eventually init, > > so panic again. > > In this scenario the process that exec() the login will be killed (sshd > process). But I think that sshd is not the init process.
So no panic. sshd would be respawned by the supervisor, which is likely init.

> > That's a fairly drastic consequence because even without panic > > it means nobody can fix the system anymore without a console. > > So, you suggest that the mitigation method for the brute force attack through > the execve system call should be different (not kill the process that exec). > Any suggestions would be welcome to improve this feature.

If the system is part of some cluster, then panicking on attack or failure could be a reasonable reaction. Some other system in the cluster should take over. There's also a risk that all the systems get taken out quickly one by one; in this case you might still need something like the below. But it's something that would need to be very carefully considered for the environment.

The other case is when there isn't some fallback, as in a standalone machine. It could be only used when the supervisor daemons are aware of it. Often they already have respawn limits, but you would need to make sure they trigger before your algorithm triggers. Or maybe some way to opt out per process. Then the DoS would be only against that process, but not everything on the machine.

So I think it needs more work on the user space side for most usages.

-Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
> processes created from it will be killed. If the systemd restart the network > daemon and it will crash again, then the systemd will be killed. I think this > way the attack is fully mitigated. Wouldn't that panic the system? Killing init is usually a panic. > > Or if it's a interactive login you log in again. > > First the login will be killed (if it fails with a fatal signal) and if it is > restarted, the process that exec() it again will be killed. In this case I > think > that the threat is also completely mitigated. Okay so sshd will be killed. And if it gets restarted eventually init, so panic again. That's a fairly drastic consequence because even without panic it means nobody can fix the system anymore without a console. So probably the mitigation means that most such attacks eventually lead to a panic because they will reach init sooner or later. Another somewhat worrying case is some bug that kills KVM guests. So if the bug can be triggered frequently you can kill all the virtualization management infrastructure. I don't remember seeing a discussion of such drastic consequences in your description. It might be ok depending on the use case, but people certainly need to be aware of it. It's probably not something you want to have enabled by default ever. -Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
Sorry for the late answer. I somehow missed your email earlier. > As a mitigation method, all the offending tasks involved in the attack are > killed. Or in other words, all the tasks that share the same statistics > (statistics showing a fast crash rate) are killed. So systemd will just restart the network daemon and then the attack works again? Or if it's a interactive login you log in again. I think it might be useful even with these limitations, but it would be good to spell out the limitations of the method more clearly. I suspect to be useful it'll likely need some user space configuration changes too. -Andi
Re: [PATCH] perf pmu: Validate raw event with sysfs exported format bits
> Single event: > > # ./perf stat -e cpu/r031234/ -a -- sleep 1 > WARNING: event config '31234' not valid (bits 16 17 not supported by > kernel)! Just noticed that again. Can you please print the original event as string in the message? While it's obvious with rXXX which one it is, it may not be obvious with offsetted fields (like umask=0x121212), and hard to find in a long command line. -Andi
Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing
Andi Kleen writes: > > Normally disk encryption is in specialized work queues. It's total > overkill to restrict all of the kernel if you just want to restrict > those work queues. > > I would suggest some more analysis where secrets are actually stored > and handled first. Also thinking about this more: You really only want to limit data tracing here. If tracing branches could reveal secrets the crypto code would be already insecure due to timing side channels. If that's the case it would already require fixing the crypto code. -Andi
Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing
Sai Prakash Ranjan writes: > > "Consider a system where disk contents are encrypted and the encryption > key is set up by the user when mounting the file system. From that point > on the encryption key resides in the kernel. It seems reasonable to > expect that the disk encryption key be protected from exfiltration even > if the system later suffers a root compromise (or even against insiders > that have root access), at least as long as the attacker doesn't > manage to compromise the kernel." Normally disk encryption is in specialized work queues. It's total overkill to restrict all of the kernel if you just want to restrict those work queues. I would suggest some more analysis where secrets are actually stored and handled first. -Andi
Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM
John Wood writes: > + > +To detect a brute force attack it is necessary that the statistics shared by > all > +the fork hierarchy processes be updated in every fatal crash and the most > +important data to update is the application crash period.

So I haven't really followed the discussion and also not completely read the patches (so apologies if that was already explained or is documented somewhere else). But what I'm missing here is some indication of how much memory these statistics can use up and how they are limited. How much is the worst case extra memory consumption?

If there is no limit how is DoS prevented?

If there is a limit, there likely needs to be a way to throw out information, and so the attack would just shift to forcing the kernel to throw out this information before retrying. E.g. if the data is held for the parent shell: restart the parent shell all the time. E.g. if the data is held for the sshd daemon used to log in: somehow cause sshd to respawn to discard the statistics.

Do I miss something here? How is that mitigated?

Instead of discussing all the low level tedious details of the statistics it would be better to focus on these "high level" problems here.

-Andi
Re: [PATCH] perf vendor events: Remove 'type' field from mapfiles
> I assumed that it was this file: > https://download.01.org/perfmon/mapfile.csv Yep. > > which already has a different format - extra columns - so figured that some > handcrafting was already going on. Indeed 'type' is not always 'core' there. Yes. Could possibly add the extra columns from there just to stay compatible, but being a subset is ok too. -Andi
Re: [PATCH] perf vendor events: Remove 'type' field from mapfiles
On Fri, Feb 26, 2021 at 09:08:17PM +0800, John Garry wrote: > Included for each CPU entry in the per-arch mapfile.csv file is a 'type' > field.

I don't like it because this will make the mapfile incompatible with what download.01.org uses. We originally derived it from that.

It would be ok if it fixed something serious, but as it seems to be more cosmetic I don't think there's a good reason to deviate.

-Andi
Re: [PATCH 00/11] perf intel-pt: Add limited support for tracing guest kernels
On Thu, Feb 18, 2021 at 11:57:50AM +0200, Adrian Hunter wrote: > Hi > > Currently, only kernel tracing is supported and only with "timeless" decoding > i.e. no TSC timestamps Patches look good to me. That will be quite useful. Acked-by: Andi Kleen Thanks, -Andi
Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote: > On 2/12/21 1:47 PM, Andy Lutomirski wrote: > >> What about adding a property to the TD, e.g. via a flag set during TD > >> creation, > >> that controls whether unaccepted accesses cause #VE or are, for all > >> intents and > >> purposes, fatal? That would allow Linux to pursue treating EPT #VEs for > >> private > >> GPAs as fatal, but would give us a safety and not prevent others from > >> utilizing > >> #VEs. > > That seems reasonable. > > Ditto. > > We first need to double check to see if the docs are right, though.

I confirmed with the TDX module owners that #VE can only happen for:
- unaccepted pages
- instructions like MSR access or CPUID
- specific instructions that are not in the syscall gap

Also if there are future asynchronous #VEs they would only happen with IF=1, which would also protect the gap. So no need to make #VE an IST.

-Andi
Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
> > > So what happens if NMI happens here, and triggers a nested #VE ? > > > > Yes that's a gap. We should probably bail out and reexecute the original > > instruction. The VE handler would need to set a flag for that. > > > > Or alternatively the NMI always gets the VE information and puts > > it on some internal stack, but that would seem clunkier. > > The same is possible with MCE and #DB I imagine.

I don't think there are currently any plans to inject #MC into TDX guests. It's doubtful this could be done securely.

#DB is trickier because it will happen every time, so simply reexecuting won't work. I guess it would need the ve info stack, or some care in kprobes/the kernel debugger that it cannot happen. I think I would prefer the latter.

-Andi
Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest
> Which is supposedly then set up to avoid #VE during the syscall gap, > yes? Which then results in #VE not having to be IST. Yes that is currently true because all memory is pre-accepted. If we ever do lazy accept we would need to make sure the memory accessed in the syscall gap is already accepted, or move over to an IST. > > +#ifdef CONFIG_INTEL_TDX_GUEST > > +DEFINE_IDTENTRY(exc_virtualization_exception) > > +{ > > + struct ve_info ve; > > + int ret; > > + > > + RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU"); > > + > > + /* Consume #VE info before re-enabling interrupts */ > > So what happens if NMI happens here, and triggers a nested #VE ? Yes that's a gap. We should probably bail out and reexecute the original instruction. The VE handler would need to set a flag for that. Or alternatively the NMI always gets the VE information and puts it on some internal stack, but that would seem clunkier. -Andi
Re: [PATCH RESEND] perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]
On Fri, Feb 05, 2021 at 11:13:24AM -0800, Jim Mattson wrote: > Cascade Lake Xeon parts have the same model number as Skylake Xeon > parts, so they are tagged with the intel_pebs_isolation > quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS > isolation issue is fixed in all microcode versions. > > Add the Cascade Lake Xeon steppings (5, 6, and 7) to the > isolation_ucodes[] table so that these parts benefit from Andi's > optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary > work in guest filtering"). Reviewed-by: Andi Kleen -Andi
Re: [PATCH 4/0] perf intel-pt: Add PSB events
On Fri, Feb 05, 2021 at 07:53:46PM +0200, Adrian Hunter wrote: > Hi > > Here are 3 fixes and 1 minor new feature, for Intel PT. For the series: Reviewed-by: Andi Kleen -Andi
Re: [PATCH] perf/core: Emit PERF_RECORD_LOST for pinned events
> > I don't think I object to having an even in the stream, but your LOST > > event is unfortunate in that it itself can get lost when there's no > > space in the buffer (which arguably is unlikely, but still). > > > > So from that point of view, I think overloading LOST is not so very nice > > for this. > > But anything can get lost in case of no space. > Do you want to use something other than the LOST event? Could always reserve the last entry in the ring buffer for a LOST event, that would guarantee you can always get one out. -Andi
Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS
On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote: > On Fri, Jan 15, 2021, Andi Kleen wrote: > > > I'm asking about ucode/hardare. Is the "guest pebs buffer write -> PEBS > > > PMI" > > > guaranteed to be atomic? > > > > Of course not. > > So there's still a window where the guest could observe the bad counter index, > correct? Yes. But with single record PEBS it doesn't really matter with normal perfmon drivers. -Andi
Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS
> I'm asking about ucode/hardare. Is the "guest pebs buffer write -> PEBS PMI" > guaranteed to be atomic? Of course not. > > In practice, under what scenarios will guest counters get cross-mapped? And, > how does this support affect guest accuracy? I.e. how bad do things get for > the > guest if we simply disable guest counters if they can't have a 1:1 association > with their physical counter? This would completely break perfmon for the guest, likely with no way to recover. -Andi
Re: [PATCH] mm: Fix potential pte_unmap_unlock pte error
On Sat, Jan 09, 2021 at 03:01:18AM -0500, Miaohe Lin wrote: > Since commit 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged > high MMIO PROT_NONE mappings"), when the first pfn modify is not allowed, > we would break the loop with pte unchanged. Then the wrong pte - 1 would > be passed to pte_unmap_unlock. Thanks. While the fix is correct, I'm not sure if it actually is a real bug. Is there any architecture that would do something else than unlocking the underlying page? If it's just the underlying page then it should be always the same page, so no bug. That said of course the change is the right thing for main line, but probably doesn't need to be backported. -Andi
Re: [RFC PATCH 0/7] Share events between metrics
> Or, is the concern more about trying to time-slice the results in a > fairly granular way and expecting accurate results then?

Usually the latter. It's especially important for divisions. You want both divisor and dividend to be in the same time slice, otherwise the result usually doesn't make a lot of sense.

-Andi
Re: [PATCH 0/5] perf stat: Introduce --iiostat mode to provide I/O performance metrics
> My first thought was: Why not have a 'perf iiostat' subcommand? Same would apply to a lot of options in perf stat. I guess you could add some aliases to "perf" that give shortcuts for common perf stat command lines. -Andi
Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event
On Wed, Dec 02, 2020 at 11:47:25AM -0800, Stephane Eranian wrote: > On Wed, Dec 2, 2020 at 11:28 AM Andi Kleen wrote:
> > > + prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> > > + next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> > > +
> > > + if (prev_cgrp != next_cgrp)
> > > + perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
> >
> > Seems to be the perf cgroup only, not all cgroups.
> > That's a big difference and needs to be documented properly.
>
> We care about the all-cgroup case.

Then it's not correct I think. You need a different hook point.

-Andi
Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event
> +	prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> +	next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> +
> +	if (prev_cgrp != next_cgrp)
> +		perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);

Seems to be the perf cgroup only, not all cgroups. That's a big difference and needs to be documented properly.

Probably would make sense to have two events for both, one for all cgroups and one for perf only.

-Andi
Re: [PATCH v4 5/5] perf metric: Don't compute unused events.
> Can't this be all in a macro? It's still a lot of duplication. > > I'm still not a fan, but I think with a macro it wouldn't be too bad. > >Ugh, not a fan of macros. They expand to a single line of code and make >debugging hard. I'll do a v5 with a macro :-/

I suppose you could use inlines with callbacks too, with one-liner functions for the operands.

-Andi
Re: [PATCH v4 5/5] perf metric: Don't compute unused events.
> +| expr '-' expr
> +{
> +	if (!compute_ids || (is_const($1.val) && is_const($3.val))) {
> +		assert($1.ids == NULL);
> +		assert($3.ids == NULL);
> +		$$.val = $1.val - $3.val;
> +		$$.ids = NULL;
> +	} else {
> +		/* LHS and/or RHS need computing from event IDs so union. */
> +		$$ = union_expr($1, $3);
> +	}

Can't this be all in a macro? It's still a lot of duplication.

I'm still not a fan, but I think with a macro it wouldn't be too bad.

-Andi
Re: [PATCH 0/3] clear_warn_once: add timed interval resetting
On Thu, Nov 26, 2020 at 01:30:26AM -0500, Paul Gortmaker wrote: > But you currently can't make use of clear_warn_once unless you've got > debugfs enabled and mounted - which may not be desired by some people > in some deployment situations.

Seems awfully special purpose. Is the problem with debugfs security, or is it that there's no convenient process that could do cron-like functionality? If it's the first, perhaps what they really need is a way to get a partial debugfs?

-Andi
Re: [PATCH v3 5/5] perf metric: Don't compute unused events.
> +| expr '|' expr
> +{
> +	if (!compute_ids || (isfinite($1.val) && isfinite($3.val))) {
> +		assert($1.ids == NULL);
> +		assert($3.ids == NULL);
> +		$$.val = (long)$1.val | (long)$3.val;
> +		$$.ids = NULL;
> +	} else {
> +		/*
> +		 * LHS or RHS needs to be computed from event IDs, consequently
> +		 * so does this expression. Set val to NAN to show that the set
> +		 * of all values is possible, the events are the union of those
> +		 * on the LHS and RHS.
> +		 */
> +		$$.val = NAN;
> +		$$.ids = ids__union($1.ids, $3.ids);
> +	}

Sorry, still not a fan of having this NaN code all over. It's just ugly.

If you don't want to do the syntax change to still do it in one pass (and building an AST would be a little complicated): the original parser I based this code on actually had a byte code version too (see attachment). With that one the lazy evaluation could be done on the byte code level. Originally I didn't include it because it wasn't really needed for perf, but presumably if we want to do more complicated things it might be useful. In theory it could speed up performance slightly when an expression needs to be computed multiple times in interval mode.
-Andi

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "code.h"
#include "sym.h"
#include "error.h"

static unsigned char *bufs;
static unsigned bufl;
static unsigned char *bufp;

static void overflow(void)
{
	yyerror("expression exceeds execution buffer");
}

void put_op(enum ops op)
{
	if (bufp >= bufs + bufl)
		overflow();
	*bufp++ = op;
}

void put_long(enum ops op, long arg)
{
	if (bufp + sizeof(long) + 1 >= bufs + bufl)
		overflow();
	*bufp++ = op;
	memcpy(bufp, &arg, sizeof(long));
	bufp += sizeof(long);
}

void put_ptr(enum ops op, void *arg)
{
	if (bufp + sizeof(void *) + 1 >= bufs + bufl)
		overflow();
	*bufp++ = op;
	memcpy(bufp, &arg, sizeof(void *));
	bufp += sizeof(void *);
}

void start_def(struct symbol *s)
{
	if (!s->undef)
		yyerror("symbol %s redefined", s->name);
}

void end_def(struct symbol *s)
{
	int len;

	put_op(OP_RET);
	len = bufp - bufs;
	s->code = malloc(len);
	memcpy(s->code, bufs, len);
	bufp = bufs;
	s->undef = 0;
}

static void execerror(char const *msg)
{
	fprintf(stderr, "expression execution error: %s\n", msg);
	exit(1);
}

#define STACKSIZE 16
#define CSTACKSIZE 16

long execute(unsigned char *bufp)
{
	static void *target[] = {
		[OP_END]    = &&end,
		[OP_EVENT]  = NULL,
		[OP_NUMBER] = &&number,
		[OP_PLUS]   = &&plus,
		[OP_MINUS]  = &&minus,
		[OP_MUL]    = &&mul,
		[OP_DIV]    = &&div,
		[OP_MOD]    = &&mod,
		[OP_NEGATE] = &&negate,
		[OP_CALL]   = &&call,
		[OP_RET]    = &&ret,
		[OP_XOR]    = &&xor,
		[OP_OR]     = &&or,
		[OP_NOT]    = &&not,
		[OP_AND]    = &&and,
		[OP_SHL]    = &&shl,
		[OP_SHR]    = &&shr,
	};
	long a, b;
	long stack[STACKSIZE];
	int stackp = 0;
	unsigned char *callstack[CSTACKSIZE];
	int callstackp = 0;

#define getop(x) memcpy(&(x), bufp, sizeof(x)); bufp += sizeof(x)
#define push(x) stack[stackp++] = (x)
#define pop() stack[--stackp]
#define next() goto *target[(int)*bufp++]

	next();
number:
	if (stackp == STACKSIZE)
		execerror("expression stack overflow");
	getop(a);
	push(a);
	next();
#define OP(op) \
	b = pop(); \
	a = pop(); \
	push(a op b); \
	next()
plus:	OP(+);
minus:	OP(-);
mul:	OP(*);
div:
	b = pop();
	if (b == 0)
		execerror("division by 0");
	a = pop();
	push(a / b);
	next();
mod:
	b = pop();
	if (b == 0)
		execerror("modulo by 0");
	a = pop();
	push(a % b);
	next();
negate:
	a = pop();
	push(-a);
	next();
and:	OP(&);
or:	OP(|);
xor:	OP(^);
shl:	OP(<<);
shr:	OP(>>);
not:
	a = pop();
	push(~a);
	next();
call: {
	struct symbol *s;
	getop(s);
	if (callstackp == CSTACKSIZE)
		execerror("call stack overflow");
	callstack[callstackp++] = bufp;
	bufp = s->code;
	next();
	}
ret:
	bufp = callstack[--callstackp];
	next();
end:
	assert(bufp == bufs || stackp == 1);
	assert(callstackp == 0);
	return stack[0];
}
#undef next
#undef push
#undef pop
#undef getop

void code_init(unsigned char *buf, unsigned
Re: [PATCH] perf vendor events: Update Skylake client events to v50
> You mean change event-converter-for-linux-perf to add this as JSON > comments at the start of the generated files? JSON doesn't support comments unfortunately (unless they're somehow part of the data schema) -Andi
Re: [PATCH] perf vendor events: Update Skylake client events to v50
>I'd prefer if we could make a copy of these scripts into the kernel tree >along with the data files from: >https://download.01.org/perfmon/

FWIW I originally tried to include the raw JSON files, but Ingo objected and wanted the files to be split up[1] and redundant headers avoided. That's why we ended up with the "compiled" format.

[1] https://lkml.org/lkml/2015/5/28/336

>Having a generated file that can't be edited, reviewed, .. I also worry >that if/when we change the json format then event-converter-for-linux-perf >will need to support a multitude of perf versions.

You mean for backports? The assumption was that perf backports would backport any perf changes for new formats too. I generally discourage any perf tools backport; people should just use a new version of the perf tool on old kernels.

In principle the scripts could be included, but without the raw files it would be somewhat pointless.

-andi
Re: [PATCH] perf vendor events: Update Skylake client events to v50
> Can I get ACK for this patch? > > It's mainly for fixing the perf-test issue and MEM_Parallel_Reads issue. Acked-by: Andi Kleen The JSON updates should be imho just merged automatically. Not much point in doing code review on them. If there's a problem it has to be fixed JSON upstream anyways. -Andi
Re: [PATCH] perf stat: Take cgroups into account for shadow stats
Actually thinking about it more you should probably pass around ctx/cgroup in a single abstract argument. Otherwise have to change all the metrics functions for the next filter too.
Re: [PATCH] perf stat: Take cgroups into account for shadow stats
> @@ -57,6 +59,9 @@ static int saved_value_cmp(struct rb_node *rb_node, const > void *entry) > if (a->ctx != b->ctx) > return a->ctx - b->ctx; > > + if (a->cgrp != b->cgrp) > + return (char *)a->cgrp < (char *)b->cgrp ? -1 : +1; This means the sort order will depend on heap randomization, which will make it harder to debug. Better use something stable like the inode number of the cgroup. Do we have the same problem with other filters? The rest of the patch looks good to me. -Andi
Re: [PATCH 5/5] perf metric: Don't compute unused events.
The patch does a lot of stuff and is hard to review. The earlier patches all look good to me.

> > static int > __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr, > - int start, int runtime) > + bool compute_ids, int runtime) > { > struct expr_scanner_ctx scanner_ctx = { > - .start_token = start,

Okay so why do we not need that anymore? This should probably be a separate change.

I'm not sure about the NaN handling. Why can't you use some other value (perhaps 0 with some special case for division)? After all the computed values of the dead branch shouldn't be used anyways, so I don't think we need the extra NaN ceremony everywhere. The only thing that really matters is to track that the event is not added and not resolved.

I think it would be a lot simpler if we changed the IF syntax. The problem right now is that the condition is after the first expression, so you can't use global state with an early reduction to track if the if branch is dead or not. If we used the C style cond ? a : b it could be:

cond { set flag } '?' expr { reset flag } ':' expr { reset flag }

with the scanner checking the flag and ignoring the event if false. This would need changing the JSON files but that should be easy enough with a script. Or maybe at some point we need to bite the bullet and build an AST, but for now we probably can get away with not doing it.

-Andi
Re: [PATCH] perf stat: Use proper cpu for shadow stats
> CPU0 109,621,384 cycles > CPU1 159,026,454 cycles > CPU2 99,460,366 cycles > CPU3 124,144,142 cycles > CPU0 44,396,706 instructions #0.41 insn per cycle > CPU1 120,195,425 instructions #0.76 insn per cycle > CPU2 44,763,978 instructions #0.45 insn per cycle > CPU3 69,049,079 instructions #0.56 insn per cycle > >1.001910444 seconds time elapsed Yes makes a lot of sense. Thanks for tracking that down. Reviewed-by: Andi Kleen -Andi
Re: [RFC PATCH 00/12] Topdown parser
On Wed, Nov 11, 2020 at 08:09:49PM -0800, Ian Rogers wrote: > > to the optimization manual the group Topdown_Group_TopDownL1 provides the > > > metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound, Topdown_Metric_Bad_Speculation > > and Topdown_Metric_Retiring. The hope is the events here will all be > > scheduled without multiplexing.

> That's not necessarily true. Some of the newer expressions are quite > complex (e.g. due to workarounds or because the events are complex, like > the FLOPS events). There are also some problems with the > scheduling of the fixed metrics on Icelake+, that need special handling.

>For FLOPS I see: >( #FP_Arith_Scalar + #FP_Arith_Vector ) / ( 2 * CORE_CLKS ) >is the concern about multiplexing? Could we create metrics/groups aware of >the limitations?

If you expand it you'll end up with a lot of events, so it has to be split into groups. But you still need to understand the rules, otherwise the tool ends up with non-schedulable groups. For example here's a group schedule generated by toplev for level 3 Icelake. On pre-Icelake with only 4 counters it's more complicated.
Microcode_Sequencer[3] Heavy_Operations[1] Memory_Bound[1] Branch_Mispredicts[1] Fetch_Bandwidth[1] Other[4] Heavy_Operations[3] Frontend_Bound[1] FP_Arith[1] Backend_Bound[1] Light_Operations[3] Microcode_Sequencer[1] FP_Arith[4] Fetch_Bandwidth[2] Light_Operations[1] Retiring[1] Bad_Speculation[1] Machine_Clears[1] Other[1] Core_Bound[1] Ports_Utilization[1]:
  perf_metrics.frontend_bound[f] perf_metrics.bad_speculation[f] topdown.slots[f] perf_metrics.backend_bound[f] perf_metrics.retiring[f] [0 counters]
Machine_Clears[2] Branch_Mispredicts[2] ITLB_Misses[3] Branch_Mispredicts[1] Fetch_Bandwidth[1] Machine_Clears[1] Frontend_Bound[1] Backend_Bound[1] ICache_Misses[3] Fetch_Latency[2] Fetch_Bandwidth[2] Bad_Speculation[1] Memory_Bound[1] DSB_Switches[3]:
  inst_retired.any[f] machine_clears.count cpu_clk_unhalted.thread[f] int_misc.recovery_cycles:c1:e1 br_misp_retired.all_branches idq_uops_not_delivered.cycles_0_uops_deliv.core topdown.slots[f] dsb2mite_switches.penalty_cycles icache_64b.iftag_stall int_misc.uop_dropping icache_16b.ifdata_stall [8 counters]
Core_Bound[1] Core_Bound[2] Heavy_Operations[3] Memory_Bound[2] Light_Operations[3]:
  cycle_activity.stalls_mem_any idq.ms_uops exe_activity.1_ports_util exe_activity.exe_bound_0_ports exe_activity.2_ports_util exe_activity.bound_on_stores int_misc.recovery_cycles:c1:e1 uops_issued.any [8 counters]
Microcode_Sequencer[3] Store_Bound[3] Branch_Resteers[3] MS_Switches[3] Divider[3] LCP[3]:
  int_misc.clear_resteer_cycles arith.divider_active cpu_clk_unhalted.thread[f] idq.ms_switches ild_stall.lcp baclears.any exe_activity.bound_on_stores idq.ms_uops uops_issued.any [8 counters]
L1_Bound[3] L3_Bound[3]:
  cycle_activity.stalls_l1d_miss cycle_activity.stalls_mem_any cycle_activity.stalls_l2_miss cpu_clk_unhalted.thread[f] cycle_activity.stalls_l3_miss [4 counters]
DSB[3] MITE[3]:
  idq.mite_cycles_ok idq.mite_cycles_any cpu_clk_unhalted.distributed idq.dsb_cycles_any idq.dsb_cycles_ok [5 counters]
LSD[3] L2_Bound[3]:
  cycle_activity.stalls_l1d_miss cpu_clk_unhalted.thread[f] cpu_clk_unhalted.distributed lsd.cycles_ok lsd.cycles_active cycle_activity.stalls_l2_miss [5 counters]
Other[4] FP_Arith[4] L2_Bound[3] DRAM_Bound[3]:
  mem_load_retired.fb_hit mem_load_retired.l2_hit fp_arith_inst_retired.512b_packed_single mem_load_retired.l1_miss fp_arith_inst_retired.512b_packed_double l1d_pend_miss.fb_full_periods [6 counters]
Ports_Utilization[3] DRAM_Bound[3]:
  cycle_activity.stalls_l1d_miss arith.divider_active mem_load_retired.l2_hit exe_activity.1_ports_util cpu_clk_unhalted.thread[f] cycle_activity.stalls_l3_miss exe_activity.2_ports_util cycle_activity.stalls_l2_miss exe_activity.exe_bound_0_ports [8 counters]
Other[4] FP_Arith[4]:
  fp_arith_inst_retired.128b_packed_single uops_executed.thread uops_executed.x87 fp_arith_inst_retired.scalar_double fp_arith_inst_retired.256b_packed_single fp_arith_inst_retired.scalar_single fp_arith_inst_retired.128b_packed_double fp_arith_inst_retired.256b_packed_double [8 counters]

> Ok, so we can read the threshold from the spreadsheet and create an extra
> metric for if the metric above the threshold?

Yes it can be all derived from the spreadsheet. You need a much more
complicated evaluation algorithm though, single pass is not enough.

> > Also in other cases it's probably better to not drilldown, but collect
> > everything upfront, e.g. when someone else is doing the collection
> > for you. In this case the thresholding has to be figured out from
> > existing data.
>
> This sounds like perf record support for metrics, which I think is a good
> idea but not currently a use-case I'm trying to solve.

I meant just a single
Re: [RFC PATCH 00/12] Topdown parser
Hi Ian,

On Wed, Nov 11, 2020 at 03:49:10PM -0800, Ian Rogers wrote:
> > FWIW I did something similar in python (that's how the current metrics
> > json files were generated from the spreadsheet) and it was a lot
> > simpler and shorter in a higher level language.
>
> The use of C++ here was intended to be a high-level language usage :-)

FWIW this is our script for the metrics:

https://github.com/intel/event-converter-for-linux-perf/blob/master/extract-tmam-metrics.py

It has a few hacks, but overall isn't too bad.

> > One problem I see with putting the full TopDown model into perf is
> > that to do a full job it requires a lot more infrastructure that is
> > currently not implemented in metrics: like an event scheduler,
> > hierarchical thresholding over different nodes, SMT mode support etc.
> >
> > I implemented it all in toplev, but it was a lot of work to get it all
> > right. I'm not saying it's not doable, but it will be a lot of
> > additional work to work out all the quirks using the metrics
> > infrastructure.
> >
> > I think adding one or two more levels is probably ok, but doing all
> > levels without these mechanisms might be difficult to use in the end.
>
> Thanks Andi, I'm working from the optimization manual and trying to
> understand this. With this contribution we have both metrics and groups
> that correspond to the levels in the tree. Following a similar flow to
> the optimization manual the group Topdown_Group_TopDownL1 provides the
> metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound,
> Topdown_Metric_Bad_Speculation and Topdown_Metric_Retiring. The hope is
> the events here will all be scheduled without multiplexing.

That's not necessarily true. Some of the newer expressions are quite
complex (e.g. due to workarounds, or because the events themselves are
complex, like the FLOPS events). There are also some problems with the
scheduling of the fixed metrics on Icelake+ that need special handling.
> Let's say Topdown_Metric_Backend_Bound is an outlier and so then you
> re-run perf to drill into the group Topdown_Group_Backend which will
> provide the metrics Topdown_Metric_Memory_Bound and
> Topdown_Metric_Core_Bound. If the metric Topdown_Metric_Memory_Bound is
> the outlier then you can use the group Topdown_Group_Memory_Bound to
> drill into DRAM, L1, L2, L3, PMM and

I think at a minimum you would need a threshold concept for this,
otherwise the users don't know what passed or what didn't, and where to
drill down. Also, in modern TMAM some event thresholds are not only
based on their parents but on cross-tree triggers (e.g. $issue). Doing
this by hand is somewhat tricky.

BTW toplev has an auto drilldown mode now to automate this (--drilldown).

Also, in other cases it's probably better to not drilldown, but collect
everything upfront, e.g. when someone else is doing the collection for
you. In this case the thresholding has to be figured out from existing
data.

The other biggie which is currently not in metrics is per-core mode,
which is needed for many metrics on CPUs older than Icelake. This really
has to be supported in some way, otherwise the results on pre-Icelake
SMT are not good (Icelake fixes this problem).

> like SMT_on that may appear in the spreadsheet in expressions like:
> ( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else CLKS
> These are passed through and then the perf expression parser will at
> runtime compute the SMT_ON value using smt_on() that reads
> from devices/system/cpu/cpu%d/topology/thread_siblings. Here is a
> generated example for Skylakex:

Yes, I implemented that. But that's not enough for the full TMAM: you
also need per-core mode for some of these (but not for others). At
level 1 we get away with either being in per-core mode or not, but that
doesn't work deeper in the hierarchy.

> SMT_ON needs evaluating early. This is a latent perf metric issue that
> would benefit everyone if fixed, so I'll try to address it in a separate
> change.
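[Editor's note] The smt_on() decision itself is cheap; here is a hedged C sketch of deriving it from a thread_siblings list string, as read from sysfs. The helper name is invented, and real cpulists can mix ranges and commas more elaborately than this simplified parser handles; this is not the actual perf code.

```c
#include <assert.h>

/* Decide SMT from a sysfs cpulist like "0,4" or "0-1".
 * More than one listed sibling means SMT is on for that core.
 * (Simplified: a real cpulist parser handles arbitrary range lists.) */
static int smt_on_from_siblings(const char *list)
{
	int cpus = 1;
	const char *p;

	for (p = list; *p; p++) {
		if (*p == ',' || *p == '-')
			cpus++;		/* each separator adds a sibling */
	}
	return cpus > 1;
}
```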
> Event scheduling is missing, and as discussed there are some naive
> heuristics currently in perf. It seems we can improve this; for example
> a group with events {A,B,C} and another with {B,C,D} could be merged to
> create {A,B,C,D}, but we'd need to know the number of counters. We could

It's more complicated with OCR and similar special events, of which only
a limited number fit into a group. Also, on Icelake you need to know
about the 4 vs 8 counter constraints, so it requires some knowledge of
the event list. With the fixed metrics there are also special rules,
like slots needing to be first and be followed by the metrics. And if
you don't do some deduping, you end up with a lot of redundant events
being scheduled. Also, when there are multiple levels of expressions
there is usually a need for sub-grouping by levels etc. It's not a
simple algorithm.

> provide
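[Editor's note] The {A,B,C} + {B,C,D} merge discussed above can be sketched as follows. This deliberately ignores every real-world constraint Andi lists (OCR limits, fixed counters, slots-first rules, 4 vs 8 counter events); all names are invented for illustration.

```c
#include <assert.h>
#include <string.h>

#define MAX_EVS 16

/* Try to merge group b into group a, deduplicating events, but only
 * if the result still fits into 'counters' hardware counters.
 * Returns 1 on success (a and *na updated), 0 if it would not fit
 * (a is left unchanged in that case). */
static int merge_groups(const char **a, int *na,
			const char **b, int nb, int counters)
{
	const char *merged[MAX_EVS];
	int n = *na, i, j;

	memcpy(merged, a, n * sizeof(*a));
	for (i = 0; i < nb; i++) {
		for (j = 0; j < n; j++)
			if (!strcmp(b[i], merged[j]))
				break;
		if (j == n) {			/* not a duplicate */
			if (n == counters)
				return 0;	/* exceeds counter budget */
			merged[n++] = b[i];
		}
	}
	memcpy(a, merged, n * sizeof(*a));
	*na = n;
	return 1;
}
```

With 4 counters, {A,B,C} and {B,C,D} merge into {A,B,C,D}; with only 3, the merge is refused, which is the kind of decision that needs counter knowledge.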
Re: [RFC PATCH 00/12] Topdown parser
On Tue, Nov 10, 2020 at 02:03:34AM -0800, Ian Rogers wrote:
> This RFC is for a new tool that reads TMA_Metrics.csv as found on
> download.01.org/perfmon and generates metrics and metric groups from
> it. To show the functionality the TMA_Metrics.csv is downloaded, but
> an accepted change would most likely include a copy of this file from
> Intel. With this tool rather than just level 1 topdown metrics, a full
> set of topdown metrics to level 4 are generated.

I'm not sure I understand the motivation for making the spreadsheet
parsing part of the perf tool? It only needs to run once when the
metrics are generated to build perf.

FWIW I did something similar in python (that's how the current metrics
json files were generated from the spreadsheet) and it was a lot
simpler and shorter in a higher level language.

One problem I see with putting the full TopDown model into perf is
that to do a full job it requires a lot more infrastructure that is
currently not implemented in metrics: like an event scheduler,
hierarchical thresholding over different nodes, SMT mode support etc.

I implemented it all in toplev, but it was a lot of work to get it all
right. I'm not saying it's not doable, but it will be a lot of
additional work to work out all the quirks using the metrics
infrastructure.

I think adding one or two more levels is probably ok, but doing all
levels without these mechanisms might be difficult to use in the end.

-Andi
Re: [PATCH 24/24] perf record: Add --buildid-mmap option to enable mmap's build id
On Mon, Nov 09, 2020 at 10:54:15PM +0100, Jiri Olsa wrote:
> Adding --buildid-mmap option to enable build id in mmap2 events.
> It will only work if there's kernel support for that and it disables
> build id cache (implies --no-buildid).

What's the point of the option? Why not enable it by default if the
kernel supports it? With the option most users won't get the benefit.

The only reason I can think of for an option would be to disable it, so
that old tools can still process the data.

-Andi
Re: [PATCH v12 01/11] perf/x86: Fix variable types for LBR registers
On Sat, Jun 13, 2020 at 04:09:46PM +0800, Like Xu wrote:
> From: Wei Wang
>
> The MSR variable type can be 'unsigned int', which uses less memory than
> the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't
> be a negative number, so make it 'unsigned int' as well.

Hi,

What's the status of this patchkit? It would be quite useful to me (and
various other people) to be able to use LBRs in a guest. I reviewed it
earlier and the patches all looked good to me, but I don't see it in
any -next tree.

Reviewed-by: Andi Kleen

Could it please be merged?

Thanks,
-Andi
Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation
On Mon, Nov 02, 2020 at 03:16:25PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 01, 2020 at 07:52:38PM -0800, Andi Kleen wrote:
> > The main motivation is actually that the "multiple groups" algorithm
> > in perf doesn't work all that great: it has quite a few cases where it
> > starves groups or makes the wrong decisions. That is because it is a
> > very difficult (likely NP-complete) problem and the kernel takes a lot
> > of short cuts to avoid spending too much time on it.
>
> The event scheduling should be starvation free, except in the presence
> of pinned events.
>
> If you can show starvation without pinned events, it's a bug.
>
> It will also always do equal or better than exclusive mode wrt PMU
> utilization. Again, if it doesn't it's a bug.

Simple example (I think we've shown that one before):

(on skylake)
$ cat /proc/sys/kernel/nmi_watchdog
0
$ perf stat -e instructions,cycles,frontend_retired.latency_ge_2,frontend_retired.latency_ge_16 -a sleep 2

 Performance counter stats for 'system wide':

       654,514,990      instructions              #    0.34  insn per cycle  (50.67%)
     1,924,297,028      cycles                                               (74.28%)
        21,708,935      frontend_retired.latency_ge_2                        (75.01%)
         1,769,952      frontend_retired.latency_ge_16                       (24.99%)

       2.002426541 seconds time elapsed

Both frontend_retired events should be getting 50%, and the fixed events
should be getting 100%. So several events are starved.

Another similar example is trying to schedule the topdown events on
Icelake in parallel to other groups. It works with one extra group, but
breaks with two.
(on icelake)
$ cat /proc/sys/kernel/nmi_watchdog
0
$ perf stat -e '{slots,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring},{branches,branches,branches,branches,branches,branches,branches,branches},{branches,branches,branches,branches,branches,branches,branches,branches}' -a sleep 1

 Performance counter stats for 'system wide':

        71,229,087      slots                                            (60.65%)
         5,066,320      topdown-bad-spec     #  7.1% bad speculation     (60.65%)
        35,080,387      topdown-be-bound     # 49.2% backend bound       (60.65%)
        22,769,750      topdown-fe-bound     # 32.0% frontend bound      (60.65%)
         8,336,760      topdown-retiring     # 11.7% retiring            (60.65%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
           424,584      branches                                         (70.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)
         3,634,075      branches                                         (30.00%)

       1.001312511 seconds time elapsed

A tool using exclusive hopefully will be able to do better than this.

-Andi
Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation
> hm, it's too late for me to check ;-) but should I be able to do
> this with exclusive event.. running both command at the same time:

Yes. The exclusive part only applies within a given context, but the two
commands are different contexts.

You would only see a difference within the same context, when you have
multiple groups (or events) that could in theory be scheduled in
parallel, e.g. something like:

perf stat -e '{cycles,cycles},{cycles,cycles}' ...

The main motivation is actually that the "multiple groups" algorithm
in perf doesn't work all that great: it has quite a few cases where it
starves groups or makes the wrong decisions. That is because it is a
very difficult (likely NP-complete) problem and the kernel takes a lot
of short cuts to avoid spending too much time on it.

With exclusive it will be possible for a tool to generate "perfect
groups" in user space and assume the kernel schedules them dumbly, but
at least without any starvation.

-Andi
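[Editor's note] For reference, the knob being discussed is the `exclusive` bit in `perf_event_attr` from perf_event_open(2), which a tool would set on the leader of each hand-built group. A minimal sketch of the attr setup (the actual perf_event_open call, group wiring, and error handling are omitted):

```c
#include <assert.h>
#include <string.h>
#include <linux/perf_event.h>

/* Fill a perf_event_attr for a group leader that should be scheduled
 * exclusively: the group gets the PMU alone or not at all, so groups
 * built carefully in user space are not broken up or starved by the
 * kernel's group-mixing heuristics. */
static void init_exclusive_leader(struct perf_event_attr *attr)
{
	memset(attr, 0, sizeof(*attr));
	attr->size = sizeof(*attr);
	attr->type = PERF_TYPE_HARDWARE;
	attr->config = PERF_COUNT_HW_CPU_CYCLES;
	attr->exclusive = 1;	/* don't co-schedule with other groups */
	attr->disabled = 1;	/* leader starts disabled; enable group later */
}
```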
[PATCH v3] perf tools: Support -x for perf stat report
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen
---
v2: Fix default output (Jiri). Also handle \t as special value like the
original parser (Andi)
v3: Use DEFAULT_SEPARATOR
---
 tools/perf/builtin-stat.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..9fcc7351ce43 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 		     "aggregate counts per numa node", AGGR_NODE),
 	OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 		     "disable CPU count aggregation", AGGR_NONE),
+	OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+		   "print counts with custom separator"),
 	OPT_END()
 	};
 	struct stat st;
@@ -2002,6 +2004,13 @@ static int __cmd_report(int argc, const char **argv)
 		input_name = "perf.data";
 	}
 
+	if (strcmp(stat_config.csv_sep, DEFAULT_SEPARATOR)) {
+		stat_config.csv_output = true;
+		if (!strcmp(stat_config.csv_sep, "\\t"))
+			stat_config.csv_sep = "\t";
+		stat_config.big_num = false;
+	}
+
 	perf_stat.data.path = input_name;
 	perf_stat.data.mode = PERF_DATA_MODE_READ;
-- 
2.28.0
[PATCH v2] perf tools: Support -x for perf stat report
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen
---
v2: Fix default output (Jiri). Also handle \t as special value like the
original parser (Andi)
---
 tools/perf/builtin-stat.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..2f7eb6b68344 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 		     "aggregate counts per numa node", AGGR_NODE),
 	OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 		     "disable CPU count aggregation", AGGR_NONE),
+	OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+		   "print counts with custom separator"),
 	OPT_END()
 	};
 	struct stat st;
@@ -2002,6 +2004,13 @@ static int __cmd_report(int argc, const char **argv)
 		input_name = "perf.data";
 	}
 
+	if (strcmp(stat_config.csv_sep, " ")) {
+		stat_config.csv_output = true;
+		if (!strcmp(stat_config.csv_sep, "\\t"))
+			stat_config.csv_sep = "\t";
+		stat_config.big_num = false;
+	}
+
 	perf_stat.data.path = input_name;
 	perf_stat.data.mode = PERF_DATA_MODE_READ;
-- 
2.28.0
[PATCH 2/2] perf tools: Support -x for perf stat report
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen
---
 tools/perf/builtin-stat.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..31e7bd877f1d 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 		     "aggregate counts per numa node", AGGR_NODE),
 	OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 		     "disable CPU count aggregation", AGGR_NONE),
+	OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+		   "print counts with custom separator"),
 	OPT_END()
 	};
 	struct stat st;
@@ -2002,6 +2004,11 @@ static int __cmd_report(int argc, const char **argv)
 		input_name = "perf.data";
 	}
 
+	if (stat_config.csv_sep) {
+		stat_config.csv_output = true;
+		stat_config.big_num = false;
+	}
+
 	perf_stat.data.path = input_name;
 	perf_stat.data.mode = PERF_DATA_MODE_READ;
-- 
2.28.0
[PATCH 1/2] perf tools: Add --quiet option to perf stat
Add a new --quiet option to perf stat. This is useful with perf stat
record to write the data only to the perf.data file, which can lower
measurement overhead because the data doesn't need to be formatted.

On my 4C desktop:

% time ./perf stat record -e $(python -c 'print ",".join(["cycles"]*1000)') -a -I 1000 sleep 5
...
real	0m5.377s
user	0m0.238s
sys	0m0.452s
% time ./perf stat record --quiet -e $(python -c 'print ",".join(["cycles"]*1000)') -a -I 1000 sleep 5

real	0m5.452s
user	0m0.183s
sys	0m0.423s

In this example it cuts the user time by 20%. On systems with more
cores the savings are higher.

Signed-off-by: Andi Kleen
---
 tools/perf/Documentation/perf-stat.txt | 4 ++++
 tools/perf/builtin-stat.c              | 6 +++++-
 tools/perf/util/stat.h                 | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-stat.txt b/tools/perf/Documentation/perf-stat.txt
index 9f9f29025e49..b138dd192423 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -316,6 +316,10 @@ small group that need not have multiplexing is lowered. This option
 forbids the event merging logic from sharing events between groups and
 may be used to increase accuracy in this case.
 
+--quiet::
+Don't print output. This is useful with perf stat record below to only
+write data to the perf.data file.
+
 STAT RECORD
 -----------
 Stores stat data into perf data file.

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b01af171d94f..743fe47e7a88 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -972,6 +972,8 @@ static void print_counters(struct timespec *ts, int argc, const char **argv)
 	/* Do not print anything if we record to the pipe. */
 	if (STAT_RECORD && perf_stat.data.is_pipe)
 		return;
+	if (stat_config.quiet)
+		return;
 
 	perf_evlist__print_counters(evsel_list, &stat_config, &target,
 				    ts, argc, argv);
@@ -1171,6 +1173,8 @@ static struct option stat_options[] = {
 			"threads of same physical core"),
 	OPT_BOOLEAN(0, "summary", &stat_config.summary,
 		       "print summary for interval mode"),
+	OPT_BOOLEAN(0, "quiet", &stat_config.quiet,
+			"don't print output (useful with record)"),
 #ifdef HAVE_LIBPFM
 	OPT_CALLBACK(0, "pfm-events", &evsel_list, "event",
 		"libpfm4 event selector. use 'perf list' to list available events",
@@ -2132,7 +2136,7 @@ int cmd_stat(int argc, const char **argv)
 			goto out;
 	}
 
-	if (!output) {
+	if (!output && !stat_config.quiet) {
 		struct timespec tm;
 		mode = append_file ? "a" : "w";

diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 487010c624be..05adf8165025 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -122,6 +122,7 @@ struct perf_stat_config {
 	bool			 metric_no_group;
 	bool			 metric_no_merge;
 	bool			 stop_read_counter;
+	bool			 quiet;
 	FILE			*output;
 	unsigned int		 interval;
 	unsigned int		 timeout;
-- 
2.28.0
Re: [RFC] perf evlist: Warn if event group has mixed sw/hw events
On Mon, Oct 26, 2020 at 11:19:37PM +0900, Namhyung Kim wrote:
> This patch just added a warning before running it. I'd really want to
> fix the kernel if possible but don't have a good idea. Thoughts?

The easiest fix would be some multi-threading in the perf stat opening;
then the extra latencies could be mostly hidden. One thread per group
would probably be overkill, but just a few threads would lower the
penalty significantly.

I think that would be better than this patch, and it's likely not that
much more complicated, as this is already a lot of code.

> +{
> +	const char *known_sw_pmu[] = {
> +		"software", "tracepoint", "breakpoint", "kprobe", "uprobe", "msr"

That's a non-scalable approach. New PMUs get added regularly. It would
be better to indicate this in a generic way from the kernel.

> +		pr_warning("WARNING: Event group has mixed hw/sw events.\n"
> +			   "This will slow down the perf_event_open syscall.\n"
> +			   "Consider putting a hw event as a leader.\n\n");

You really need to tell the user which group, otherwise it is hard to
find in a large command line.

-Andi
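[Editor's note] The threaded-open suggestion can be sketched roughly as below: a small pool of workers each opens a strided slice of the groups, so the per-group open latency overlaps instead of accumulating. `open_group()` is a stand-in for a real perf_event_open loop; everything here is invented for illustration.

```c
#include <assert.h>
#include <pthread.h>

#define NGROUPS  8
#define NWORKERS 4

static int opened[NGROUPS];

/* Stand-in for opening all events of one group via perf_event_open(2). */
static void open_group(long idx)
{
	opened[idx] = 1;
}

static void *worker(void *arg)
{
	long first = (long)arg;
	long i;

	/* each worker handles a strided slice of the groups */
	for (i = first; i < NGROUPS; i += NWORKERS)
		open_group(i);
	return NULL;
}

static void open_all_groups(void)
{
	pthread_t tid[NWORKERS];
	long t;

	for (t = 0; t < NWORKERS; t++)
		pthread_create(&tid[t], NULL, worker, (void *)t);
	for (t = 0; t < NWORKERS; t++)
		pthread_join(tid[t], NULL);
}
```

The stride means no two workers touch the same group, so no locking is needed for the opens themselves; the real code would still need to serialize any shared evlist bookkeeping.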
Re: [PATCH 1/2] perf test: Use generic event for expand_libpfm_events()
On Sat, Oct 24, 2020 at 11:59:17AM +0900, Namhyung Kim wrote:
> I found that the UNHALTED_CORE_CYCLES event is only available in the
> Intel machines and it makes other vendors/archs fail on the test. As
> libpfm4 can parse the generic events like cycles, let's use them.
>
> Fixes: 40b74c30ffb9 ("perf test: Add expand cgroup event test")
> Signed-off-by: Namhyung Kim

So would the test still fail when libpfm is not compiled in?

-Andi
Re: [PATCH] perf stat: Support regex pattern in --for-each-cgroup
On Fri, Oct 23, 2020 at 04:42:34PM +0900, Namhyung Kim wrote:
> To make the command line even more compact with cgroups, support regex
> pattern matching in cgroup names.
>
> $ perf stat -a -e cpu-clock,cycles --for-each-cgroup '^.$' sleep 1

The example doesn't exactly show the benefit. So ^.$ would match only
one-character cgroup names?

Missing documentation updates.

-Andi
Re: [PATCH] perf/x86/intel: make anythread filter support conditional
On Wed, Oct 21, 2020 at 02:16:12PM -0700, Stephane Eranian wrote:
> Starting with Arch Perfmon v5, the anythread filter on generic counters
> may be deprecated. The current kernel was exporting the any filter
> without checking. On Icelake, it means you could do cpu/event=0x3c,any/
> even though the filter does not exist. This patch corrects the problem
> by relying on the CPUID 0xa leaf function to determine if anythread is
> supported or not as described in the Intel SDM Vol3b 18.2.5.1 AnyThread
> Deprecation section.

Reviewed-by: Andi Kleen

-Andi
Re: [PATCH v1 02/15] perf report: output trace file name in raw trace dump
> > After brief evaluation it still doesn't look easy. The simplest thing
> > I could imagine is to probably combine file_path and file_offset at a
> > struct object on stack and then pass the object by the reference along
> > function calls. I expect it will roughly cause the same amount of changes
> > in the code and saves one argument (GP register). It is not that much
> > but still. However I don't see issues with passing even 6 args on stack.
>
> Sorry - "passing 6 args to a function call"

Ah, it wasn't about code efficiency, just maintainability. Function
calls with too many arguments are generally hard to maintain, and
typically at some point have to be refactored anyways because they tend
to grow even more over time.

But if it's too difficult, ok.

-Andi
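[Editor's note] The refactor being discussed, sketched with invented names: the file identity travels as one struct passed by reference, so the call chain's signatures stop growing as fields are added.

```c
#include <assert.h>
#include <string.h>

/* Bundle the trace-file identity into one object instead of threading
 * file_path and file_offset as separate arguments through every
 * function in the dump path. Names are illustrative only. */
struct dump_pos {
	const char *file_path;
	unsigned long long file_offset;
};

/* A callee deep in the chain now takes one pointer; adding a field to
 * struct dump_pos later does not change any signatures. */
static unsigned long long dump_event_at(const struct dump_pos *pos)
{
	/* real code would print pos->file_path alongside the raw dump */
	return pos->file_offset;
}
```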
Re: [PATCH] perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY
On Mon, Oct 19, 2020 at 08:01:58AM -0700, kan.li...@linux.intel.com wrote:
> Add a line for the CYCLE_ACTIVITY.STALLS_MEM_ANY event in the ICL
> constraint table.
> Correct the comments for the CYCLE_ACTIVITY.CYCLES_MEM_ANY event.

Thanks Kan.

> Reported-by: Andi Kleen
> Signed-off-by: Kan Liang

I guess this should have a Fixes: tag and also be proposed for stable.

-Andi