Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
> We fall back to the rmap when it's obviously not smart to do so. There
> is still a lot of room for improvement in this function though, i.e.,
> it should be per VMA and NUMA aware.

Okay, so it's more a question of tuning the crossover heuristic. That
sounds much easier than replacing everything.

Of course long term it might be a problem to maintain too many 
different ways to do things, but I suppose short term it's a reasonable
strategy.

-Andi



Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-14 Thread Andi Kleen
> The CNA code, if enabled, will be in vmlinux, not in a kernel module. As a
> result, I think a module parameter will be no different from a kernel
> command line parameter in this regard.

You can still change it in /sys at runtime, even if it's in the vmlinux.
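
For illustration, a minimal sketch of the mechanism (parameter and variable
names here are invented, not from the CNA patches): declaring a parameter
with writable permissions exposes it under /sys/module/<name>/parameters/
even for built-in (vmlinux) code, so it can be changed at runtime without a
reboot:

#include <linux/moduleparam.h>

/* hypothetical CNA tuning knob; 0644 makes it root-writable at runtime
 * via /sys/module/<modname>/parameters/numa_spinlock_threshold */
static int numa_spinlock_threshold = 10;
module_param(numa_spinlock_threshold, int, 0644);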

-Andi


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
> Now imagine we have an 8 node system, and memory
> pressure in the DMA32 zone of node 0.

The question is how much we still care about DMA32.
If there are problems they can probably just turn on the IOMMU for
these IO mappings.

-Andi


Re: [PATCH v2 00/16] Multigenerational LRU Framework

2021-04-14 Thread Andi Kleen
>2) It will not scan PTE tables under non-leaf PMD entries that do not
>   have the accessed bit set, when
>   CONFIG_HAVE_ARCH_PARENT_PMD_YOUNG=y.

This assumes that workloads have reasonable locality. Could there
be a worst case where only one or two pages under each PTE table are used,
so this PTE-skipping trick doesn't work?
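
For reference, a minimal sketch of the skipping trick under discussion
(helper name invented; pmd_young() tests the accessed bit of the non-leaf
PMD entry):

/* sketch: decide whether a PTE table is worth scanning at all */
static bool pte_table_worth_scanning(pmd_t *pmd)
{
        /*
         * If the CPU never set the accessed bit on the non-leaf PMD
         * entry, none of the up-to-512 PTEs below it were used, so the
         * whole PTE table can be skipped with a single check. The worst
         * case is one hot page per PTE table: every PMD looks young and
         * nothing gets skipped.
         */
        return pmd_young(*pmd);
}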

-Andi


Re: [External] : Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Andi Kleen
> > ms granularity seems very coarse grained for this. Surely
> > at some point of spinning you can afford a ktime_get? But ok.
> We are reading time when we are at the head of the (main) queue, but
> don’t have the lock yet. Not sure about the latency of ktime_get(), but
> anything reasonably fast but not necessarily precise should work.

Actually cpu_clock / sched_clock (see my other email). These should
be fast without corner cases and also monotonic.

> 
> > Could you turn that into a module_param which can be changed at runtime?
> > Would be strange to have to reboot just to play with this parameter
> Yes, good suggestion, thanks.
> 
> > This would also make the code a lot shorter I guess.
> So you don’t think we need the command-line parameter, just the module_param?

module_params can be changed at the command line too, so yes.

-Andi


Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Andi Kleen
Andi Kleen  writes:

> Alex Kogan  writes:
>>  
>> +numa_spinlock_threshold=[NUMA, PV_OPS]
>> +Set the time threshold in milliseconds for the
>> +number of intra-node lock hand-offs before the
>> +NUMA-aware spinlock is forced to be passed to
>> +a thread on another NUMA node.  Valid values
>> +are in the [1..100] range. Smaller values result
>> +in a more fair, but less performant spinlock,
>> +and vice versa. The default value is 10.
>
> ms granularity seems very coarse grained for this. Surely
> at some point of spinning you can afford a ktime_get? But ok.

Actually, thinking about it more, using jiffies is likely broken
anyways: if interrupts are disabled on the CPU that runs the main
timer interrupt, jiffies won't increase.

cpu_clock (better than ktime_get) or sched_clock would work.
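
For illustration, a minimal sketch of the alternative (the threshold
variable and lock-probe helper are placeholders): sched_clock() is a fast
per-CPU nanosecond counter that keeps advancing even with interrupts
disabled, unlike jiffies:

#include <linux/sched/clock.h>

/* sketch: time-bounded spinning without relying on jiffies */
u64 deadline = sched_clock() + threshold_ms * NSEC_PER_MSEC;

while (!try_lock_probe(lock)) {         /* placeholder for the CNA probe */
        if (sched_clock() > deadline)
                break;                  /* threshold hit: hand off to another node */
        cpu_relax();
}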

-Andi


Re: [PATCH v14 4/6] locking/qspinlock: Introduce starvation avoidance into CNA

2021-04-13 Thread Andi Kleen
Alex Kogan  writes:
>  
> + numa_spinlock_threshold=[NUMA, PV_OPS]
> + Set the time threshold in milliseconds for the
> + number of intra-node lock hand-offs before the
> + NUMA-aware spinlock is forced to be passed to
> + a thread on another NUMA node.  Valid values
> + are in the [1..100] range. Smaller values result
> + in a more fair, but less performant spinlock,
> + and vice versa. The default value is 10.

ms granularity seems very coarse grained for this. Surely
at some point of spinning you can afford a ktime_get? But ok.

Could you turn that into a module_param which can be changed at runtime?
Would be strange to have to reboot just to play with this parameter

This would also make the code a lot shorter I guess.

-Andi


Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers

2021-04-12 Thread Andi Kleen
> The reason why soft lockup happens may be the unmapped EPT pages. So, do we
> have a way to map all gpa before we use pebs on Skylake?

Can you configure a VT-d device? That will implicitly pin all pages for the
IOMMU. I *think* that should be enough for testing.

-Andi


Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-11 Thread Andi Kleen
> I have actually seen real user programs poke MSR_SYSCALL_MASK.

Hmm, what was the use case?

-Andi


Re: [PATCH] x86/msr: Block writes to certain MSRs unconditionally

2021-04-10 Thread Andi Kleen
Borislav Petkov  writes:

> From: Borislav Petkov 
> Date: Sat, 10 Apr 2021 14:08:13 +0200
>
> There are a bunch of MSRs which luserspace has no business poking at,
> whatsoever. Add a ban list and put the TSC-related MSRs in there. Issue
> a big juicy splat to catch offenders.

Have you ever seen any user programs actually write those MSRs?
I don't see why they ever would, it's not that they have any motivation
to do it (unlike SMM), and I don't know of any examples.

The whole MSR blocking seems more like a tilting-at-windmills type of
effort. Rootkits typically write from the kernel anyways. And the only
results we have so far are various legitimate debug and benchmark
utilities running much slower due to them flooding the kernel log with
warnings.

I can see that there are security reasons to lock down MSRs, but that is
already handled fine with existing sandbox and lockdown mechanisms. But
on a non locked down system fully accessible MSRs are really useful for
all kind of debugging and tuning, and anything that prevents that
is bad.

-Andi


Re: [PATCH v4 12/12] perf session: use reader functions to load perf data file

2021-04-08 Thread Andi Kleen


Except where I commented, for the series

Acked-by: Andi Kleen 

-Andi


Re: [PATCH v4 09/12] perf record: document parallel data streaming mode

2021-04-08 Thread Andi Kleen
> +--threads=<spec>::
> +Write collected trace data into several data files using parallel threads.
> +<spec> value can be user defined list of masks. Masks separated by colon
> +define cpus to be monitored by a thread and affinity mask of that thread
> +is separated by slash. For example user specification like the following:
> +<cpus mask 1>/<affinity mask 1>:<cpus mask 2>/<affinity mask 2> specifies

You need to be clearer on the exact syntax of a mask. Ideally give
some full examples too.

> +parallel threads layout that consists of two threads with corresponding
> +assigned cpus to be monitored. <spec> value can also be a string meaning
> +predefined parallel threads layout:
> +cpu- create new data streaming thread for every monitored cpu
> +core   - create new thread to monitor cpus grouped by a core
> +socket - create new thread to monitor cpus grouped by a socket
> +numa   - create new thread to monitor cpus grouped by a numa domain
> +Predefined layouts can be used on systems with large number of cpus in
> +order not to spawn multiple per-cpu streaming threads but still avoid LOST
> +events in data directory files. Option specified with no or empty value
> +defaults to cpu layout. Masks defined or provided by the option value are
> +filtered through the mask provided by -C option.
>  
> 
> 


Re: [PATCH v4 05/12] perf record: start threads in the beginning of trace streaming

2021-04-08 Thread Andi Kleen
> + err = write(thread->pipes.ack[1], &msg, sizeof(msg));
> + if (err == -1)
> + pr_err("threads[%d]: failed to notify on start. Error %m", thread->tid);

It might be safer to not use %m. I'm not sure if all the non-glibc
libcs that people use support it.
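
A sketch of the portable spelling, calling strerror(3) on errno right after
the failing call instead of relying on the glibc %m extension:

#include <errno.h>
#include <string.h>

err = write(thread->pipes.ack[1], &msg, sizeof(msg));
if (err == -1)
        pr_err("threads[%d]: failed to notify on start: %s",
               thread->tid, strerror(errno));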



Re: [PATCH v4 02/12] perf record: introduce thread specific data array

2021-04-08 Thread Andi Kleen
> + } else {
> + thread_data[t].tid = syscall(SYS_gettid);

That always fills in the tid of the setup thread instead of the target
threads?
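
If so, a minimal sketch of the usual fix (struct and function names here
are illustrative): the gettid syscall returns the id of the calling thread,
so it has to run inside each target thread's start routine rather than in
the setup loop:

#include <pthread.h>
#include <sys/syscall.h>
#include <unistd.h>

struct thread_data { pid_t tid; };

static void *thread_start(void *arg)
{
        struct thread_data *td = arg;

        /* Runs in the new thread, so this records that thread's tid,
         * not the tid of the thread that called pthread_create(). */
        td->tid = syscall(SYS_gettid);
        return NULL;
}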


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-04-07 Thread Andi Kleen
> Hmm, I forgot about that quirk.  I would expect the TDX Module to inject a #GP
> for that case.  I can't find anything in the spec that confirms or denies 
> that,
> but injecting #VE would be weird and pointless.
> 
> Andi/Sathya, the TDX Module spec should be updated to state that XSETBV will
> #GP at CPL!=0.  If that's not already the behavior, the module should probably
> be changed...

I asked about this and the answer was that XSETBV behaves architecturally inside
a TD (no #VE), thus there is nothing to document.

-Andi


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-07 Thread Andi Kleen
David Hildenbrand  writes:

> I have no idea how expensive would be bouncing writes (and reads?)
> through the kernel. Did you ever experiment with that/evaluate that?

I would expect it to be quite expensive, as in virtio IO performance
tanking.

-Andi


Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers

2021-04-07 Thread Andi Kleen
On Wed, Apr 07, 2021 at 11:05:20AM +0800, Liuxiangdong (Aven, Cloud 
Infrastructure Service Product Dept.) wrote:
> 
> 
> On 2021/4/6 20:47, Andi Kleen wrote:
> > > AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> > > doesn't.
> > > But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> > > counter in Skylake.
> > > Is there anything else that only Icelake supports in this patch set?
> > Only Icelake server has the support for recovering from an EPT violation
> > on the PEBS data structures. To use it on Skylake server you would
> > need to pin the whole guest, but that is currently not done.
> Sorry. Some questions about "Pin the whole guest". Do you mean VmPin equals
> VmSize
> in "/proc/$(pidof qemu-kvm)/status"? Or just VmLck equals VmSize? Or
> something else?

Either would be sufficient. All that matters is that the EPT pages don't get
unmapped ever while PEBS is active.

-Andi


Re: [RFCv1 7/7] KVM: unmap guest memory using poisoned pages

2021-04-07 Thread Andi Kleen
Christophe de Dinechin  writes:

> Is there even a theoretical way to restore an encrypted page e.g. from (host)
> swap without breaking the integrity check? Or will that only be possible with
> assistance from within the encrypted enclave?

Only the latter.

You would need ballooning. It's in principle possible, but currently
not implemented.

In general host swap without ballooning is usually a bad idea anyways
because it often just swaps a lot of cache data that could easily be
thrown away instead.

-andi


Re: [PATCH] perf report: Fix wrong LBR block sorting

2021-04-07 Thread Andi Kleen
> Now the hottest block is reported at the top of output.
> 
> Fixes: b65a7d372b1a ("perf hist: Support block formats with 
> compare/sort/display")
> Signed-off-by: Jin Yao 


Reviewed-by: Andi Kleen 
-Andi


Re: [PATCH v4 01/16] perf/x86/intel: Add x86_pmu.pebs_vmx for Ice Lake Servers

2021-04-06 Thread Andi Kleen
> AFAIK, Icelake supports adaptive PEBS and extended PEBS which Skylake
> doesn't.
> But we can still use IA32_PEBS_ENABLE MSR to indicate general-purpose
> counter in Skylake.
> Is there anything else that only Icelake supports in this patch set?

Only Icelake server has the support for recovering from an EPT violation
on the PEBS data structures. To use it on Skylake server you would
need to pin the whole guest, but that is currently not done.

> Besides, we have tried this patches set in Icelake.  We can use pebs(eg:
> "perf record -e cycles:pp")
> when guest is kernel-5.11, but can't when kernel-4.18.  Is there a minimum
> guest kernel version requirement?

You would need a guest kernel that supports Icelake server PEBS. 4.18
would need backports for that.


-Andi


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-03 Thread Andi Kleen
On Sat, Apr 03, 2021 at 09:26:24AM -0700, Dave Hansen wrote:
> On 4/2/21 2:32 PM, Andi Kleen wrote:
> >> If we go this route, what are the rules and restrictions?  Do we have to
> >> say "no MMIO in #VE"?
> > 
> > All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments"
> > After that it can nest without problems.
> 
> Well, not exactly.  You still can't do things that could cause an
> unbounded recursive #VE.

> It doesn't seem *that* far fetched to think that someone might try to
> defer some work or dump data to the console.

I believe the main console code has reentry protection.

I'm not sure about early_printk (with keep), but if that's the case
it probably should be fixed anyways. I can take a look at that.

Not sure why deferring something would cause another #VE?

 
> > If you nest before that the TDX will cause a triple fault.
> > 
> > The code that cannot do it is a few lines in the early handler which
> > runs with interrupts off.
> 
> >> Which brings up another related point: How do you debug TD guests?  Does
> >> earlyprintk work?
> > 
> > Today it works actually because serial ports are allowed. But I expect it to
> > be closed eventually because serial code is a lot of code to audit. 
> > But you can always disable the filtering with a command line option and
> > then it will always work for debugging.
> 
> Do we need a TDX-specific earlyprintk?  I would imagine it's pretty easy
> to implement.

Don't see a need at this point, the existing mechanisms work.

Maybe if we ever have a problem that only happens in lockdown *and* happens
early, but that's not very likely since lockdown primarily changes code
behavior later.

There are also other debug mechanisms for such cases: in TDX, if you
configure the TD for debug mode, it supports using the gdb stub on the
hypervisor.

-Andi



Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-02 Thread Andi Kleen
> If we go this route, what are the rules and restrictions?  Do we have to
> say "no MMIO in #VE"?

All we have to say is "No MMIO in #VE before getting the TDVEINFO arguments"
After that it can nest without problems.

If you nest before that the TDX will cause a triple fault.

The code that cannot do it is a few lines in the early handler which
runs with interrupts off.

The TDX module also makes sure to not inject NMIs while we're in
that region, so NMIs are of no concern.

That was the whole point of avoiding the system call gap problem. We don't
need to make it IST, so it can nest.

I'm not aware of any other special rules.

> Which brings up another related point: How do you debug TD guests?  Does
> earlyprintk work?

Today it works actually because serial ports are allowed. But I expect it to
be closed eventually because serial code is a lot of code to audit. 
But you can always disable the filtering with a command line option and
then it will always work for debugging.

-Andi


Re: [PATCH v2] kconfig: config script: add a little user help

2021-04-02 Thread Andi Kleen
On Sat, Dec 19, 2020 at 09:08:05AM -0800, Randy Dunlap wrote:
> Give the user a clue about the problem along with the 35 lines of
> usage/help text.

Reviewed-by: Andi Kleen 

-Andi


Re: [RFC v1 00/26] Add TDX Guest Support

2021-04-01 Thread Andi Kleen
> I've heard things like "we need to harden the drivers" or "we need to do
> audits" and that drivers might be "whitelisted".

The basic driver allowlisting patches are already in the repository,
but not currently posted or complete:

https://github.com/intel/tdx/commits/guest

> 
> What are we talking about specifically?  Which drivers?  How many
> approximately?  Just virtio?  

Right now just virtio, later other drivers that hypervisors need.

> Are there any "real" hardware drivers
> involved like how QEMU emulates an e1000 or rtl8139 device?  

Not currently (but some later hypervisor might rely on one of those)

> What about
> the APIC or HPET?

No IO-APIC, but the local APIC. No HPET.

> 
> How broadly across the kernel is this going to go?

Not very broadly for drivers.

> 
> Without something concrete, it's really hard to figure out if we should
> go full-blown paravirtualized MMIO, or do something like the #VE
> trapping that's in this series currently.

As Sean says the concern about MMIO is less about drivers (which should
be generally ok if they work on other architectures that require MMIO
magic) than about other odd code that only ran on x86 before.

I really don't understand your crusade against #VE. It really
isn't that bad if we can avoid the few corner cases.

For me it would seem wrong to force all MMIO for all drivers to some
complicated paravirt construct, blowing up code size everywhere
and adding complicated self-modifying code, when it's only needed for very
few drivers. But we also don't want to patch every MMIO access to be
special-cased even in those few drivers.

#VE based MMIO avoids all that cleanly while being nicely non-intrusive.

-Andi



Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Andi Kleen
On Wed, Mar 31, 2021 at 08:46:18PM -0700, Dave Hansen wrote:
> On 3/31/21 8:28 PM, Andi Kleen wrote:
> >> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> >> what is supported: CPUID.  If it screws up, and the guest gets an
> >> unexpected #VE, so be it.
> > The main reason for disabling stuff is actually so that we don't need
> > to harden it. All these things are potential attack paths.
> 
> Wait, MWAIT is an attack path?  If it were an attack path, wouldn't it

No, MWAIT is not, but lots of other things that can be controlled by the
host are. And that will be a motivation to disable things.

> >> We don't have all kinds of crazy handling in the kernel's #UD handler
> >> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
> >> to trust the underlying hardware to be sane.  If it isn't, we die a
> >> horrible death as fast as possible.  Why should TDX be any different?
> > That's what the original patch did -- no unnecessary checks -- but reviewers
> > keep asking for the extra checks, so Sathya added more. We have the not
> > unusual problem here that reviewers don't agree among themselves.
> 
> Getting consensus is a pain in the neck, eh?

It seems more like a circular argument currently.
> 
> It's too bad all the reviewers in the community aren't like all of the
> engineers at big companies where everyone always agrees. :)

I would propose to go back to the original patch without all the extra
checks. I think that's what you're arguing too. IIRC the person
who originally requested extra checks was Andy; if he's ok with
that too we can do it, so that you guys can finally move on
to the other patches that actually do more than just trivial things.

-Andi


Re: [PATCH v4 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-31 Thread Andi Kleen
> The hardware (and VMMs and SEAM) have ways of telling the guest kernel
> what is supported: CPUID.  If it screws up, and the guest gets an
> unexpected #VE, so be it.

The main reason for disabling stuff is actually so that we don't need
to harden it. All these things are potential attack paths.

> 
> We don't have all kinds of crazy handling in the kernel's #UD handler
> just in case a CPU mis-enumerates a feature and we get a #UD.  We have
> to trust the underlying hardware to be sane.  If it isn't, we die a
> horrible death as fast as possible.  Why should TDX be any different?

That's what the original patch did -- no unnecessary checks -- but reviewers
keep asking for the extra checks, so Sathya added more. We have the not
unusual problem here that reviewers don't agree among themselves.

-Andi


Re: [PATCH v1 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-30 Thread Andi Kleen
On Tue, Mar 30, 2021 at 12:56:41PM +0800, Xiaoyao Li wrote:
> On 3/27/2021 8:18 AM, Kuppuswamy Sathyanarayanan wrote:
> > In non-root TDX guest mode, MWAIT, MONITOR and WBINVD instructions
> > are not supported. So handle #VE due to these instructions as no ops.
> > 
> > Signed-off-by: Kuppuswamy Sathyanarayanan 
> > 
> > Reviewed-by: Andi Kleen 
> > ---
> > 
> > Changes since previous series:
> >   * Suppressed MWAIT feature as per Andi's comment.
> >   * Added warning debug log for MWAIT #VE exception.
> > 
> >   arch/x86/kernel/tdx.c | 23 +++
> >   1 file changed, 23 insertions(+)
> > 
> > diff --git a/arch/x86/kernel/tdx.c b/arch/x86/kernel/tdx.c
> > index e936b2f88bf6..fb7d22b846fc 100644
> > --- a/arch/x86/kernel/tdx.c
> > +++ b/arch/x86/kernel/tdx.c
> > @@ -308,6 +308,9 @@ void __init tdx_early_init(void)
> > setup_force_cpu_cap(X86_FEATURE_TDX_GUEST);
> > +   /* MWAIT is not supported in TDX platform, so suppress it */
> > +   setup_clear_cpu_cap(X86_FEATURE_MWAIT);
> 
> In fact, MWAIT bit returned by CPUID instruction is zero for TD guest. This
> is enforced by SEAM module.

Good point.
> 
> Do we still need to safeguard it by setup_clear_cpu_cap() here?

I guess it doesn't hurt to do it explicitly.


-Andi


Re: [PATCH v3 1/1] x86/tdx: Handle MWAIT, MONITOR and WBINVD

2021-03-29 Thread Andi Kleen
> > No, if these instructions take a #VE then they were executed at CPL=0.  
> > MONITOR
> > and MWAIT will #UD without VM-Exit->#VE.  Same for WBINVD, s/#UD/#GP.
> 
> Dare I ask about XSETBV?

XSETBV does not cause a #VE, it just works normally. The guest has full
AVX capabilities.

-Andi



Re: [RESEND PATCH v2] perf stat: improve readability of shadow stats

2021-03-21 Thread Andi Kleen
> > But is this a real problem?
> 
> perhaps not, Andi, any idea about this?

It's not a problem for my tools, which don't use the unit,
but I could imagine one for other parsers. I would recommend
not changing it for CSV, which is expected to be parsed
by tools.

-Andi


Re: [RFC PATCH v3 00/18] dynamic debug diet plan

2021-03-18 Thread Andi Kleen
Jim Cromie  writes:

> CONFIG_DYNAMIC_DEBUG creates a struct _ddebug (56 bytes) for each
> callsite, which includes 3 pointers to: module, filename, function.
> These are repetetive, and compressible, this patch series goes about
> doing that, it:

So how much memory does it actually save?
-Andi


Re: [PATCH v2 11/27] perf parse-events: Support hardware events inside PMU

2021-03-18 Thread Andi Kleen
> While we're discussing, do we really want to use the "core" and "atom"
> terms here? I thought cpu/cycles/ would be ok for the main (Big) CPU and

Yes absolutely.

> that we should come up with some short name for the "litle" CPUs.

There actually isn't a main CPU. 

There's nothing "better" about the big cores vs the Atoms 
anyways. They're all important CPUs.

And the system might have no "big" CPUs, but we won't know
until we've finished onlining all CPUs.

Or on Lakefield there are four Atoms and only a single big core.
So with a non-hybrid-aware profiler tool you would miss most of the
system if we used cpu// for the big core.

Also I think renaming is a good idea because it forces the software
or configuration to handle hybrid. Otherwise you just get subtle breakage
all the time with some CPUs not getting profiled.

It's a strategy similar to what we do in the source code when semantics
change.

ARM did this right.

-Andi



Re: [PATCH v2] perf stat: Align CSV output for summary mode

2021-03-17 Thread Andi Kleen
> If you care about not breaking existing scripts, then the output they
> get with what they use as command line options must continue to produce
> the same output.

It's not clear there are any useful ones (except for tools that handle
both). It's really hard to parse the previous mess. It's simply not
valid CSV.

That's why I'm arguing that keeping compatibility is not useful here.

We would be stuck with the broken mess as default forever.

-Andi


Re: [PATCH] perf stat: Align CSV output for summary mode

2021-03-16 Thread Andi Kleen
> Is it serious or just a joke? :)

I would prefer to not be compatible (at least not until someone complains),
but if compatibility is required then yes opting in to the broken
format would be better. Perhaps not with that name.

And the option could be hidden in the perf config file instead
of being on the command line.

-Andi


Re: [PATCH] perf stat: Align CSV output for summary mode

2021-03-16 Thread Andi Kleen
On Tue, Mar 16, 2021 at 04:05:13PM -0300, Arnaldo Carvalho de Melo wrote:
> On Tue, Mar 16, 2021 at 09:34:21AM -0700, Andi Kleen wrote:
> > > looks ok, but maybe make the option more related to CSV, like:
> > > 
> > >   --x-summary, --csv-summary  ...? 
> > 
> > Actually I don't think it should be a new option. I doubt
> > anyone could parse the previous mess. So just make it default
> > with -x
> 
> In these cases I always fear that people are already parsing that mess
> by considering the summary lines to be the ones not starting with
> spaces, and now we go on and change it to be "better" by prefixing it
> with "summary" and... break existing scripts.

I think it was just one version or so?

FWIW perf has broken CSV output several times, I added workarounds
to toplev every time. Having a broken version for a short time
shouldn't be too bad.

I actually had a workaround for this one, but it can parse either way.

> 
> Can we do this with a new option?
> 
> I.e. like --csv-summary?

If you do it I would add an option for the old broken format
--i-want-broken-csv. But not require the option forever
just to get sane output.

Or maybe only a perf config option.

-Andi


Re: [PATCH] perf stat: Align CSV output for summary mode

2021-03-16 Thread Andi Kleen
> looks ok, but maybe make the option more related to CVS, like:
> 
>   --x-summary, --cvs-summary  ...? 

Actually I don't think it should be a new option. I doubt
anyone could parse the previous mess. So just make it default
with -x

-Andi


Re: [PATCH v4 00/25] Page folios

2021-03-15 Thread Andi Kleen
Michal Hocko  writes:
> bikeshedding) because it hasn't really resonated with the underlying
> concept. Maybe just me as a non native speaker... page_head would have
> been so much more straightforward but not something I really care
> about.

Yes. page_head explains exactly what it is.

But Folio? Huh?

-Andi



Re: [PATCH] perf-stat: introduce bperf, share hardware PMCs with BPF

2021-03-15 Thread Andi Kleen
> I still think that there is value in taking those measurements after we
> enable the counters, as, for instance, for interval mode we want
> measurements with the counters enabled, whatever happens before the
> counters are enabled is just startup time, etc. Jiri, Andi?

Yes I agree. Only the time with counters enabled matters.


-Andi


Re: [PATCH v1 10/14] mm: multigenerational lru: core

2021-03-14 Thread Andi Kleen
Yu Zhao  writes:
> +
> +#ifdef CONFIG_MEMCG
> + if (memcg && atomic_read(&memcg->moving_account))
> + goto contended;
> +#endif
> + if (!mmap_read_trylock(mm))
> + goto contended;

These are essentially spinloops. Surely you need a cpu_relax() somewhere?

In general, for all spinloop-like constructs it would be useful to
consider how to teach lockdep about them.
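
For illustration, a sketch of where the relax hint would go if the
contended path loops back and retries (the retry loop itself is assumed,
condensed from the quoted goto-based flow):

/* sketch: retry loop with a pause hint between attempts */
while (!mmap_read_trylock(mm)) {
        cpu_relax();    /* let the sibling thread / memory system breathe */
}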

> + do {
> + old_flags = READ_ONCE(page->flags);
> + new_gen = ((old_flags & LRU_GEN_MASK) >> LRU_GEN_PGOFF) - 1;
> + VM_BUG_ON_PAGE(new_gen < 0, page);
> + if (new_gen >= 0 && new_gen != old_gen)
> + goto sort;
> +
> + new_gen = (old_gen + 1) % MAX_NR_GENS;
> + new_flags = (old_flags & ~LRU_GEN_MASK) | ((new_gen + 1UL) << LRU_GEN_PGOFF);
> + /* mark page for reclaim if pending writeback */
> + if (front)
> + new_flags |= BIT(PG_reclaim);
> + } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);

I see this cmpxchg flags pattern a lot. Could there be some common code
factoring?
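
For illustration, one possible shape of the factored helper (the name and
callback signature are inventions, not an existing kernel API):

/* sketch: one shared read/modify/cmpxchg retry loop, with the
 * per-caller flags transformation passed in as a callback */
static void page_flags_update(struct page *page,
                              unsigned long (*mod)(unsigned long old))
{
        unsigned long old_flags, new_flags;

        do {
                old_flags = READ_ONCE(page->flags);
                new_flags = mod(old_flags);
        } while (cmpxchg(&page->flags, old_flags, new_flags) != old_flags);
}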

-Andi


Re: [PATCH] perf stat: Create '--add-default' option to append default list

2021-03-12 Thread Andi Kleen
> A more concise syntax would be -e +event
> 
> The + would apply to the whole event list if there are multiple.
> 
> and maybe -event too to remove something from the default list.

Sorry that was an old email. Please ignore.

-Andi


Re: [PATCH] perf stat: Create '--add-default' option to append default list

2021-03-12 Thread Andi Kleen
On Tue, Dec 22, 2020 at 09:11:31AM +0800, Jin Yao wrote:
> The event default list includes the most common events which are widely
> used by users. But with -e option, the current perf only counts the events
> assigned by -e option. Users may want to collect some extra events with
> the default list. For this case, users have to manually add all the events
> from the default list. It's inconvenient. Also, users may not know how to
> get the default list.
> 
> It's better to add a new option to append default list to the -e events.
> The new option is '--add-default'.

A more concise syntax would be -e +event

The + would apply to the whole event list if there are multiple.

and maybe -event too to remove something from the default list.

-Andi


Re: [PATCH 0/3] perf tools: Minor improvements in event synthesis

2021-03-12 Thread Andi Kleen
On Mon, Dec 21, 2020 at 04:00:26PM +0900, Namhyung Kim wrote:
> Hello,
> 
> This is to optimize the event synthesis during perf record.
> 
> The first patch is to reduce memory usage when many threads are used.
> The second is to avoid unncessary syscalls for kernel threads.  And
> the last one is to reduce the number of threads to iterate when new
> threads are being created at the same time.
> 
> Unfortunately there's no dramatic improvement here but I can see ~5%
> gain in the 'perf bench internals synthesize' on a big machine.
> (The numbers are not stable though)

Looks all good to me. The VmPeak assumption might be slightly
fragile, but I guess there's nothing better currently.

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Andi Kleen
> If you want a uarch name, that might be the name of the core
> ( something-cove ... I can't keep track of the names of all the

GoldenCove for Sapphire Rapids/AlderLake.

But keep in mind that they're still different. Kind of like a different
distro with different patches and configuration from the same base kernel
release.

-Andi


Re: [PATCH V2 20/25] perf/x86/intel: Add Alder Lake Hybrid support

2021-03-11 Thread Andi Kleen
On Thu, Mar 11, 2021 at 08:58:32PM +0100, Peter Zijlstra wrote:
> On Thu, Mar 11, 2021 at 11:53:35AM -0500, Liang, Kan wrote:
> 
> > > > The "cpu_core" PMU is similar to the Sapphire Rapids PMU, but without
> > > > PMEM.
> > > > The "cpu_atom" PMU is similar to Tremont, but with different
> > > > event_constraints, extra_regs and number of counters.
> 
> > > So do these things use the same event lists as SPR and TNT?
> > 
> > No, there will be two new event lists on ADL. One is for Atom core, and the
> > other is for big core. They are different to SPR and TNT.
> 
> *sigh* how different?

Atom and big core event lists have significant differences.

Often event lists of the same core in different SOCs are different too
because some events indicate stuff outside the core (e.g. Offcore Response
and others)

> 
> > > Is there any
> > > way to discover that, because AFAICT /proc/cpuinfo will say every CPU
> > > is 'Alderlake', and the above also doesn't give any clue.
> > > 
> > 
> > Ricardo once submitted a patch to expose the CPU type under
> > /sys/devices/system/cpu, but I don't know the latest status.
> > https://lore.kernel.org/lkml/20201003011745.7768-5-ricardo.neri-calde...@linux.intel.com/
> 
> Yeah, but that was useless, it doesn't list the Cores as
> FAM6_SAPPHIRERAPIDS nor the Atom as FAM6_ATOM_TREMONT.

It's Gracemont, not Tremont.

But what would you do with the information that the core is related
to some other core?

The event lists are tied to the Alderlake model number. You cannot
just use event lists from some other part because there are 
differences to other Golden Cove or Gracemont implementations.

For non-event-list usages, the model numbers also indicate a lot of things
in the SOC, so even if you knew it was a somewhat similar core to Sapphire
Rapids, it wouldn't tell the complete story. Even on the cores there are
differences.

In the end you need to know that Alderlake is Alderlake.

-Andi


Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-03-11 Thread Andi Kleen
> When a brute force attack is detected through the fork or execve system call,
> all the tasks involved in the attack will be killed with the exception of the
> init task (task with pid equal to zero). Now, and only if the init task is
> involved in the attack, block the fork system call from the init process 
> during
> a user defined time (using a sysctl attribute). This way the brute force 
> attack
> is mitigated and the system does not panic.

That means nobody can log in and fix the system during that time.

Would be better to have that policy in init. Perhaps add some way
that someone doing wait*() can know the exit was due to this mitigation
(and not something else). Then they could disable respawning of that daemon.

-Andi


Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-03-11 Thread Andi Kleen

Thanks.

Okay but that means that the brute force attack can just continue
because the attacked daemon will be respawned?

You need some way to stop the respawning, otherwise the
mitigation doesn't work for daemons.


-Andi



Re: [PATCH V2 16/25] perf/x86: Register hybrid PMUs

2021-03-11 Thread Andi Kleen
> AFAICT we could register them all here. That instantly fixes that
> CPU_STARTING / CPU_DEAD fail elsewhere in this patch.

This would mean a system that only has Atoms or only has big cores
would still show the other CPU's PMU. We expect those to exist.

-Andi


[PATCH] perf/x86/kvm: Fix inverted pebs_no_isolation check

2021-03-10 Thread Andi Kleen
The pebs_no_isolation optimization check is inverted. We want to disable
PEBS isolation when the microcode is at least the revision in the table,
not for older microcode. So remove the extra !.

It meant that the new microcodes did unnecessary work, and the old
microcodes could crash the PEBS. Presumably most people are already
running with newer microcodes, so the functional impact is small.
But it should speed up the newer systems by the 2-4% claimed in
the original patch.

Cc: jmatt...@google.com
Fixes: 9b545c04abd4 ("perf/x86/kvm: Avoid unnecessary work ...")
Signed-off-by: Andi Kleen 
---
 arch/x86/events/intel/core.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/events/intel/core.c b/arch/x86/events/intel/core.c
index 5bac48d5c18e..d74deadd3a6d 100644
--- a/arch/x86/events/intel/core.c
+++ b/arch/x86/events/intel/core.c
@@ -4532,7 +4532,7 @@ static const struct x86_cpu_desc isolation_ucodes[] = {
 
 static void intel_check_pebs_isolation(void)
 {
-   x86_pmu.pebs_no_isolation = !x86_cpu_has_min_microcode_rev(isolation_ucodes);
+   x86_pmu.pebs_no_isolation = x86_cpu_has_min_microcode_rev(isolation_ucodes);
 }
 
 static __init void intel_pebs_isolation_quirk(void)
-- 
2.25.4



Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing

2021-03-09 Thread Andi Kleen
> The disk encryption is just one example and there might be others which
> we might not be aware of yet and we are not suspecting there is something
> wrong with the crypto code that needs to be fixed.

Then you don't have any leaks relating to branch tracing.

> restrict an external(in the sense that its not related to crypto or any
> other security related component) entity such as hardware assisted tracing
> like ARM coresight and so on. I don't see why or how the crypto code needs
> to be fixed for something that is not related to it although it is affected.

It's just a general property that if code handling secrets behaves in a
data-dependent way, it already leaks.
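
For illustration, the distinction in code form (a classic constant-time
selection idiom, not taken from the thread):

#include <stdint.h>

/* Leaky: the executed branch depends on the secret, so a branch
 * trace (or plain timing) reveals it. */
uint32_t select_leaky(int secret_bit, uint32_t a, uint32_t b)
{
        return secret_bit ? a : b;
}

/* Constant-time: the same instructions run either way, so there is
 * nothing for a branch trace to observe. */
uint32_t select_ct(int secret_bit, uint32_t a, uint32_t b)
{
        uint32_t mask = -(uint32_t)(secret_bit & 1);
        return (a & mask) | (b & ~mask);
}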


> The analogy would be like that of victims and a perpetrator. Let's take
> coresight as an example of the perpetrator and crypto as the victim here.
> Now we can try

There's no victim with branch tracing, unless it is already leaky.

> If we just know one victim (lets say crypto code here), what happens to the
> others which we haven't identified yet? Do we just wait for someone to write
> an exploit based on this and then scramble to fix it?

For a useful security mitigation you need a threat model first I would say.

So you need to have at least some idea how an attack with branch
tracing would work.


> Initial change was to restrict this only to HW assisted instruction tracing 
> [1]

I don't think it's needed for instruction tracing.

-Andi


Re: [PATCH] perf auxtrace: Fix auxtrace queue conflict

2021-03-08 Thread Andi Kleen
On Mon, Mar 08, 2021 at 05:11:43PM +0200, Adrian Hunter wrote:
> The only requirement of an auxtrace queue is that the buffers are in
> time order.  That is achieved by making separate queues for separate
> perf buffer or AUX area buffer mmaps.
> 
> That generally means a separate queue per cpu for per-cpu contexts,
> and a separate queue per thread for per-task contexts.
> 
> When buffers are added to a queue, perf checks that the buffer cpu
> and thread id (tid) match the queue cpu and thread id.
> 
> However, generally, that need not be true, and perf will queue
> buffers correctly anyway, so the check is not needed.
> 
> In addition, the check gets erroneously hit when using sample mode
> to trace multiple threads.
> 
> Consequently, fix that case by removing the check.

Thanks!

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-03-07 Thread Andi Kleen
On Sun, Mar 07, 2021 at 07:05:41PM +0100, John Wood wrote:
> On Sun, Mar 07, 2021 at 09:25:40AM -0800, Andi Kleen wrote:
> > > processes created from it will be killed. If the systemd restart the 
> > > network
> > > daemon and it will crash again, then the systemd will be killed. I think 
> > > this
> > > way the attack is fully mitigated.
> >
> > Wouldn't that panic the system? Killing init is usually a panic.
> 
> The mitigation acts only over the process that crashes (network daemon) and 
> the
> process that exec() it (systemd). This mitigation don't go up in the processes
> tree until reach the init process.

Most daemons have some supervisor that respawns them when they crash. 
(maybe read up on "supervisor trees" if you haven't, it's a standard concept)

That's usually (but not always) init, as in systemd. There might be something
in between it and init, but likely init would respawn that something in
between. One of the main tasks of init is to respawn things under it.

If you have a supervisor tree starting from init the kill should eventually
travel up to init.

At least that's the theory. Do you have some experiments that show
this doesn't happen?

> 
> Note: I am a kernel newbie and I don't know if the systemd is init. Sorry if 
> it
> is a stupid question. AFAIK systemd is not the init process (the first process
> that is executed) but I am not sure.

At least the part of systemd that respawns is often (but not always) init.

> 
> >
> > > > Or if it's a interactive login you log in again.
> > >
> > > First the login will be killed (if it fails with a fatal signal) and if 
> > > it is
> > > restarted, the process that exec() it again will be killed. In this case 
> > > I think
> > > that the threat is also completely mitigated.
> >
> > Okay so sshd will be killed. And if it gets restarted eventually init,
> > so panic again.
> 
> In this scenario the process that exec() the login will be killed (sshd
> process). But I think that sshd is not the init process. So no panic.

sshd would be respawned by the supervisor, which is likely init.

> > That's a fairly drastic consequence because even without panic
> > it means nobody can fix the system anymore without a console.
> 
> So, you suggest that the mitigation method for the brute force attack through
> the execve system call should be different (not kill the process that exec).
> Any suggestions would be welcome to improve this feature.

If the system is part of some cluster, then panicking on attack or failure
could be a reasonable reaction. Some other system in the cluster should
take over. There's also a risk that all the systems get taken
out quickly one by one; in this case you might still need something
like the below.

But it's something that would need to be very carefully considered
for the environment.

The other case is when there isn't some fallback, as in a standalone
machine.

It could only be used when the supervisor daemons are aware of it.
Often they already have respawn limits, but they would need to make sure
those trigger before your algorithm triggers. Or maybe some way to opt out
per process. Then the DoS would be only against that process, but
not everything on the machine.

So I think it needs more work on the user space side for most usages.

-Andi


Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-03-07 Thread Andi Kleen
> processes created from it will be killed. If the systemd restart the network
> daemon and it will crash again, then the systemd will be killed. I think this
> way the attack is fully mitigated.

Wouldn't that panic the system? Killing init is usually a panic.

> > Or if it's a interactive login you log in again.
> 
> First the login will be killed (if it fails with a fatal signal) and if it is
> restarted, the process that exec() it again will be killed. In this case I 
> think
> that the threat is also completely mitigated.

Okay so sshd will be killed. And if it gets restarted, eventually init
will be too, so panic again.

That's a fairly drastic consequence because even without panic 
it means nobody can fix the system anymore without a console.

So probably the mitigation means that most such attacks eventually lead
to a panic because they will reach init sooner or later.

Another somewhat worrying case is some bug that kills KVM guests.
So if the bug can be triggered frequently you can kill all the
virtualization management infrastructure.

I don't remember seeing a discussion of such drastic consequences in
your description. It might be ok depending on the use case,
but people certainly need to be aware of it.

It's probably not something you want to have enabled by default ever.

-Andi



Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-03-07 Thread Andi Kleen
Sorry for the late answer. I somehow missed your email earlier.

> As a mitigation method, all the offending tasks involved in the attack are
> killed. Or in other words, all the tasks that share the same statistics
> (statistics showing a fast crash rate) are killed.

So systemd will just restart the network daemon and then the attack works
again?

Or if it's a interactive login you log in again.

I think it might be useful even with these limitations, but it would
be good to spell out the limitations of the method more clearly.

I suspect to be useful it'll likely need some user space configuration
changes too.

-Andi


Re: [PATCH] perf pmu: Validate raw event with sysfs exported format bits

2021-03-04 Thread Andi Kleen
> Single event:
> 
>   # ./perf stat -e cpu/r031234/ -a -- sleep 1
>   WARNING: event config '31234' not valid (bits 16 17 not supported by kernel)!

Just noticed that again. Can you please print the original event as a
string in the message? While it's obvious with rXXX which one it is,
it may not be obvious with offsetted fields (like umask=0x121212),
and it's hard to find in a long command line.

-Andi


Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing

2021-03-04 Thread Andi Kleen
Andi Kleen  writes:
>
> Normally disk encryption is in specialized work queues. It's total
> overkill to restrict all of the kernel if you just want to restrict
> those work queues.
>
> I would suggest some more analysis where secrets are actually stored
> and handled first.

Also thinking about this more:

You really only want to limit data tracing here.

If tracing branches could reveal secrets the crypto code would be
already insecure due to timing side channels. If that's the
case it would already require fixing the crypto code.

-Andi


Re: [PATCHv2 0/4] perf/core: Add support to exclude kernel mode PMU tracing

2021-03-04 Thread Andi Kleen
Sai Prakash Ranjan  writes:
>
> "Consider a system where disk contents are encrypted and the encryption
> key is set up by the user when mounting the file system. From that point
> on the encryption key resides in the kernel. It seems reasonable to
> expect that the disk encryption key be protected from exfiltration even
> if the system later suffers a root compromise (or even against insiders
> that have root access), at least as long as the attacker doesn't
> manage to compromise the kernel."

Normally disk encryption is in specialized work queues. It's total
overkill to restrict all of the kernel if you just want to restrict
those work queues.

I would suggest some more analysis where secrets are actually stored
and handled first.

-Andi


Re: [PATCH v5 7/8] Documentation: Add documentation for the Brute LSM

2021-02-28 Thread Andi Kleen
John Wood  writes:
> +
> +To detect a brute force attack it is necessary that the statistics shared by 
> all
> +the fork hierarchy processes be updated in every fatal crash and the most
> +important data to update is the application crash period.

So I haven't really followed the discussion and also not completely read
the patches (so apologies if that was already explained or is documented
somewhere else).

But what I'm missing here is some indication of how much
memory these statistics can use up and how they are limited.

How much is the worst case extra memory consumption?

If there is no limit how is DoS prevented?

If there is a limit, there likely needs to be a way to throw out
information, and so the attack would just shift to forcing the kernel
to throw out this information before retrying.

e.g. if the data is held for the parent shell: restart the parent
shell all the time.
e.g. if the data is held for the sshd daemon used to log in:
somehow cause sshd to respawn to discard the statistics.

Do I miss something here? How is that mitigated?

Instead of discussing all the low level tedious details of the
statistics it would be better to focus on these "high level"
problems here.

-Andi



Re: [PATCH] perf vendor events: Remove 'type' field from mapfiles

2021-02-26 Thread Andi Kleen
> I assumed that it was this file:
> https://download.01.org/perfmon/mapfile.csv

Yep.

> 
> which already has a different format - extra columns - so figured that some
> handcrafting was already going on. Indeed 'type' is not always 'core' there.

Yes. Could possibly add the extra columns from there just to stay compatible,
but being a subset is ok too.

-Andi


Re: [PATCH] perf vendor events: Remove 'type' field from mapfiles

2021-02-26 Thread Andi Kleen
On Fri, Feb 26, 2021 at 09:08:17PM +0800, John Garry wrote:
> Included for each CPU entry in the per-arch mapfile.csv file is a 'type'
> field.

I don't like it because this will make the mapfile incompatible with
what download.01.org uses. We originally derived it from that.

It would be ok if it fixed something serious, but as it seems to be
more cosmetic I don't think there's a good reason to deviate.

-Andi


Re: [PATCH 00/11] perf intel-pt: Add limited support for tracing guest kernels

2021-02-18 Thread Andi Kleen
On Thu, Feb 18, 2021 at 11:57:50AM +0200, Adrian Hunter wrote:
> Hi
> 
> Currently, only kernel tracing is supported and only with "timeless" decoding
> i.e. no TSC timestamps

Patches look good to me. That will be quite useful.

Acked-by: Andi Kleen 

Thanks,
-Andi


Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-14 Thread Andi Kleen
On Fri, Feb 12, 2021 at 01:48:36PM -0800, Dave Hansen wrote:
> On 2/12/21 1:47 PM, Andy Lutomirski wrote:
> >> What about adding a property to the TD, e.g. via a flag set during TD 
> >> creation,
> >> that controls whether unaccepted accesses cause #VE or are, for all 
> >> intents and
> >> purposes, fatal?  That would allow Linux to pursue treating EPT #VEs for 
> >> private
> >> GPAs as fatal, but would give us a safety and not prevent others from 
> >> utilizing
> >> #VEs.
> > That seems reasonable.
> 
> Ditto.
> 
> We first need to double check to see if the docs are right, though.

I confirmed with the TDX module owners that #VE can only happen for:
- unaccepted pages
- instructions like MSR access or CPUID
- specific instructions that are not in the syscall gap

Also if there are future asynchronous #VEs they would only happen
with IF=1, which would also protect the gap.

So no need to make #VE an IST.

-Andi



Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Andi Kleen
> > > So what happens if NMI happens here, and triggers a nested #VE ?
> > 
> > Yes that's a gap. We should probably bail out and reexecute the original
> > instruction. The VE handler would need to set a flag for that.
> > 
> > Or alternatively the NMI always gets the VE information and puts
> > it on some internal stack, but that would seem clunkier.
> 
> The same is possible with MCE and #DB I imagine.

I don't think there are currently any plans to inject #MC into TDX guests. It's
doubtful this could be done securely.

#DB is trickier because it will happen every time, so simply reexecuting
won't work. I guess it would need the ve info stack, or some care in 
kprobes/kernel
debugger that it cannot happen. I think I would prefer the latter.

-Andi



Re: [RFC v1 05/26] x86/traps: Add #VE support for TDX guest

2021-02-08 Thread Andi Kleen
> Which is supposedly then set up to avoid #VE during the syscall gap,
> yes? Which then results in #VE not having to be IST.

Yes that is currently true because all memory is pre-accepted.

If we ever do lazy accept we would need to make sure the memory accessed in
the syscall gap is already accepted, or move over to an IST.

> > +#ifdef CONFIG_INTEL_TDX_GUEST
> > +DEFINE_IDTENTRY(exc_virtualization_exception)
> > +{
> > +   struct ve_info ve;
> > +   int ret;
> > +
> > +   RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
> > +
> > +   /* Consume #VE info before re-enabling interrupts */
> 
> So what happens if NMI happens here, and triggers a nested #VE ?

Yes that's a gap. We should probably bail out and reexecute the original
instruction. The VE handler would need to set a flag for that.

Or alternatively the NMI always gets the VE information and puts
it on some internal stack, but that would seem clunkier.


-Andi


Re: [PATCH RESEND] perf/x86/kvm: Add Cascade Lake Xeon steppings to isolation_ucodes[]

2021-02-05 Thread Andi Kleen
On Fri, Feb 05, 2021 at 11:13:24AM -0800, Jim Mattson wrote:
> Cascade Lake Xeon parts have the same model number as Skylake Xeon
> parts, so they are tagged with the intel_pebs_isolation
> quirk. However, as with Skylake Xeon H0 stepping parts, the PEBS
> isolation issue is fixed in all microcode versions.
> 
> Add the Cascade Lake Xeon steppings (5, 6, and 7) to the
> isolation_ucodes[] table so that these parts benefit from Andi's
> optimization in commit 9b545c04abd4f ("perf/x86/kvm: Avoid unnecessary
> work in guest filtering").

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH 4/0] perf intel-pt: Add PSB events

2021-02-05 Thread Andi Kleen
On Fri, Feb 05, 2021 at 07:53:46PM +0200, Adrian Hunter wrote:
> Hi
> 
> Here are 3 fixes and 1 minor new feature, for Intel PT.

For the series:

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH] perf/core: Emit PERF_RECORD_LOST for pinned events

2021-01-18 Thread Andi Kleen
> > I don't think I object to having an even in the stream, but your LOST
> > event is unfortunate in that it itself can get lost when there's no
> > space in the buffer (which arguably is unlikely, but still).
> >
> > So from that point of view, I think overloading LOST is not so very nice
> > for this.
> 
> But anything can get lost in case of no space.
> Do you want to use something other than the LOST event?

Could always reserve the last entry in the ring buffer for a LOST event,
that would guarantee you can always get one out.

-Andi


Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

2021-01-15 Thread Andi Kleen
On Fri, Jan 15, 2021 at 10:51:38AM -0800, Sean Christopherson wrote:
> On Fri, Jan 15, 2021, Andi Kleen wrote:
> > > I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS 
> > > PMI"
> > > guaranteed to be atomic?
> > 
> > Of course not.
> 
> So there's still a window where the guest could observe the bad counter index,
> correct?

Yes.

But with single record PEBS it doesn't really matter with normal perfmon
drivers.

-Andi


Re: [PATCH v3 00/17] KVM: x86/pmu: Add support to enable Guest PEBS via DS

2021-01-15 Thread Andi Kleen
> I'm asking about ucode/hardare.  Is the "guest pebs buffer write -> PEBS PMI"
> guaranteed to be atomic?

Of course not.
> 
> In practice, under what scenarios will guest counters get cross-mapped?  And,
> how does this support affect guest accuracy?  I.e. how bad do things get for 
> the
> guest if we simply disable guest counters if they can't have a 1:1 association
> with their physical counter?

This would completely break perfmon for the guest, likely with no way to
recover.

-Andi


Re: [PATCH] mm: Fix potential pte_unmap_unlock pte error

2021-01-10 Thread Andi Kleen
On Sat, Jan 09, 2021 at 03:01:18AM -0500, Miaohe Lin wrote:
> Since commit 42e4089c7890 ("x86/speculation/l1tf: Disallow non privileged
> high MMIO PROT_NONE mappings"), when the first pfn modify is not allowed,
> we would break the loop with pte unchanged. Then the wrong pte - 1 would
> be passed to pte_unmap_unlock.

Thanks.

While the fix is correct, I'm not sure if it actually is a real bug. Is there
any architecture that would do something other than unlocking the underlying
page?  If it's just the underlying page then it should always be the same
page, so no bug.

That said, of course the change is the right thing for mainline, but it
probably doesn't need to be backported.

-Andi


Re: [RFC PATCH 0/7] Share events between metrics

2020-12-15 Thread Andi Kleen
> Or, is the concern more about trying to time-slice the results in a 
> fairly granular way and expecting accurate results then?

Usually the later. It's especially important for divisions. You want
both divisor and dividend to be in the same time slice, otherwise
the result usually doesn't make a lot of sense.

-Andi


Re: [PATCH 0/5] perf stat: Introduce --iiostat mode to provide I/O performance metrics

2020-12-14 Thread Andi Kleen
> My first thought was: Why not have a 'perf iiostat' subcommand?

Same would apply to a lot of options in perf stat.

I guess you could add some aliases to "perf" that give shortcuts
for common perf stat command lines.

-Andi


Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2020-12-02 Thread Andi Kleen
On Wed, Dec 02, 2020 at 11:47:25AM -0800, Stephane Eranian wrote:
> On Wed, Dec 2, 2020 at 11:28 AM Andi Kleen  wrote:
> >
> > > + prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> > > + next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> > > +
> > > + if (prev_cgrp != next_cgrp)
> > > + perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);
> >
> > Seems to be the perf cgroup only, not all cgroups.
> > That's a big difference and needs to be documented properly.
> >
> We care about the all-cgroup case.

Then it's not correct I think. You need a different hook point.

-Andi


Re: [RFC 1/2] perf core: Add PERF_COUNT_SW_CGROUP_SWITCHES event

2020-12-02 Thread Andi Kleen
> + prev_cgrp = task_css_check(prev, perf_event_cgrp_id, 1)->cgroup;
> + next_cgrp = task_css_check(next, perf_event_cgrp_id, 1)->cgroup;
> +
> + if (prev_cgrp != next_cgrp)
> + perf_sw_event_sched(PERF_COUNT_SW_CGROUP_SWITCHES, 1, 0);

Seems to be the perf cgroup only, not all cgroups.
That's a big difference and needs to be documented properly.

Probably it would make sense to have two events, one for
all cgroups and one for perf only.



-Andi


Re: [PATCH v4 5/5] perf metric: Don't compute unused events.

2020-12-01 Thread Andi Kleen
>  Can't this be all in a macro? It's still a lot of duplication.
> 
>  I'm still not a fan, but I think with a macro it wouldn't be too bad.
> 
>Ugh, not a fan of macros. They expand to a single line of code and make
>debugging hard. I'll do a v5 with a macro :-/

I suppose you could use inlines with callbacks too, with one-liner functions
for the operands.
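
For illustration, a sketch of the inline-with-callbacks shape (the struct
name and helpers are invented; compute_ids, is_const() and union_expr() as
in the quoted grammar):

/* sketch: shared const-fold-or-union logic, one-liner ops */
static inline struct expr_val binary_op(struct expr_val l, struct expr_val r,
                                        double (*op)(double, double))
{
        struct expr_val res;

        if (!compute_ids || (is_const(l.val) && is_const(r.val))) {
                res.val = op(l.val, r.val);     /* constant fold */
                res.ids = NULL;
        } else {
                res = union_expr(l, r);         /* needs event IDs */
        }
        return res;
}

static double op_sub(double a, double b) { return a - b; }
/* the grammar action shrinks to: $$ = binary_op($1, $3, op_sub); */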

-Andi


Re: [PATCH v4 5/5] perf metric: Don't compute unused events.

2020-12-01 Thread Andi Kleen
> +| expr '-' expr
> +{
> + if (!compute_ids || (is_const($1.val) && is_const($3.val))) {
> + assert($1.ids == NULL);
> + assert($3.ids == NULL);
> + $$.val = $1.val - $3.val;
> + $$.ids = NULL;
> + } else {
> + /* LHS and/or RHS need computing from event IDs so union. */
> + $$ = union_expr($1, $3);
> + }

Can't this be all in a macro? It's still a lot of duplication.

I'm still not a fan, but I think with a macro it wouldn't be too bad.
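
For example, a sketch of the macro (name assumed, body derived from the
actions quoted above):

/* Hypothetical helper folding the repeated binary-op actions. */
#define BINARY_OP(result, op, lhs, rhs)					\
do {									\
	if (!compute_ids || (is_const((lhs).val) && is_const((rhs).val))) { \
		assert((lhs).ids == NULL);				\
		assert((rhs).ids == NULL);				\
		(result).val = (lhs).val op (rhs).val;			\
		(result).ids = NULL;					\
	} else {							\
		/* LHS and/or RHS need computing from event IDs so union. */ \
		(result) = union_expr(lhs, rhs);			\
	}								\
} while (0)

Then each rule body shrinks to e.g. { BINARY_OP($$, -, $1, $3); }.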


-Andi


Re: [PATCH 0/3] clear_warn_once: add timed interval resetting

2020-11-29 Thread Andi Kleen
On Thu, Nov 26, 2020 at 01:30:26AM -0500, Paul Gortmaker wrote:
> But you currently can't make use of clear_warn_once unless you've got
> debugfs enabled and mounted - which may not be desired by some people
> in some deployment situations.

Seems awfully special purpose. Is the problem with debugfs security,
or is it that there's no convenient process that could do cron-like functionality?

If it's the former, perhaps what they really need is a way to get
a partial debugfs?

-Andi


Re: [PATCH v3 5/5] perf metric: Don't compute unused events.

2020-11-22 Thread Andi Kleen
> +| expr '|' expr
> +{
> + if (!compute_ids || (isfinite($1.val) && isfinite($3.val))) {
> + assert($1.ids == NULL);
> + assert($3.ids == NULL);
> + $$.val = (long)$1.val | (long)$3.val;
> + $$.ids = NULL;
> + } else {
> + /*
> +  * LHS or RHS needs to be computed from event IDs, consequently
> +  * so does this expression. Set val to NAN to show that the set
> +  * of all values is possible, the events are the union of those
> +  * on the LHS and RHS.
> +  */
> + $$.val = NAN;
> + $$.ids = ids__union($1.ids, $3.ids);
> + }


Sorry, still not a fan of having this NaN code all over. It's just ugly.

If you don't want to do the syntax change to still do it in one pass,
and given that building an AST would be a little complicated:

The original parser I based this code on actually had a byte code version too
(see attachment). With that one the lazy evaluation could be done at the
byte code level. Originally I didn't include it because it wasn't really
needed for perf, but presumably if we want to do more complicated
things it might be useful.

In theory it could speed up performance slightly when an expression needs
to be computed multiple times in interval mode.

-Andi
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <assert.h>

#include "code.h"
#include "sym.h"
#include "error.h"

static unsigned char *bufs;
static unsigned bufl;
static unsigned char *bufp;

static void overflow(void)
{
yyerror("expression exceeds execution buffer");
}

void put_op(enum ops op)
{
if (bufp >= bufs + bufl)
overflow();
*bufp++ = op;
}

void put_long(enum ops op, long arg)
{
if (bufp + sizeof(long) + 1 >= bufs + bufl)
overflow();
*bufp++ = op;
memcpy(bufp, &arg, sizeof(long));
bufp += sizeof(long);
}

void put_ptr(enum ops op, void * arg)
{
if (bufp + sizeof(void *) + 1 >= bufs + bufl)
overflow();
*bufp++ = op;
memcpy(bufp, &arg, sizeof(void *));
bufp += sizeof(void *);
}

void start_def(struct symbol *s)
{
if (!s->undef)
yyerror("symbol %s redefined", s->name);
}

void end_def(struct symbol *s)
{
int len;

put_op(OP_RET);
len = bufp - bufs;
s->code = malloc(len);
memcpy(s->code, bufs, len);
bufp = bufs;
s->undef = 0;
}

static void execerror(char const *msg)
{
fprintf(stderr, "expression execution error: %s\n", msg);
exit(1);
}

#define STACKSIZE 16
#define CSTACKSIZE 16

long execute(unsigned char *bufp)
{
static void *target[] = {
[OP_END]    = &&end,
[OP_EVENT]  = NULL,
[OP_NUMBER] = &&number,
[OP_PLUS]   = &&plus,
[OP_MINUS]  = &&minus,
[OP_MUL]    = &&mul,
[OP_DIV]    = &&div,
[OP_MOD]    = &&mod,
[OP_NEGATE] = &&negate,
[OP_CALL]   = &&call,
[OP_RET]    = &&ret,
[OP_XOR]    = &&xor,
[OP_OR]     = &&or,
[OP_NOT]    = &&not,
[OP_AND]    = &&and,
[OP_SHL]    = &&shl,
[OP_SHR]    = &&shr,
};
long a, b;
long stack[STACKSIZE];
int stackp = 0;
unsigned char *callstack[CSTACKSIZE];
int callstackp = 0;

#define getop(x) memcpy(&(x), bufp, sizeof(x)); bufp += sizeof(x)
#define push(x)  stack[stackp++] = (x)
#define pop()stack[--stackp]
#define next()   goto *target[(int)*bufp++]

next();

number:
if (stackp == STACKSIZE)
execerror("expression stack overflow");
getop(a);
push(a);
next();

#define OP(op) \
b = pop();  \
a = pop();  \
push(a op b);   \
next()

plus:   OP(+);
minus:  OP(-);
mul:OP(*);

div:
b = pop();
if (b == 0)
execerror("division by 0");
a = pop();
push(a / b);
next();

mod:
b = pop();
if (b == 0)
execerror("modulo by 0");
a = pop();
push(a % b);
next();

negate:
a = pop();
push(-a);
next();

and:OP(&);
or: OP(|);
xor:OP(^);
shl:OP(<<);
shr:OP(>>);

not:
a = pop();
push(~a);
next();

call:   {
struct symbol *s;
getop(s);
if (callstackp == CSTACKSIZE)
execerror("call stack overflow");
callstack[callstackp++] = bufp;
bufp = s->code;
next();
}

ret:
bufp = callstack[--callstackp];
next();

end:
assert(bufp == bufs || stackp == 1);
assert(callstackp == 0);
return stack[0];
}

#undef next
#undef push
#undef pop
#undef getop

void code_init(unsigned char *buf, unsigned 

Re: [PATCH] perf vendor events: Update Skylake client events to v50

2020-11-16 Thread Andi Kleen
> You mean change event-converter-for-linux-perf to add this as JSON
> comments at the start of the generated files?

JSON doesn't support comments unfortunately
(unless they're somehow part of the data schema)

-Andi


Re: [PATCH] perf vendor events: Update Skylake client events to v50

2020-11-16 Thread Andi Kleen
>I'd prefer if we could make a copy of these scripts into the kernel tree
>along with the data files from:
>https://download.01.org/perfmon/

FWIW I originally tried this to include the raw JSON files, but Ingo objected
and wanted the files to be split up[1] and avoiding redundant headers. 
That's why we ended up with the "compiled" format.

[1] https://lkml.org/lkml/2015/5/28/336

>Having a generated file that can't be edited, reviewed, .. I also worry
>that if/when we change the json format then event-converter-for-linux-perf
>will need to support a multitude of perf versions.

You mean for backports? The assumption was that perf backports would backport
any perf changes for new formats too.

I generally discourage any perf tools backports; people should just use
a new version of the perf tool on old kernels.

In principle the scripts could be included, but without the raw files it would
be somewhat pointless.

-andi


Re: [PATCH] perf vendor events: Update Skylake client events to v50

2020-11-15 Thread Andi Kleen
> Can I get ACK for this patch?
> 
> It's mainly for fixing the perf-test issue and MEM_Parallel_Reads issue.

Acked-by: Andi Kleen 

The JSON updates should be imho just merged automatically. Not much
point in doing code review on them. If there's a problem it has 
to be fixed JSON upstream anyways.

-Andi


Re: [PATCH] perf stat: Take cgroups into account for shadow stats

2020-11-15 Thread Andi Kleen
Actually, thinking about it more, you should probably pass ctx/cgroup
around in a single abstract argument. Otherwise we'll have to change all the metrics
functions for the next filter too.
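
For illustration, the abstract argument could be as simple as this sketch
(name and layout assumed):

struct runtime_stat_key {
	int ctx;		/* existing context filter */
	struct cgroup *cgrp;	/* the new cgroup filter */
	/* the next filter becomes one more field, not one more argument */
};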


Re: [PATCH] perf stat: Take cgroups into account for shadow stats

2020-11-15 Thread Andi Kleen
> @@ -57,6 +59,9 @@ static int saved_value_cmp(struct rb_node *rb_node, const 
> void *entry)
>   if (a->ctx != b->ctx)
>   return a->ctx - b->ctx;
>  
> + if (a->cgrp != b->cgrp)
> + return (char *)a->cgrp < (char *)b->cgrp ? -1 : +1;

This means the sort order will depend on heap randomization, 
which will make it harder to debug.

Better use something stable like the inode number of the cgroup.
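
E.g. a sketch along these lines, assuming the tool-side struct cgroup
carries the inode-based id (NULL handling omitted):

	if (a->cgrp != b->cgrp)
		return a->cgrp->id < b->cgrp->id ? -1 : +1;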

Do we have the same problem with other filters?

The rest of the patch looks good to me.

-Andi


Re: [PATCH 5/5] perf metric: Don't compute unused events.

2020-11-13 Thread Andi Kleen
The patch does a lot of stuff and is hard to review.

The earlier patches all look good to me.

>  
>  static int
>  __expr__parse(double *val, struct expr_parse_ctx *ctx, const char *expr,
> -   int start, int runtime)
> +   bool compute_ids, int runtime)
>  {
>   struct expr_scanner_ctx scanner_ctx = {
> - .start_token = start,

Okay so why do we not need that anymore?

This should probably be a separate change.

I'm not sure about the NaN handling. Why can't you use some
other value (perhaps 0, with some special case for division)?

After all the computed values of the dead branch shouldn't 
be used anyways, so I don't think we need the extra NaN 
ceremony everywhere.

The only thing that really matters is to track that the event
is not added and not resolved.

I think it would be a lot simpler if we changed the IF syntax.
The problem right now is that the condition is after the first
expression, so you can't use global state with an early reduction
to track if the if branch is dead or not.

If we used the C style cond ? a : b  it could be

cond { set flag } '?' expr  { reset flag } : expr { reset flag }

scanner checks flag and ignores the event if false

This would need changing the JSON files but that should be easy enough
with a script.
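
A rough grammar sketch of the idea (flag name assumed; it relies on the
condition value being constant-folded by the time '?' is reduced):

expr:	  expr '?' { skip_events = ($1.val == 0); } /* then-branch dead if cond false */
	  expr     { skip_events = ($1.val != 0); } /* else-branch dead if cond true  */
	  ':' expr { $$.val = $1.val ? $4.val : $7.val; skip_events = 0; }

The scanner then checks skip_events and ignores events while it is set.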

Or maybe at some point we need to bite the bullet and build
an AST, but for now we can probably get away with not doing it.


-Andi


Re: [PATCH] perf stat: Use proper cpu for shadow stats

2020-11-13 Thread Andi Kleen
>   CPU0  109,621,384  cycles
>   CPU1  159,026,454  cycles
>   CPU2   99,460,366  cycles
>   CPU3  124,144,142  cycles
>   CPU0   44,396,706  instructions  #0.41  insn per cycle
>   CPU1  120,195,425  instructions  #0.76  insn per cycle
>   CPU2   44,763,978  instructions  #0.45  insn per cycle
>   CPU3   69,049,079  instructions  #0.56  insn per cycle
> 
>1.001910444 seconds time elapsed

Yes makes a lot of sense. Thanks for tracking that down.

Reviewed-by: Andi Kleen 

-Andi


Re: [RFC PATCH 00/12] Topdown parser

2020-11-11 Thread Andi Kleen
On Wed, Nov 11, 2020 at 08:09:49PM -0800, Ian Rogers wrote:
>  >    to the optimization manual the group Topdown_Group_TopDownL1
>  provides the
>  >   
>  metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound, 
> Topdown_Metric_Bad_Speculation
>  >    and Topdown_Metric_Retiring. The hope is the events here will all
>  be
>  >    scheduled without multiplexing.
> 
>  That's not necessarily true. Some of the newer expressions are quite
>  complex (e.g .due to workarounds or because the events are complex, like
>  the FLOPS events) There's also some problems with the
>  scheduling of the fixed metrics on Icelake+, that need special handling.
> 
>For FLOPS I see:
>( #FP_Arith_Scalar + #FP_Arith_Vector ) / ( 2 * CORE_CLKS ) 
>is the concern about multiplexing? Could we create metrics/groups aware of
>the limitations?

If you expand it you'll end up with a lot of events, so it has to be split
into groups. But you still need to understand the rules, otherwise the tool
ends up with non-schedulable groups.

For example here's a group schedule generated by toplev for level 3 Icelake.
On pre Icelake with only 4 counters it's more complicated.

Microcode_Sequencer[3] Heavy_Operations[1] Memory_Bound[1]
Branch_Mispredicts[1] Fetch_Bandwidth[1] Other[4] Heavy_Operations[3]
Frontend_Bound[1] FP_Arith[1] Backend_Bound[1] Light_Operations[3]
Microcode_Sequencer[1] FP_Arith[4] Fetch_Bandwidth[2] Light_Operations[1]
Retiring[1] Bad_Speculation[1] Machine_Clears[1] Other[1] Core_Bound[1]
Ports_Utilization[1]:
  perf_metrics.frontend_bound[f] perf_metrics.bad_speculation[f]
  topdown.slots[f] perf_metrics.backend_bound[f] perf_metrics.retiring[f] [0
  counters]
Machine_Clears[2] Branch_Mispredicts[2] ITLB_Misses[3] Branch_Mispredicts[1]
Fetch_Bandwidth[1] Machine_Clears[1] Frontend_Bound[1] Backend_Bound[1]
ICache_Misses[3] Fetch_Latency[2] Fetch_Bandwidth[2] Bad_Speculation[1]
Memory_Bound[1] DSB_Switches[3]:
  inst_retired.any[f] machine_clears.count cpu_clk_unhalted.thread[f]
  int_misc.recovery_cycles:c1:e1 br_misp_retired.all_branches
  idq_uops_not_delivered.cycles_0_uops_deliv.core topdown.slots[f]
  dsb2mite_switches.penalty_cycles icache_64b.iftag_stall
  int_misc.uop_dropping icache_16b.ifdata_stall [8 counters]
Core_Bound[1] Core_Bound[2] Heavy_Operations[3] Memory_Bound[2]
Light_Operations[3]:
  cycle_activity.stalls_mem_any idq.ms_uops exe_activity.1_ports_util
  exe_activity.exe_bound_0_ports exe_activity.2_ports_util
  exe_activity.bound_on_stores int_misc.recovery_cycles:c1:e1 uops_issued.any
  [8 counters]
Microcode_Sequencer[3] Store_Bound[3] Branch_Resteers[3] MS_Switches[3]
Divider[3] LCP[3]:
  int_misc.clear_resteer_cycles arith.divider_active
  cpu_clk_unhalted.thread[f] idq.ms_switches ild_stall.lcp baclears.any
  exe_activity.bound_on_stores idq.ms_uops uops_issued.any [8 counters]
L1_Bound[3] L3_Bound[3]:
  cycle_activity.stalls_l1d_miss cycle_activity.stalls_mem_any
  cycle_activity.stalls_l2_miss cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss [4 counters]
DSB[3] MITE[3]:
  idq.mite_cycles_ok idq.mite_cycles_any cpu_clk_unhalted.distributed
  idq.dsb_cycles_any idq.dsb_cycles_ok [5 counters]
LSD[3] L2_Bound[3]:
  cycle_activity.stalls_l1d_miss cpu_clk_unhalted.thread[f]
  cpu_clk_unhalted.distributed lsd.cycles_ok lsd.cycles_active
  cycle_activity.stalls_l2_miss [5 counters]
Other[4] FP_Arith[4] L2_Bound[3] DRAM_Bound[3]:
  mem_load_retired.fb_hit mem_load_retired.l2_hit
  fp_arith_inst_retired.512b_packed_single mem_load_retired.l1_miss
  fp_arith_inst_retired.512b_packed_double l1d_pend_miss.fb_full_periods [6
  counters]
Ports_Utilization[3] DRAM_Bound[3]:
  cycle_activity.stalls_l1d_miss arith.divider_active mem_load_retired.l2_hit
  exe_activity.1_ports_util cpu_clk_unhalted.thread[f]
  cycle_activity.stalls_l3_miss exe_activity.2_ports_util
  cycle_activity.stalls_l2_miss exe_activity.exe_bound_0_ports [8 counters]
Other[4] FP_Arith[4]:
  fp_arith_inst_retired.128b_packed_single uops_executed.thread
  uops_executed.x87 fp_arith_inst_retired.scalar_double
  fp_arith_inst_retired.256b_packed_single
  fp_arith_inst_retired.scalar_single
  fp_arith_inst_retired.128b_packed_double
  fp_arith_inst_retired.256b_packed_double [8 counters]


>Ok, so we can read the threshold from the spreadsheet and create an extra
>metric for if the metric above the threshold?

Yes it can be all derived from the spreadsheet.

You need a much more complicated evaluation algorithm though, single
pass is not enough.

> 
>  Also in other cases it's probably better to not drilldown, but collect
>  everything upfront, e.g. when someone else is doing the collection
>  for you. In this case the thresholding has to be figured out from
>  existing data.
> 
>This sounds like perf record support for metrics, which I think is a good
>idea but not currently a use-case I'm trying to solve.

I meant just a single 

Re: [RFC PATCH 00/12] Topdown parser

2020-11-11 Thread Andi Kleen
Hi Ian,

On Wed, Nov 11, 2020 at 03:49:10PM -0800, Ian Rogers wrote:
>  FWIW I did something similar in python (that's how the current metrics
>  json files were generated from the spreadsheet) and it was a lot
>  simpler and shorter in a higher level language.
> 
>The use of C++ here was intended to be a high-level language usage :-)

FWIW this is our script for the metrics.

https://github.com/intel/event-converter-for-linux-perf/blob/master/extract-tmam-metrics.py

It has a few hacks, but overall isn't too bad.

> 
> 
>  One problem I see with putting the full TopDown model into perf is
>  that to do a full job it requires a lot more infrastructure that is
>  currently not implemented in metrics: like an event scheduler,
>  hierarchical thresholding over different nodes, SMT mode support etc.
> 
>  I implemented it all in toplev, but it was a lot of work to get it all
>  right.
>  I'm not saying it's not doable, but it will be a lot of additional work
>  to work out all the quirks using the metrics infrastructure.
> 
>  I think adding one or two more levels is probably ok, but doing all
>  levels
>  without these mechanisms might be difficult to use in the end.
> 
>Thanks Andi, I'm working from the optimization manual and trying to
>understand this. With this contribution we have both metrics and
>groups that correspond to the levels in the tree. Following a similar flow
>to the optimization manual the group Topdown_Group_TopDownL1 provides the
>metrics Topdown_Metric_Frontend_Bound, Topdown_Metric_Backend_Bound, 
> Topdown_Metric_Bad_Speculation
>and Topdown_Metric_Retiring. The hope is the events here will all be
>scheduled without multiplexing.

That's not necessarily true. Some of the newer expressions are quite
complex (e.g. due to workarounds or because the events are complex, like
the FLOPS events). There are also some problems with the
scheduling of the fixed metrics on Icelake+ that need special handling.

> Let's say Topdown_Metric_Backend_Bound is
>an outlier and so then you re-run perf to drill into the group
>Topdown_Group_Backend which will provide the
>metrics Topdown_Metric_Memory_Bound and Topdown_Metric_Core_Bound. If the
>metric Topdown_Metric_Memory_Bound is the outlier then you can use the
>group Topdown_Group_Memory_Bound to drill into DRAM, L1, L2, L3, PMM and

I think at a minimum you would need a threshold concept for this, otherwise
the users don't know what passed or what didn't, and where to drill
down.

Also in modern TMAM some event thresholds are not only based on their
parents but on triggers crossing the tree (e.g. $issue). Doing this
by hand is somewhat tricky.

BTW toplev has an auto drilldown mode now to automate this (--drilldown)

Also in other cases it's probably better to not drilldown, but collect
everything upfront, e.g. when someone else is doing the collection
for you. In this case the thresholding has to be figured out from
existing data.

The other biggie which is currently not in metrics is per core mode,
which is needed for many metrics on CPUs older than Icelake. This really
has to be supported in some way, otherwise the results on pre Icelake SMT 
are not good (Icelake fixes this problem)

>like SMT_on that may appear in the spreadsheet in expressions like:
>( CPU_CLK_UNHALTED.THREAD_ANY / 2 ) if #SMT_on else CLKS
>These are passed through and then the perf expression parser will at
>runtime compute the SMT_ON value using smt_on() that reads
>from devices/system/cpu/cpu%d/topology/thread_siblings. Here is a
>generated example for Skylakex:

Yes, I implemented that. But that's not enough for the full TMAM; you also
need per core mode for these (but not for others).

At level 1 we get away with either being in per core mode or not, but
that doesn't work deeper in the hierarchy.

>SMT_ON needs evaluating early. This is a latent perf metric issue that
>would benefit everyone if fixed, so I'll try to address it in a separate
>change.
>Event scheduling is missing and as discussed there are some naive
>heuristics currently in perf. It seems we can improve this, for example a
>group with events {A,B,C} and another with {B,C,D} could be merged to
>create {A,B,C,D} but we'd need to know the number of counters. We could

It's more complicated with OCR and similar special events, which
can fit only a limited number into a group.

Also on Icelake you need to know about the 4 vs 8 counter constraints,
so it requires some knowledge of the event list.

With the fixed metrics there are also special rules, like slots needs
to be first and be followed by the metrics.

Also if you don't do some deduping you end up with a lot of redundant
events being scheduled. Also when there are multiple levels of expressions
usually there is a need for sub grouping by levels etc.

It's not a simple algorithm.

>provide 

Re: [RFC PATCH 00/12] Topdown parser

2020-11-11 Thread Andi Kleen
On Tue, Nov 10, 2020 at 02:03:34AM -0800, Ian Rogers wrote:
> This RFC is for a new tool that reads TMA_Metrics.csv as found on
> download.01.org/perfmon and generates metrics and metric groups from
> it. To show the functionality the TMA_Metrics.csv is downloaded, but
> an accepted change would most likely include a copy of this file from
> Intel. With this tool rather than just level 1 topdown metrics, a full
> set of topdown metrics to level 4 are generated.

I'm not sure I understand the motivation for making the spreadsheet parsing
part of the perf tool? It only needs to run once when the metrics are generated
to build perf.

FWIW I did something similar in python (that's how the current metrics
json files were generated from the spreadsheet) and it was a lot
simpler and shorter in a higher level language.

One problem I see with putting the full TopDown model into perf is 
that to do a full job it requires a lot more infrastructure that is
currently not implemented in metrics: like an event scheduler,
hierarchical thresholding over different nodes, SMT mode support etc.

I implemented it all in toplev, but it was a lot of work to get it all right.
I'm not saying it's not doable, but it will be a lot of additional work
to work out all the quirks using the metrics infrastructure.

I think adding one or two more levels is probably ok, but doing all levels
without these mechanisms might be difficult to use in the end.

-Andi


Re: [PATCH 24/24] perf record: Add --buildid-mmap option to enable mmap's build id

2020-11-11 Thread Andi Kleen
On Mon, Nov 09, 2020 at 10:54:15PM +0100, Jiri Olsa wrote:
> Adding --buildid-mmap option to enable build id in mmap2 events.
> It will only work if there's kernel support for that and it disables
> build id cache (implies --no-buildid).

What's the point of the option? Why not enable it by default
if the kernel supports it?

With the option most user won't get the benefit.

The only reason I can think of for an option would be to disable it
so that old tools can still process the data.
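
FWIW, the detection could be automatic; a minimal userspace sketch
(assuming the new perf_event_attr bit from this series is called build_id):

/* Probe sketch, not perf tool code: retry without the bit on EINVAL. */
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>
#include <string.h>
#include <errno.h>

static int open_cycles(pid_t pid)
{
	struct perf_event_attr attr;
	int fd;

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_CPU_CYCLES;
	attr.mmap2 = 1;
	attr.build_id = 1;		/* request build ids in MMAP2 */

	fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
	if (fd < 0 && errno == EINVAL) {
		attr.build_id = 0;	/* old kernel: retry without */
		fd = syscall(SYS_perf_event_open, &attr, pid, -1, -1, 0);
	}
	return fd;
}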

-Andi


Re: [PATCH v12 01/11] perf/x86: Fix variable types for LBR registers

2020-11-08 Thread Andi Kleen
On Sat, Jun 13, 2020 at 04:09:46PM +0800, Like Xu wrote:
> From: Wei Wang 
> 
> The MSR variable type can be 'unsigned int', which uses less memory than
> the longer 'unsigned long'. Fix 'struct x86_pmu' for that. The lbr_nr won't
> be a negative number, so make it 'unsigned int' as well.

Hi, 

What's the status of this patchkit? It would be quite useful to me (and
various other people) to use LBRs in the guest. I reviewed it earlier and the
patches all looked good to me. But I don't see it in any -next tree.

Reviewed-by: Andi Kleen 

Could it please be merged?

Thanks,

-Andi


Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation

2020-11-02 Thread Andi Kleen
On Mon, Nov 02, 2020 at 03:16:25PM +0100, Peter Zijlstra wrote:
> On Sun, Nov 01, 2020 at 07:52:38PM -0800, Andi Kleen wrote:
> > The main motivation is actually that the "multiple groups" algorithm
> > in perf doesn't work all that great: it has quite a few cases where it
> > starves groups or makes the wrong decisions. That is because it is very
> > difficult (likely NP complete) problem and the kernel takes a lot
> > of short cuts to avoid spending too much time on it.
> 
> The event scheduling should be starvation free, except in the presence
> of pinned events.
> 
> If you can show starvation without pinned events, it's a bug.
> 
> It will also always do equal or better than exclusive mode wrt PMU
> utilization. Again, if it doesn't it's a bug.

Simple example (I think we've shown that one before):

(on skylake)
$ cat /proc/sys/kernel/nmi_watchdog
0
$ perf stat -e instructions,cycles,frontend_retired.latency_ge_2,frontend_retired.latency_ge_16 -a sleep 2

 Performance counter stats for 'system wide':

   654,514,990  instructions  #0.34  insn per cycle 
  (50.67%)
 1,924,297,028  cycles  
  (74.28%)
21,708,935  frontend_retired.latency_ge_2   
  (75.01%)
 1,769,952  frontend_retired.latency_ge_16  
   (24.99%)

   2.002426541 seconds time elapsed

Both frontend_retired events should be getting 50%, and the fixed events
should be getting 100%. So several events are starved.

Another similar example is trying to schedule the topdown events on Icelake in 
parallel to other
groups. It works with one extra group, but breaks with two.

(on icelake)
$ cat /proc/sys/kernel/nmi_watchdog
0
$ perf stat -e '{slots,topdown-bad-spec,topdown-be-bound,topdown-fe-bound,topdown-retiring},{branches,branches,branches,branches,branches,branches,branches,branches},{branches,branches,branches,branches,branches,branches,branches,branches}' -a sleep 1

 Performance counter stats for 'system wide':

71,229,087  slots   
  (60.65%)
 5,066,320  topdown-bad-spec  #  7.1% bad speculation   
  (60.65%)
35,080,387  topdown-be-bound  # 49.2% backend bound 
  (60.65%)
22,769,750  topdown-fe-bound  # 32.0% frontend bound
  (60.65%)
 8,336,760  topdown-retiring  # 11.7% retiring  
  (60.65%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
   424,584  branches
  (70.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)
 3,634,075  branches
  (30.00%)

   1.001312511 seconds time elapsed

A tool using exclusive hopefully will be able to do better than this.

-Andi


Re: [PATCH v2 0/4] perf: Fix perf_event_attr::exclusive rotation

2020-11-01 Thread Andi Kleen
> hm, it's too late for me to check ;-) but should I be able to do
> this with exclusive event.. running both command at the same time:

Yes. The exclusive part only applies during a given context,
but the two commands are different contexts.

You would only see a difference when in the same context,
and you have multiple groups (or events) that could in theory schedule
in parallel

e.g. something like

perf stat -e '{cycles,cycles},{cycles,cycles}'  ...

The main motivation is actually that the "multiple groups" algorithm
in perf doesn't work all that great: it has quite a few cases where it
starves groups or makes the wrong decisions. That is because it is a very
difficult (likely NP-complete) problem and the kernel takes a lot
of short cuts to avoid spending too much time on it.

With exclusive it will be possible for a tool to generate "perfect groups"
in user space and assume the kernel schedules it dumbly, but at least
without any starvation.

-Andi


[PATCH v3] perf tools: Support -x for perf stat report

2020-11-01 Thread Andi Kleen
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
 1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
 1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
 1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
 1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
 1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
 1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
 1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
 1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen 

---

v2: Fix default output (Jiri). Also handle \t as special value like the
original parser (Andi)
v3: Use DEFAULT_SEPARATOR
---
 tools/perf/builtin-stat.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..9fcc7351ce43 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 "aggregate counts per numa node", AGGR_NODE),
OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 "disable CPU count aggregation", AGGR_NONE),
+   OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+  "print counts with custom separator"),
OPT_END()
};
struct stat st;
@@ -2002,6 +2004,13 @@ static int __cmd_report(int argc, const char **argv)
input_name = "perf.data";
}
 
+   if (strcmp(stat_config.csv_sep, DEFAULT_SEPARATOR)) {
+   stat_config.csv_output = true;
+   if (!strcmp(stat_config.csv_sep, "\\t"))
+   stat_config.csv_sep = "\t";
+   stat_config.big_num = false;
+   }
+
perf_stat.data.path = input_name;
perf_stat.data.mode = PERF_DATA_MODE_READ;
 
-- 
2.28.0



[PATCH v2] perf tools: Support -x for perf stat report

2020-10-28 Thread Andi Kleen
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
 1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
 1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
 1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
 1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
 1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
 1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
 1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
 1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen 

---

v2: Fix default output (Jiri). Also handle \t as special value like the
original parser (Andi)
---
 tools/perf/builtin-stat.c | 9 +
 1 file changed, 9 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..2f7eb6b68344 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 "aggregate counts per numa node", AGGR_NODE),
OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 "disable CPU count aggregation", AGGR_NONE),
+   OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+  "print counts with custom separator"),
OPT_END()
};
struct stat st;
@@ -2002,6 +2004,13 @@ static int __cmd_report(int argc, const char **argv)
input_name = "perf.data";
}
 
+   if (strcmp(stat_config.csv_sep, " ")) {
+   stat_config.csv_output = true;
+   if (!strcmp(stat_config.csv_sep, "\\t"))
+   stat_config.csv_sep = "\t";
+   stat_config.big_num = false;
+   }
+
perf_stat.data.path = input_name;
perf_stat.data.mode = PERF_DATA_MODE_READ;
 
-- 
2.28.0



[PATCH 2/2] perf tools: Support -x for perf stat report

2020-10-26 Thread Andi Kleen
Add support for the -x, option to enable CSV output with perf stat
report. Useful to parse the information with other programs.

% perf stat record --quiet -a -I 1000 sleep 5
% perf stat report -x,
 1.000838672,4003.55,msec,cpu-clock,4003548736,100.00,,
 1.000838672,11243,,context-switches,4003631885,100.00,0.003,M/sec
 1.000838672,1682,,cpu-migrations,4003672150,100.00,0.420,K/sec
 1.000838672,13244,,page-faults,4003697471,100.00,0.003,M/sec
 1.000838672,2953214077,,cycles,4003715495,100.00,0.738,GHz
 1.000838672,4380820799,,instructions,4003738270,100.00,1.48,insn per cycle
 1.000838672,809865653,,branches,4003760616,100.00,202.287,M/sec
 1.000838672,12439843,,branch-misses,4003780785,100.00,1.54,of all branches
...

Signed-off-by: Andi Kleen 
---
 tools/perf/builtin-stat.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index 743fe47e7a88..31e7bd877f1d 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -1988,6 +1988,8 @@ static int __cmd_report(int argc, const char **argv)
 "aggregate counts per numa node", AGGR_NODE),
OPT_SET_UINT('A', "no-aggr", &perf_stat.aggr_mode,
 "disable CPU count aggregation", AGGR_NONE),
+   OPT_STRING('x', "field-separator", &stat_config.csv_sep, "separator",
+  "print counts with custom separator"),
OPT_END()
};
struct stat st;
@@ -2002,6 +2004,11 @@ static int __cmd_report(int argc, const char **argv)
input_name = "perf.data";
}
 
+   if (stat_config.csv_sep) {
+   stat_config.csv_output = true;
+   stat_config.big_num = false;
+   }
+
perf_stat.data.path = input_name;
perf_stat.data.mode = PERF_DATA_MODE_READ;
 
-- 
2.28.0



[PATCH 1/2] perf tools: Add --quiet option to perf stat

2020-10-26 Thread Andi Kleen
Add a new --quiet option to perf stat. This is useful with perf stat
record to write the data only to the perf.data file, which can lower
measurement overhead because the data doesn't need to be formatted.

On my 4C desktop:

% time ./perf stat record -e $(python -c 'print ",".join(["cycles"]*1000)') -a -I 1000 sleep 5
...
real0m5.377s
user0m0.238s
sys 0m0.452s
% time ./perf stat record --quiet -e $(python -c 'print ",".join(["cycles"]*1000)') -a -I 1000 sleep 5

real0m5.452s
user0m0.183s
sys 0m0.423s

In this example it cuts the user time by 20%. On systems with more cores
the savings are higher.

Signed-off-by: Andi Kleen 
---
 tools/perf/Documentation/perf-stat.txt | 4 
 tools/perf/builtin-stat.c  | 6 +-
 tools/perf/util/stat.h | 1 +
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/tools/perf/Documentation/perf-stat.txt 
b/tools/perf/Documentation/perf-stat.txt
index 9f9f29025e49..b138dd192423 100644
--- a/tools/perf/Documentation/perf-stat.txt
+++ b/tools/perf/Documentation/perf-stat.txt
@@ -316,6 +316,10 @@ small group that need not have multiplexing is lowered. 
This option
 forbids the event merging logic from sharing events between groups and
 may be used to increase accuracy in this case.
 
+--quiet::
+Don't print output. This is useful with perf stat record below to only
+write data to the perf.data file.
+
 STAT RECORD
 ---
 Stores stat data into perf data file.
diff --git a/tools/perf/builtin-stat.c b/tools/perf/builtin-stat.c
index b01af171d94f..743fe47e7a88 100644
--- a/tools/perf/builtin-stat.c
+++ b/tools/perf/builtin-stat.c
@@ -972,6 +972,8 @@ static void print_counters(struct timespec *ts, int argc, 
const char **argv)
/* Do not print anything if we record to the pipe. */
if (STAT_RECORD && perf_stat.data.is_pipe)
return;
+   if (stat_config.quiet)
+   return;
 
perf_evlist__print_counters(evsel_list, &stat_config, &target,
ts, argc, argv);
@@ -1171,6 +1173,8 @@ static struct option stat_options[] = {
"threads of same physical core"),
OPT_BOOLEAN(0, "summary", &stat_config.summary,
   "print summary for interval mode"),
+   OPT_BOOLEAN(0, "quiet", &stat_config.quiet,
+   "don't print output (useful with record)"),
 #ifdef HAVE_LIBPFM
OPT_CALLBACK(0, "pfm-events", &evsel_list, "event",
"libpfm4 event selector. use 'perf list' to list available events",
@@ -2132,7 +2136,7 @@ int cmd_stat(int argc, const char **argv)
goto out;
}
 
-   if (!output) {
+   if (!output && !stat_config.quiet) {
struct timespec tm;
mode = append_file ? "a" : "w";
 
diff --git a/tools/perf/util/stat.h b/tools/perf/util/stat.h
index 487010c624be..05adf8165025 100644
--- a/tools/perf/util/stat.h
+++ b/tools/perf/util/stat.h
@@ -122,6 +122,7 @@ struct perf_stat_config {
bool metric_no_group;
bool metric_no_merge;
bool stop_read_counter;
+   bool quiet;
FILE*output;
unsigned int interval;
unsigned int timeout;
-- 
2.28.0



Re: [RFC] perf evlist: Warn if event group has mixed sw/hw events

2020-10-26 Thread Andi Kleen
On Mon, Oct 26, 2020 at 11:19:37PM +0900, Namhyung Kim wrote:
> This patch just added a warning before running it.  I'd really want to
> fix the kernel if possible but don't have a good idea.  Thoughts?

The easiest fix would be some multi-threading in perf stat opening; then the
extra latencies could be mostly hidden. One thread per group would probably
be overkill, but just a few threads would lower the penalty significantly.

I think that would be better than this patch and it's likely not that much
more complicated, as this is already a lot of code.
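
As a rough illustration of the shape (self-contained toy, not perf code;
open_one_group() stands in for the slow per-group perf_event_open() sequence):

#include <pthread.h>
#include <stdio.h>

#define NR_GROUPS  32
#define NR_THREADS 4

static int open_one_group(long idx)
{
	(void)idx;		/* placeholder for the real open loop */
	return 0;
}

static void *open_worker(void *arg)
{
	long tid = (long)arg;

	/* static partitioning: thread t opens groups t, t+NR_THREADS, ... */
	for (long i = tid; i < NR_GROUPS; i += NR_THREADS)
		if (open_one_group(i))
			fprintf(stderr, "group %ld failed\n", i);
	return NULL;
}

int main(void)
{
	pthread_t th[NR_THREADS];

	for (long t = 0; t < NR_THREADS; t++)
		pthread_create(&th[t], NULL, open_worker, (void *)t);
	for (int t = 0; t < NR_THREADS; t++)
		pthread_join(th[t], NULL);
	return 0;
}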

> +{
> + const char *known_sw_pmu[] = {
> + "software", "tracepoint", "breakpoint", "kprobe", "uprobe", 
> "msr"

That's a non-scalable approach. New PMUs get added regularly. It would be better to
indicate this in a generic way from the kernel.

> + pr_warning("WARNING: Event group has mixed hw/sw 
> events.\n"
> +"This will slow down the perf_event_open 
> syscall.\n"
> +"Consider putting a hw event as a 
> leader.\n\n");

You really need to tell the user which group, otherwise it is hard to find
in a large command line.


-Andi


Re: [PATCH 1/2] perf test: Use generic event for expand_libpfm_events()

2020-10-24 Thread Andi Kleen
On Sat, Oct 24, 2020 at 11:59:17AM +0900, Namhyung Kim wrote:
> I found that the UNHALTED_CORE_CYCLES event is only available in the
> Intel machines and it makes other vendors/archs fail on the test.  As
> libpfm4 can parse the generic events like cycles, let's use them.
> 
> Fixes: 40b74c30ffb9 ("perf test: Add expand cgroup event test")
> Signed-off-by: Namhyung Kim 

So would the test still fail when libpfm is not compiled in?

-Andi


Re: [PATCH] perf stat: Support regex pattern in --for-each-cgroup

2020-10-23 Thread Andi Kleen
On Fri, Oct 23, 2020 at 04:42:34PM +0900, Namhyung Kim wrote:
> To make the command line even more compact with cgroups, support regex
> pattern matching in cgroup names.
> 
>   $ perf stat -a -e cpu-clock,cycles --for-each-cgroup '^.$' sleep 1

The example doesn't exactly show the benefit. So ^.$ would match only
one-character cgroup names?

Missing documentation updates.

-Andi



Re: [PATCH] perf/x86/intel: make anythread filter support conditional

2020-10-21 Thread Andi Kleen
On Wed, Oct 21, 2020 at 02:16:12PM -0700, Stephane Eranian wrote:
> Starting with Arch Perfmon v5, the anythread filter on generic counters may be
> deprecated. The current kernel was exporting the any filter without checking.
> On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
> does not exist. This patch corrects the problem by relying on the CPUID 0xa 
> leaf
> function to determine if anythread is supported or not as described in the
> Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.
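
For reference, a userspace sketch of the check the patch relies on (the SDM
documents CPUID.0AH:EDX[15] as the AnyThread deprecation bit):

/* Sketch only, not the kernel patch. */
#include <cpuid.h>
#include <stdbool.h>

static bool anythread_deprecated(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx))
		return false;
	return edx & (1u << 15);	/* EDX bit 15: AnyThread deprecated */
}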

Reviewed-by: Andi Kleen 

-Andi


Re: [PATCH v1 02/15] perf report: output trace file name in raw trace dump

2020-10-20 Thread Andi Kleen
> > After brief evaluation it still doesn't look easy. The simplest thing
> > I could imagine is to probably combine file_path and file_offset at a
> > struct object on stack and then pass the object by the reference along
> > function calls. I expect it will roughly cause the same amount of changes
> > in the code and saves one argument (GP register). It is not that much
> > but still. However I don't see issues with passing even 6 args on stack.
> 
> Sorry - "passing 6 args to a function call"

Ah it wasn't about code efficiency, just maintainability. Function calls
with too many arguments are generally hard to maintain, and typically
at some point have to be refactored anyways because they tend to grow
even more over time.

But if it's too difficult, ok.
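
For what it's worth, the kind of on-stack bundle being discussed could look
like this sketch (names assumed):

struct file_loc {
	const char *path;		/* trace file name */
	unsigned long long offset;	/* offset within the file */
};

Callers then pass one &loc instead of growing every signature by two arguments.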

-Andi


Re: [PATCH] perf/x86/intel: Add event constraint for CYCLE_ACTIVITY.STALLS_MEM_ANY

2020-10-19 Thread Andi Kleen
On Mon, Oct 19, 2020 at 08:01:58AM -0700, kan.li...@linux.intel.com wrote:
> Add a line for the CYCLE_ACTIVITY.STALLS_MEM_ANY event in the ICL
> constraint table.
> Correct the comments for the CYCLE_ACTIVITY.CYCLES_MEM_ANY event.

Thanks Kan.

> 
> Reported-by: Andi Kleen 
> Signed-off-by: Kan Liang 

I guess this should have a Fixes: tag and also be proposed for
stable.

-Andi


