Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-03-11 Thread Jan Glauber
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:

[...]

> > +int thunder_uncore_event_init(struct perf_event *event)
> > +{
> > +   struct hw_perf_event *hwc = >hw;
> > +   struct thunder_uncore *uncore;
> > +
> > +   if (event->attr.type != event->pmu->type)
> > +   return -ENOENT;
> > +
> > +   /* we do not support sampling */
> > +   if (is_sampling_event(event))
> > +   return -EINVAL;
> > +
> > +   /* counters do not have these bits */
> > +   if (event->attr.exclude_user||
> > +   event->attr.exclude_kernel  ||
> > +   event->attr.exclude_host||
> > +   event->attr.exclude_guest   ||
> > +   event->attr.exclude_hv  ||
> > +   event->attr.exclude_idle)
> > +   return -EINVAL;
> 
> We should _really_ make these features opt-in at the core level. It's
> crazy that each and every PMU drivers has to explicitly test and reject
> things it doesn't support.
> 

Looking at the exclude_* feature handling in pmu->event_init under
arch/ shows lots of differences. Just as an example, exclude_idle
returns sometimes -EINVAL, sometimes -EPERM while other archs
ignore it and one silently deletes the flag.

So it looks hard to me to move the exclude handling into
perf core while keeping the per-arch differences. And if we
don't and return an error on the perf_event_open syscall in a newer
kernel for an exclude bit that was previously ignored it will be a
user-space regression, right?

Looking only at the uncore drivers (plus drivers/bus/arm-cc*)
it looks like they all don't support any exclude bits but the
handling here differs also. The table shows the current behaviour,
X means the requested exclude flag is refused.

  user  kernel  hostguesthv idle
-
x86 uncore intel|  X  X  X   X
x86 uncore intel snb|  X  X  X   X   X   X
x86 uncore intel cqm|  X  X  X   X   X   X
x86 uncore intel cstate |  X  X  X   X   X   X
x86 uncore intel rapl   |  X  X  X   X   X   X
x86 uncore amd  |  X  X  X   X
x86 uncore amd iommu|  X  X  X   X
x86 uncore amd ibs  |  X  X  X   X   X   X
arm bus cci |  X  X  X   X   X   X
arm bus ccn |  X  X  X   X
--

How about simply adding a helper function to include/linux/perf_event.h
that checks if _any_ of the exclude bits is set? We could then
simplify the check-for-any exclude flag to:

if (any_exclude_set(event))
return -EINVAL;

What's your opinion?

Jan

---

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5c5a3f..0eacdba0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -849,6 +849,18 @@ static inline bool is_sampling_event(struct perf_event 
*event)
return event->attr.sample_period != 0;
 }
 
+static inline int any_exclude_set(struct perf_event *event)
+{
+   if (event->attr.exclude_user||
+   event->attr.exclude_kernel  ||
+   event->attr.exclude_host||
+   event->attr.exclude_guest   ||
+   event->attr.exclude_hv  ||
+   event->attr.exclude_idle)
+   return 1;
+   return 0;
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-03-11 Thread Jan Glauber
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:

[...]

> > +int thunder_uncore_event_init(struct perf_event *event)
> > +{
> > +   struct hw_perf_event *hwc = >hw;
> > +   struct thunder_uncore *uncore;
> > +
> > +   if (event->attr.type != event->pmu->type)
> > +   return -ENOENT;
> > +
> > +   /* we do not support sampling */
> > +   if (is_sampling_event(event))
> > +   return -EINVAL;
> > +
> > +   /* counters do not have these bits */
> > +   if (event->attr.exclude_user||
> > +   event->attr.exclude_kernel  ||
> > +   event->attr.exclude_host||
> > +   event->attr.exclude_guest   ||
> > +   event->attr.exclude_hv  ||
> > +   event->attr.exclude_idle)
> > +   return -EINVAL;
> 
> We should _really_ make these features opt-in at the core level. It's
> crazy that each and every PMU drivers has to explicitly test and reject
> things it doesn't support.
> 

Looking at the exclude_* feature handling in pmu->event_init under
arch/ shows lots of differences. Just as an example, exclude_idle
returns sometimes -EINVAL, sometimes -EPERM while other archs
ignore it and one silently deletes the flag.

So it looks hard to me to move the exclude handling into
perf core while keeping the per-arch differences. And if we
don't and return an error on the perf_event_open syscall in a newer
kernel for an exclude bit that was previously ignored it will be a
user-space regression, right?

Looking only at the uncore drivers (plus drivers/bus/arm-cc*)
it looks like they all don't support any exclude bits but the
handling here differs also. The table shows the current behaviour,
X means the requested exclude flag is refused.

  user  kernel  hostguesthv idle
-
x86 uncore intel|  X  X  X   X
x86 uncore intel snb|  X  X  X   X   X   X
x86 uncore intel cqm|  X  X  X   X   X   X
x86 uncore intel cstate |  X  X  X   X   X   X
x86 uncore intel rapl   |  X  X  X   X   X   X
x86 uncore amd  |  X  X  X   X
x86 uncore amd iommu|  X  X  X   X
x86 uncore amd ibs  |  X  X  X   X   X   X
arm bus cci |  X  X  X   X   X   X
arm bus ccn |  X  X  X   X
--

How about simply adding a helper function to include/linux/perf_event.h
that checks if _any_ of the exclude bits is set? We could then
simplify the check-for-any exclude flag to:

if (any_exclude_set(event))
return -EINVAL;

What's your opinion?

Jan

---

diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
index f5c5a3f..0eacdba0 100644
--- a/include/linux/perf_event.h
+++ b/include/linux/perf_event.h
@@ -849,6 +849,18 @@ static inline bool is_sampling_event(struct perf_event 
*event)
return event->attr.sample_period != 0;
 }
 
+static inline int any_exclude_set(struct perf_event *event)
+{
+   if (event->attr.exclude_user||
+   event->attr.exclude_kernel  ||
+   event->attr.exclude_host||
+   event->attr.exclude_guest   ||
+   event->attr.exclude_hv  ||
+   event->attr.exclude_idle)
+   return 1;
+   return 0;
+}
+
 /*
  * Return 1 for a software event, 0 for a hardware event
  */


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-16 Thread Jan Glauber
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > Provide uncore facilities for non-CPU performance counter units.
> > Based on Intel/AMD uncore pmu support.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> 
> It turns out that "uncore" covers quite a lot of things.
> 
> Where exactly do the see counters live? system, socket, cluster?
> 
> Are there potentially multiple instances of a given PMU in the system?
> e.g. might each clutster have an instance of an L2 PMU?

Thinking twice about the multi-node systems I would like to change
the implementation to not merge counters across nodes. There might
be value in having the counters per node and also performance wise
it would be better to merge counters only on the local node.

I'll introduce a node sysfs attribute that can be combined with
the event and fill the node with the numa id.

[...]

> 
> Otherwise, are they associated with some power domain?

No, power domains are not used for the "uncore" related devices.
These devices are currently always on.

Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-16 Thread Jan Glauber
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > Provide uncore facilities for non-CPU performance counter units.
> > Based on Intel/AMD uncore pmu support.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> 
> It turns out that "uncore" covers quite a lot of things.
> 
> Where exactly do the see counters live? system, socket, cluster?
> 
> Are there potentially multiple instances of a given PMU in the system?
> e.g. might each clutster have an instance of an L2 PMU?

Thinking twice about the multi-node systems I would like to change
the implementation to not merge counters across nodes. There might
be value in having the counters per node and also performance wise
it would be better to merge counters only on the local node.

I'll introduce a node sysfs attribute that can be combined with
the event and fill the node with the numa id.

[...]

> 
> Otherwise, are they associated with some power domain?

No, power domains are not used for the "uncore" related devices.
These devices are currently always on.

Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Jan Glauber
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote:
> > > > 1) The PMU detection solely relies on PCI device detection. If a
> > > >matching PCI device is found the PMU is created. The code can deal
> > > >with multiple units of the same type, e.g. more than one memory
> > > >controller.
> > > 
> > > I see below that the driver has an initcall that runs regardless of
> > > whether the PCI device exists, and looks at the MIDR. That's clearly not
> > > string PCI device detection.
> > > 
> > > Why is this not a true PCI driver that only gets probed if the PCI
> > > device exists? 
> > 
> > It is not a PCI driver because there are already drivers like edac that
> > will access these PCI devices. The uncore driver only accesses the
> > performance counters, which are not used by the other drivers.
> 
> Several drivers are accessing the same device?
> 
> That sounds somewhat scary.

I've double checked that the edac drivers do not access the performance
counters at all. It felt cleaner to me to keep the performance counter code
seperated from edac.

Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Jan Glauber
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote:
> > > > 1) The PMU detection solely relies on PCI device detection. If a
> > > >matching PCI device is found the PMU is created. The code can deal
> > > >with multiple units of the same type, e.g. more than one memory
> > > >controller.
> > > 
> > > I see below that the driver has an initcall that runs regardless of
> > > whether the PCI device exists, and looks at the MIDR. That's clearly not
> > > string PCI device detection.
> > > 
> > > Why is this not a true PCI driver that only gets probed if the PCI
> > > device exists? 
> > 
> > It is not a PCI driver because there are already drivers like edac that
> > will access these PCI devices. The uncore driver only accesses the
> > performance counters, which are not used by the other drivers.
> 
> Several drivers are accessing the same device?
> 
> That sounds somewhat scary.

I've double checked that the edac drivers do not access the performance
counters at all. It felt cleaner to me to keep the performance counter code
seperated from edac.

Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote:
> > > > +   uncore = event_to_thunder_uncore(event);
> > > > +   if (!uncore)
> > > > +   return -ENODEV;
> > > > +   if (!uncore->event_valid(event->attr.config))
> > > > +   return -EINVAL;
> > > > +
> > > > +   hwc->config = event->attr.config;
> > > > +   hwc->idx = -1;
> > > > +
> > > > +   /* and we don't care about CPU */
> > > 
> > > Actually, you do. You want the perf core to serialize accesses via the
> > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> > > there are a tonne of problems you don't even want to think about.
> > 
> > I found that perf added the events on every CPU in the system. Because
> > the uncore events are not CPU related I wanted to avoid this. Setting
> > cpumask to -1 did not work. Therefore I added a single CPU in the
> > cpumask, see thunder_uncore_attr_show_cpumask().
> 
> I understand that, which is why I wrote:
> 
> > > You _must_ ensure this kernel-side, regardless of what the perf tool
> > > happens to do.
> > > 
> > > See the arm-cci and arm-ccn drivers for an example.
> 
> Take a look at drivers/bus/arm-cci.c; specifically, what we do in
> cci_pmu_event_init and cci_pmu_cpu_notifier.
> 
> This is the same thing that's done for x86 system PMUs. Take a look at
> uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c.

I note that we still have an open TODO rather than a call to
perf_pmu_migrate_context.

The better example is arm_ccn_pmu_cpu_notifier in drivers/bus/arm-ccn.c.

Mark.


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote:
> > > > +   uncore = event_to_thunder_uncore(event);
> > > > +   if (!uncore)
> > > > +   return -ENODEV;
> > > > +   if (!uncore->event_valid(event->attr.config))
> > > > +   return -EINVAL;
> > > > +
> > > > +   hwc->config = event->attr.config;
> > > > +   hwc->idx = -1;
> > > > +
> > > > +   /* and we don't care about CPU */
> > > 
> > > Actually, you do. You want the perf core to serialize accesses via the
> > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> > > there are a tonne of problems you don't even want to think about.
> > 
> > I found that perf added the events on every CPU in the system. Because
> > the uncore events are not CPU related I wanted to avoid this. Setting
> > cpumask to -1 did not work. Therefore I added a single CPU in the
> > cpumask, see thunder_uncore_attr_show_cpumask().
> 
> I understand that, which is why I wrote:
> 
> > > You _must_ ensure this kernel-side, regardless of what the perf tool
> > > happens to do.
> > > 
> > > See the arm-cci and arm-ccn drivers for an example.
> 
> Take a look at drivers/bus/arm-cci.c; specifically, what we do in
> cci_pmu_event_init and cci_pmu_cpu_notifier.
> 
> This is the same thing that's done for x86 system PMUs. Take a look at
> uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c.

I note that we still have an open TODO rather than a call to
perf_pmu_migrate_context.

The better example is arm_ccn_pmu_cpu_notifier in drivers/bus/arm-ccn.c.

Mark.


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Jan Glauber
Hi Mark,

thanks for reviewing! I'll need several mails to address all questions.

On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > Provide uncore facilities for non-CPU performance counter units.
> > Based on Intel/AMD uncore pmu support.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> 
> It turns out that "uncore" covers quite a lot of things.
> 
> Where exactly do the see counters live? system, socket, cluster?

Neither cluster nor socket, so I would say system. Where a system may
consist of 2 nodes that mostly appear as one system.

> Are there potentially multiple instances of a given PMU in the system?
> e.g. might each clutster have an instance of an L2 PMU?

Yes.

> If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
> inaccessible?

No, they are not related to CPUs.

[...]

> > 
> > 1) The PMU detection solely relies on PCI device detection. If a
> >matching PCI device is found the PMU is created. The code can deal
> >with multiple units of the same type, e.g. more than one memory
> >controller.
> 
> I see below that the driver has an initcall that runs regardless of
> whether the PCI device exists, and looks at the MIDR. That's clearly not
> string PCI device detection.
> 
> Why is this not a true PCI driver that only gets probed if the PCI
> device exists? 

It is not a PCI driver because there are already drivers like edac that
will access these PCI devices. The uncore driver only accesses the
performance counters, which are not used by the other drivers.

[...]

> > +#include 
> > +#include 
> 
> I don't see why you should need these two if this is truly an uncore
> device probed solely from PCI.

There are several passes of the hardware that have the same PCI device
ID. Therefore I need the CPU variant to distinguish them. This could
be done _after_ the PCI device is found but I found it easier to
implement the check once in the common setup function.

[...]

> > +int thunder_uncore_event_init(struct perf_event *event)
> > +{
> > +   struct hw_perf_event *hwc = >hw;
> > +   struct thunder_uncore *uncore;
> > +
> > +   if (event->attr.type != event->pmu->type)
> > +   return -ENOENT;
> > +
> > +   /* we do not support sampling */
> > +   if (is_sampling_event(event))
> > +   return -EINVAL;
> > +
> > +   /* counters do not have these bits */
> > +   if (event->attr.exclude_user||
> > +   event->attr.exclude_kernel  ||
> > +   event->attr.exclude_host||
> > +   event->attr.exclude_guest   ||
> > +   event->attr.exclude_hv  ||
> > +   event->attr.exclude_idle)
> > +   return -EINVAL;
> 
> We should _really_ make these features opt-in at the core level. It's
> crazy that each and every PMU drivers has to explicitly test and reject
> things it doesn't support.

Completely agreed. Also, every sample code I looked at did
check for other bits...

[...]

> > +
> > +   uncore = event_to_thunder_uncore(event);
> > +   if (!uncore)
> > +   return -ENODEV;
> > +   if (!uncore->event_valid(event->attr.config))
> > +   return -EINVAL;
> > +
> > +   hwc->config = event->attr.config;
> > +   hwc->idx = -1;
> > +
> > +   /* and we don't care about CPU */
> 
> Actually, you do. You want the perf core to serialize accesses via the
> same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> there are a tonne of problems you don't even want to think about.

I found that perf added the events on every CPU in the system. Because
the uncore events are not CPU related I wanted to avoid this. Setting
cpumask to -1 did not work. Therefore I added a single CPU in the
cpumask, see thunder_uncore_attr_show_cpumask().

> You _must_ ensure this kernel-side, regardless of what the perf tool
> happens to do.
> 
> See the arm-cci and arm-ccn drivers for an example.
> 
> You can also follow the migration approach used there to allow you to
> retain counting across a hotplug.
> 
> [...]
> 
> > +static int __init thunder_uncore_init(void)
> > +{
> > +   unsigned long implementor = read_cpuid_implementor();
> > +   unsigned long part_number = read_cpuid_part_number();
> > +   u32 variant;
> > +
> > +   if (implementor != ARM_CPU_IMP_CAVIUM ||
> > +   part_number != CAVIUM_CPU_PART_THUNDERX)
> > +   return -ENODEV;
> > +
> > +   /* detect pass2 which contains different counters */
> > +   variant = MIDR_VARIANT(read_cpuid_id());
> > +   if (variant == 1)
> > +   thunder_uncore_version = 1;
> > +   pr_info("PMU version: %d\n", thunder_uncore_version);
> > +
> > +   return 0;
> > +}
> 
> You should call out these differences in the commmit message.

OK

> Mark.

Thanks, Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Jan Glauber
Hi Mark,

thanks for reviewing! I'll need several mails to address all questions.

On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > Provide uncore facilities for non-CPU performance counter units.
> > Based on Intel/AMD uncore pmu support.
> > 
> > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > All counters are exported via sysfs in the corresponding events
> > files under the PMU directory so the perf tool can list the event names.
> 
> It turns out that "uncore" covers quite a lot of things.
> 
> Where exactly do the see counters live? system, socket, cluster?

Neither cluster nor socket, so I would say system. Where a system may
consist of 2 nodes that mostly appear as one system.

> Are there potentially multiple instances of a given PMU in the system?
> e.g. might each clutster have an instance of an L2 PMU?

Yes.

> If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
> inaccessible?

No, they are not related to CPUs.

[...]

> > 
> > 1) The PMU detection solely relies on PCI device detection. If a
> >matching PCI device is found the PMU is created. The code can deal
> >with multiple units of the same type, e.g. more than one memory
> >controller.
> 
> I see below that the driver has an initcall that runs regardless of
> whether the PCI device exists, and looks at the MIDR. That's clearly not
> string PCI device detection.
> 
> Why is this not a true PCI driver that only gets probed if the PCI
> device exists? 

It is not a PCI driver because there are already drivers like edac that
will access these PCI devices. The uncore driver only accesses the
performance counters, which are not used by the other drivers.

[...]

> > +#include 
> > +#include 
> 
> I don't see why you should need these two if this is truly an uncore
> device probed solely from PCI.

There are several passes of the hardware that have the same PCI device
ID. Therefore I need the CPU variant to distinguish them. This could
be done _after_ the PCI device is found but I found it easier to
implement the check once in the common setup function.

[...]

> > +int thunder_uncore_event_init(struct perf_event *event)
> > +{
> > +   struct hw_perf_event *hwc = >hw;
> > +   struct thunder_uncore *uncore;
> > +
> > +   if (event->attr.type != event->pmu->type)
> > +   return -ENOENT;
> > +
> > +   /* we do not support sampling */
> > +   if (is_sampling_event(event))
> > +   return -EINVAL;
> > +
> > +   /* counters do not have these bits */
> > +   if (event->attr.exclude_user||
> > +   event->attr.exclude_kernel  ||
> > +   event->attr.exclude_host||
> > +   event->attr.exclude_guest   ||
> > +   event->attr.exclude_hv  ||
> > +   event->attr.exclude_idle)
> > +   return -EINVAL;
> 
> We should _really_ make these features opt-in at the core level. It's
> crazy that each and every PMU drivers has to explicitly test and reject
> things it doesn't support.

Completely agreed. Also, every sample code I looked at did
check for other bits...

[...]

> > +
> > +   uncore = event_to_thunder_uncore(event);
> > +   if (!uncore)
> > +   return -ENODEV;
> > +   if (!uncore->event_valid(event->attr.config))
> > +   return -EINVAL;
> > +
> > +   hwc->config = event->attr.config;
> > +   hwc->idx = -1;
> > +
> > +   /* and we don't care about CPU */
> 
> Actually, you do. You want the perf core to serialize accesses via the
> same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> there are a tonne of problems you don't even want to think about.

I found that perf added the events on every CPU in the system. Because
the uncore events are not CPU related I wanted to avoid this. Setting
cpumask to -1 did not work. Therefore I added a single CPU in the
cpumask, see thunder_uncore_attr_show_cpumask().

> You _must_ ensure this kernel-side, regardless of what the perf tool
> happens to do.
> 
> See the arm-cci and arm-ccn drivers for an example.
> 
> You can also follow the migration approach used there to allow you to
> retain counting across a hotplug.
> 
> [...]
> 
> > +static int __init thunder_uncore_init(void)
> > +{
> > +   unsigned long implementor = read_cpuid_implementor();
> > +   unsigned long part_number = read_cpuid_part_number();
> > +   u32 variant;
> > +
> > +   if (implementor != ARM_CPU_IMP_CAVIUM ||
> > +   part_number != CAVIUM_CPU_PART_THUNDERX)
> > +   return -ENODEV;
> > +
> > +   /* detect pass2 which contains different counters */
> > +   variant = MIDR_VARIANT(read_cpuid_id());
> > +   if (variant == 1)
> > +   thunder_uncore_version = 1;
> > +   pr_info("PMU version: %d\n", thunder_uncore_version);
> > +
> > +   return 0;
> > +}
> 
> You should call out these differences in the commmit message.

OK

> Mark.

Thanks, Jan


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Mon, Feb 15, 2016 at 03:07:20PM +0100, Jan Glauber wrote:
> Hi Mark,
> 
> thanks for reviewing! I'll need several mails to address all questions.
> 
> On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > > Provide uncore facilities for non-CPU performance counter units.
> > > Based on Intel/AMD uncore pmu support.
> > > 
> > > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > > All counters are exported via sysfs in the corresponding events
> > > files under the PMU directory so the perf tool can list the event names.
> > 
> > It turns out that "uncore" covers quite a lot of things.
> > 
> > Where exactly do the see counters live? system, socket, cluster?
> 
> Neither cluster nor socket, so I would say system. Where a system may
> consist of 2 nodes that mostly appear as one system.
> 
> > Are there potentially multiple instances of a given PMU in the system?
> > e.g. might each clutster have an instance of an L2 PMU?
> 
> Yes.
> 
> > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
> > inaccessible?
> 
> No, they are not related to CPUs.

Ok. So I should be able to concurrently hotplug random CPUs on/off while
this driver is running, without issues? No registers might be
clock-gated or similar?

I appreciate that they are not "related" to particular CPUs as such.

> > > 1) The PMU detection solely relies on PCI device detection. If a
> > >matching PCI device is found the PMU is created. The code can deal
> > >with multiple units of the same type, e.g. more than one memory
> > >controller.
> > 
> > I see below that the driver has an initcall that runs regardless of
> > whether the PCI device exists, and looks at the MIDR. That's clearly not
> > string PCI device detection.
> > 
> > Why is this not a true PCI driver that only gets probed if the PCI
> > device exists? 
> 
> It is not a PCI driver because there are already drivers like edac that
> will access these PCI devices. The uncore driver only accesses the
> performance counters, which are not used by the other drivers.

Several drivers are accessing the same device?

That sounds somewhat scary.

> > > +#include 
> > > +#include 
> > 
> > I don't see why you should need these two if this is truly an uncore
> > device probed solely from PCI.
> 
> There are several passes of the hardware that have the same PCI device
> ID. Therefore I need the CPU variant to distinguish them. This could
> be done _after_ the PCI device is found but I found it easier to
> implement the check once in the common setup function.

Ok. Please call that out in the commit message.

> > > +int thunder_uncore_event_init(struct perf_event *event)
> > > +{
> > > + struct hw_perf_event *hwc = >hw;
> > > + struct thunder_uncore *uncore;
> > > +
> > > + if (event->attr.type != event->pmu->type)
> > > + return -ENOENT;
> > > +
> > > + /* we do not support sampling */
> > > + if (is_sampling_event(event))
> > > + return -EINVAL;
> > > +
> > > + /* counters do not have these bits */
> > > + if (event->attr.exclude_user||
> > > + event->attr.exclude_kernel  ||
> > > + event->attr.exclude_host||
> > > + event->attr.exclude_guest   ||
> > > + event->attr.exclude_hv  ||
> > > + event->attr.exclude_idle)
> > > + return -EINVAL;
> > 
> > We should _really_ make these features opt-in at the core level. It's
> > crazy that each and every PMU drivers has to explicitly test and reject
> > things it doesn't support.
> 
> Completely agreed. Also, every sample code I looked at did
> check for other bits...
> 
> [...]
> 
> > > +
> > > + uncore = event_to_thunder_uncore(event);
> > > + if (!uncore)
> > > + return -ENODEV;
> > > + if (!uncore->event_valid(event->attr.config))
> > > + return -EINVAL;
> > > +
> > > + hwc->config = event->attr.config;
> > > + hwc->idx = -1;
> > > +
> > > + /* and we don't care about CPU */
> > 
> > Actually, you do. You want the perf core to serialize accesses via the
> > same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> > there are a tonne of problems you don't even want to think about.
> 
> I found that perf added the events on every CPU in the system. Because
> the uncore events are not CPU related I wanted to avoid this. Setting
> cpumask to -1 did not work. Therefore I added a single CPU in the
> cpumask, see thunder_uncore_attr_show_cpumask().

I understand that, which is why I wrote:

> > You _must_ ensure this kernel-side, regardless of what the perf tool
> > happens to do.
> > 
> > See the arm-cci and arm-ccn drivers for an example.

Take a look at drivers/bus/arm-cci.c; specifically, what we do in
cci_pmu_event_init and cci_pmu_cpu_notifier.

This is the same thing that's done for x86 system PMUs. Take a look at
uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c.

Otherwise there are a number of situations where 

Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Mon, Feb 15, 2016 at 03:07:20PM +0100, Jan Glauber wrote:
> Hi Mark,
> 
> thanks for reviewing! I'll need several mails to address all questions.
> 
> On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote:
> > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> > > Provide uncore facilities for non-CPU performance counter units.
> > > Based on Intel/AMD uncore pmu support.
> > > 
> > > The uncore PMUs can be found under /sys/bus/event_source/devices.
> > > All counters are exported via sysfs in the corresponding events
> > > files under the PMU directory so the perf tool can list the event names.
> > 
> > It turns out that "uncore" covers quite a lot of things.
> > 
> > Where exactly do the see counters live? system, socket, cluster?
> 
> Neither cluster nor socket, so I would say system. Where a system may
> consist of 2 nodes that mostly appear as one system.
> 
> > Are there potentially multiple instances of a given PMU in the system?
> > e.g. might each clutster have an instance of an L2 PMU?
> 
> Yes.
> 
> > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
> > inaccessible?
> 
> No, they are not related to CPUs.

Ok. So I should be able to concurrently hotplug random CPUs on/off while
this driver is running, without issues? No registers might be
clock-gated or similar?

I appreciate that they are not "related" to particular CPUs as such.

> > > 1) The PMU detection solely relies on PCI device detection. If a
> > >matching PCI device is found the PMU is created. The code can deal
> > >with multiple units of the same type, e.g. more than one memory
> > >controller.
> > 
> > I see below that the driver has an initcall that runs regardless of
> > whether the PCI device exists, and looks at the MIDR. That's clearly not
> > string PCI device detection.
> > 
> > Why is this not a true PCI driver that only gets probed if the PCI
> > device exists? 
> 
> It is not a PCI driver because there are already drivers like edac that
> will access these PCI devices. The uncore driver only accesses the
> performance counters, which are not used by the other drivers.

Several drivers are accessing the same device?

That sounds somewhat scary.

> > > +#include 
> > > +#include 
> > 
> > I don't see why you should need these two if this is truly an uncore
> > device probed solely from PCI.
> 
> There are several passes of the hardware that have the same PCI device
> ID. Therefore I need the CPU variant to distinguish them. This could
> be done _after_ the PCI device is found but I found it easier to
> implement the check once in the common setup function.

Ok. Please call that out in the commit message.

> > > +int thunder_uncore_event_init(struct perf_event *event)
> > > +{
> > > + struct hw_perf_event *hwc = >hw;
> > > + struct thunder_uncore *uncore;
> > > +
> > > + if (event->attr.type != event->pmu->type)
> > > + return -ENOENT;
> > > +
> > > + /* we do not support sampling */
> > > + if (is_sampling_event(event))
> > > + return -EINVAL;
> > > +
> > > + /* counters do not have these bits */
> > > + if (event->attr.exclude_user||
> > > + event->attr.exclude_kernel  ||
> > > + event->attr.exclude_host||
> > > + event->attr.exclude_guest   ||
> > > + event->attr.exclude_hv  ||
> > > + event->attr.exclude_idle)
> > > + return -EINVAL;
> > 
> > We should _really_ make these features opt-in at the core level. It's
> > crazy that each and every PMU drivers has to explicitly test and reject
> > things it doesn't support.
> 
> Completely agreed. Also, every sample code I looked at did
> check for other bits...
> 
> [...]
> 
> > > +
> > > + uncore = event_to_thunder_uncore(event);
> > > + if (!uncore)
> > > + return -ENODEV;
> > > + if (!uncore->event_valid(event->attr.config))
> > > + return -EINVAL;
> > > +
> > > + hwc->config = event->attr.config;
> > > + hwc->idx = -1;
> > > +
> > > + /* and we don't care about CPU */
> > 
> > Actually, you do. You want the perf core to serialize accesses via the
> > same CPU, so all events _must_ be targetted at the same CPU. Otherwise
> > there are a tonne of problems you don't even want to think about.
> 
> I found that perf added the events on every CPU in the system. Because
> the uncore events are not CPU related I wanted to avoid this. Setting
> cpumask to -1 did not work. Therefore I added a single CPU in the
> cpumask, see thunder_uncore_attr_show_cpumask().

I understand that, which is why I wrote:

> > You _must_ ensure this kernel-side, regardless of what the perf tool
> > happens to do.
> > 
> > See the arm-cci and arm-ccn drivers for an example.

Take a look at drivers/bus/arm-cci.c; specifically, what we do in
cci_pmu_event_init and cci_pmu_cpu_notifier.

This is the same thing that's done for x86 system PMUs. Take a look at
uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c.

Otherwise there are a number of situations where 

Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Fri, Feb 12, 2016 at 05:47:25PM -0800, David Daney wrote:
> On 02/12/2016 09:36 AM, Mark Rutland wrote:
> >On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> [...]
> >>2) Counters are summarized across the different units of the same type,
> >>e.g. L2C TAD 0..7 is presented as a single counter (adding the
> >>values from TAD 0 to 7). Although losing the ability to read a
> >>single value the merged values are easier to use and yield
> >>enough information.
> >
> >I'm not sure I follow this. What is easier? What are you doing, and what
> >are you comparing that with to say that your approach is easier?
> >
> >It sounds like it should be possible to handle multiple counters like
> >this, so I don't follow why you want to amalgamate them in-kernel.
> >
> 
> The values of the individual counters are close to meaningless.  The
> only thing that is meaningful to someone reading the counters is the
> aggregate sum of all the counts.

I obviously know nowhere near enough about your system to say with
certainty, but it may turn out that being able to track counters
individually is useful for some profiling/debugging scenario. How
meaningful the individual counts are really depends on what you're
trying to figure out.

If you believe that aggregate values are sufficient, then I'm happy to
leave that as-is.

> >>+void thunder_uncore_read(struct perf_event *event)
> >>+{
> >>+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> >>+   struct hw_perf_event *hwc = >hw;
> >>+   u64 prev, new = 0;
> >>+   s64 delta;
> >>+   int i;
> >>+
> >>+   /*
> >>+* since we do not enable counter overflow interrupts,
> >>+* we do not have to worry about prev_count changing on us
> >>+*/
> >
> >Without overflow interrupts, how do you ensure that you account for
> >overflow in a reasonable time window (i.e. before the counter runs past
> >its initial value)?
> 
> Two reasons:
> 
>   1) There are no interrupts.
> 
>   2) The counters are 64-bit, they never overflow.

Ok. Please point this out in the comment so that reviewers aren't
misled. Stating that we don't enable an interrupt implies that said
interrupt exists.

> >>+   /* and we do not enable counter overflow interrupts */
> >
> >That statement raises far more questions than it answers.
> >
> >_why_ do we not user overflow interrupts?
> 
> As stated above, there are *no* overflow interrupts.

Ok. As stated above, please fix this comment to not mislead.

> The events we are counting cannot be attributed to any one (or even
> any) CPU, they run asynchronous to the CPU, so even if there were
> interrupts, they would be meaningless.

Yes, they are meaningless w.r.t. the state of an arbitrary CPU.

Were they to exist you could use them to drive other snapshotting of the
state of the uncore PMU, to get an idea of the frequency/stability of
events over time, etc. Userspace might then decide to snapshot other
whole system state based on events fed to it.

That's moot if they don't exist.

Mark.


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-15 Thread Mark Rutland
On Fri, Feb 12, 2016 at 05:47:25PM -0800, David Daney wrote:
> On 02/12/2016 09:36 AM, Mark Rutland wrote:
> >On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> [...]
> >>2) Counters are summarized across the different units of the same type,
> >>e.g. L2C TAD 0..7 is presented as a single counter (adding the
> >>values from TAD 0 to 7). Although losing the ability to read a
> >>single value the merged values are easier to use and yield
> >>enough information.
> >
> >I'm not sure I follow this. What is easier? What are you doing, and what
> >are you comparing that with to say that your approach is easier?
> >
> >It sounds like it should be possible to handle multiple counters like
> >this, so I don't follow why you want to amalgamate them in-kernel.
> >
> 
> The values of the individual counters are close to meaningless.  The
> only thing that is meaningful to someone reading the counters is the
> aggregate sum of all the counts.

I obviously know nowhere near enough about your system to say with
certainty, but it may turn out that being able to track counters
individually is useful for some profiling/debugging scenario. How
meaningful the individual counts are really depends on what you're
trying to figure out.

If you believe that aggregate values are sufficient, then I'm happy to
leave that as-is.

> >>+void thunder_uncore_read(struct perf_event *event)
> >>+{
> >>+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> >>+   struct hw_perf_event *hwc = >hw;
> >>+   u64 prev, new = 0;
> >>+   s64 delta;
> >>+   int i;
> >>+
> >>+   /*
> >>+* since we do not enable counter overflow interrupts,
> >>+* we do not have to worry about prev_count changing on us
> >>+*/
> >
> >Without overflow interrupts, how do you ensure that you account for
> >overflow in a reasonable time window (i.e. before the counter runs past
> >its initial value)?
> 
> Two reasons:
> 
>   1) There are no interrupts.
> 
>   2) The counters are 64-bit, they never overflow.

Ok. Please point this out in the comment so that reviewers aren't
misled. Stating that we don't enable an interrupt implies that said
interrupt exists.

> >>+   /* and we do not enable counter overflow interrupts */
> >
> >That statement raises far more questions than it answers.
> >
> >_why_ do we not user overflow interrupts?
> 
> As stated above, there are *no* overflow interrupts.

Ok. As stated above, please fix this comment to not mislead.

> The events we are counting cannot be attributed to any one (or even
> any) CPU, they run asynchronous to the CPU, so even if there were
> interrupts, they would be meaningless.

Yes, they are meaningless w.r.t. the state of an arbitrary CPU.

Were they to exist you could use them to drive other snapshotting of the
state of the uncore PMU, to get an idea of the frequency/stability of
events over time, etc. Userspace might then decide to snapshot other
whole system state based on events fed to it.

That's moot if they don't exist.

Mark.


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread David Daney

On 02/12/2016 09:36 AM, Mark Rutland wrote:

On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:

[...]

2) Counters are summarized across the different units of the same type,
e.g. L2C TAD 0..7 is presented as a single counter (adding the
values from TAD 0 to 7). Although losing the ability to read a
single value the merged values are easier to use and yield
enough information.


I'm not sure I follow this. What is easier? What are you doing, and what
are you comparing that with to say that your approach is easier?

It sounds like it should be possible to handle multiple counters like
this, so I don't follow why you want to amalgamate them in-kernel.



The values of the individual counters are close to meaningless.  The 
only thing that is meaningful to someone reading the counters is the 
aggregate sum of all the counts.




[...]


+#include 
+#include 


I don't see why you should need these two if this is truly an uncore
device probed solely from PCI.


+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   u64 prev, new = 0;
+   s64 delta;
+   int i;
+
+   /*
+* since we do not enable counter overflow interrupts,
+* we do not have to worry about prev_count changing on us
+*/


Without overflow interrupts, how do you ensure that you account for
overflow in a reasonable time window (i.e. before the counter runs past
its initial value)?


Two reasons:

  1) There are no interrupts.

  2) The counters are 64-bit, they never overflow.




+
+   prev = local64_read(>prev_count);
+
+   /* read counter values from all units */
+   for (i = 0; i < uncore->nr_units; i++)
+   new += readq(map_offset(hwc->event_base, uncore, i));


There's no bit to determine whether an overflow occurred?


No.





+
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+void thunder_uncore_del(struct perf_event *event, int flags)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   int i;
+
+   event->pmu->stop(event, PERF_EF_UPDATE);
+
+   for (i = 0; i < uncore->num_counters; i++) {
+   if (cmpxchg(>events[i], event, NULL) == event)
+   break;
+   }
+   hwc->idx = -1;
+}


Why not just place the event at uncode->events[hwc->idx] ?

Theat way removing the event is trivial.


+int thunder_uncore_event_init(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore *uncore;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* we do not support sampling */
+   if (is_sampling_event(event))
+   return -EINVAL;
+
+   /* counters do not have these bits */
+   if (event->attr.exclude_user ||
+   event->attr.exclude_kernel   ||
+   event->attr.exclude_host ||
+   event->attr.exclude_guest||
+   event->attr.exclude_hv   ||
+   event->attr.exclude_idle)
+   return -EINVAL;


We should _really_ make these features opt-in at the core level. It's
crazy that each and every PMU drivers has to explicitly test and reject
things it doesn't support.


+
+   /* and we do not enable counter overflow interrupts */


That statement raises far more questions than it answers.

_why_ do we not user overflow interrupts?


As stated above, there are *no* overflow interrupts.

The events we are counting cannot be attributed to any one (or even any) 
CPU, they run asynchronous to the CPU, so even if there were interrupts, 
they would be meaningless.



David Daney


Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread Mark Rutland
On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> Provide uncore facilities for non-CPU performance counter units.
> Based on Intel/AMD uncore pmu support.
> 
> The uncore PMUs can be found under /sys/bus/event_source/devices.
> All counters are exported via sysfs in the corresponding events
> files under the PMU directory so the perf tool can list the event names.

It turns out that "uncore" covers quite a lot of things.

Where exactly do the see counters live? system, socket, cluster?

Are there potentially multiple instances of a given PMU in the system?
e.g. might each clutster have an instance of an L2 PMU?

If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
inaccessible?

Otherwise, are they associated with some power domain?

> There are 2 points that are special in this implementation:
> 
> 1) The PMU detection solely relies on PCI device detection. If a
>matching PCI device is found the PMU is created. The code can deal
>with multiple units of the same type, e.g. more than one memory
>controller.

I see below that the driver has an initcall that runs regardless of
whether the PCI device exists, and looks at the MIDR. That's clearly not
string PCI device detection.

Why is this not a true PCI driver that only gets probed if the PCI
device exists? 

> 2) Counters are summarized across the different units of the same type,
>e.g. L2C TAD 0..7 is presented as a single counter (adding the
>values from TAD 0 to 7). Although losing the ability to read a
>single value the merged values are easier to use and yield
>enough information.

I'm not sure I follow this. What is easier? What are you doing, and what
are you comparing that with to say that your approach is easier?

It sounds like it should be possible to handle multiple counters like
this, so I don't follow why you want to amalgamate them in-kernel.

[...]

> +#include 
> +#include 

I don't see why you should need these two if this is truly an uncore
device probed solely from PCI.

> +void thunder_uncore_read(struct perf_event *event)
> +{
> + struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> + struct hw_perf_event *hwc = >hw;
> + u64 prev, new = 0;
> + s64 delta;
> + int i;
> +
> + /*
> +  * since we do not enable counter overflow interrupts,
> +  * we do not have to worry about prev_count changing on us
> +  */

Without overflow interrupts, how do you ensure that you account for
overflow in a reasonable time window (i.e. before the counter runs past
its initial value)?

> +
> + prev = local64_read(>prev_count);
> +
> + /* read counter values from all units */
> + for (i = 0; i < uncore->nr_units; i++)
> + new += readq(map_offset(hwc->event_base, uncore, i));

There's no bit to determine whether an overflow occurred?

> +
> + local64_set(>prev_count, new);
> + delta = new - prev;
> + local64_add(delta, >count);
> +}
> +
> +void thunder_uncore_del(struct perf_event *event, int flags)
> +{
> + struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> + struct hw_perf_event *hwc = >hw;
> + int i;
> +
> + event->pmu->stop(event, PERF_EF_UPDATE);
> +
> + for (i = 0; i < uncore->num_counters; i++) {
> + if (cmpxchg(>events[i], event, NULL) == event)
> + break;
> + }
> + hwc->idx = -1;
> +}

Why not just place the event at uncode->events[hwc->idx] ?

Theat way removing the event is trivial.

> +int thunder_uncore_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore *uncore;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /* we do not support sampling */
> + if (is_sampling_event(event))
> + return -EINVAL;
> +
> + /* counters do not have these bits */
> + if (event->attr.exclude_user||
> + event->attr.exclude_kernel  ||
> + event->attr.exclude_host||
> + event->attr.exclude_guest   ||
> + event->attr.exclude_hv  ||
> + event->attr.exclude_idle)
> + return -EINVAL;

We should _really_ make these features opt-in at the core level. It's
crazy that each and every PMU drivers has to explicitly test and reject
things it doesn't support.

> +
> + /* and we do not enable counter overflow interrupts */

That statement raises far more questions than it answers.

_why_ do we not user overflow interrupts?

> +
> + uncore = event_to_thunder_uncore(event);
> + if (!uncore)
> + return -ENODEV;
> + if (!uncore->event_valid(event->attr.config))
> + return -EINVAL;
> +
> + hwc->config = event->attr.config;
> + hwc->idx = -1;
> +
> + /* and we don't care about CPU */

Actually, you do. You want the perf core to serialize accesses via the
same CPU, so all events _must_ be targetted at the same CPU. Otherwise
there 

[RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread Jan Glauber
Provide uncore facilities for non-CPU performance counter units.
Based on Intel/AMD uncore pmu support.

The uncore PMUs can be found under /sys/bus/event_source/devices.
All counters are exported via sysfs in the corresponding events
files under the PMU directory so the perf tool can list the event names.

There are 2 points that are special in this implementation:

1) The PMU detection solely relies on PCI device detection. If a
   matching PCI device is found the PMU is created. The code can deal
   with multiple units of the same type, e.g. more than one memory
   controller.

2) Counters are summarized across the different units of the same type,
   e.g. L2C TAD 0..7 is presented as a single counter (adding the
   values from TAD 0 to 7). Although losing the ability to read a
   single value the merged values are easier to use and yield
   enough information.

Signed-off-by: Jan Glauber 
---
 arch/arm64/kernel/Makefile   |   1 +
 arch/arm64/kernel/uncore/Makefile|   1 +
 arch/arm64/kernel/uncore/uncore_cavium.c | 210 +++
 arch/arm64/kernel/uncore/uncore_cavium.h |  73 +++
 4 files changed, 285 insertions(+)
 create mode 100644 arch/arm64/kernel/uncore/Makefile
 create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.c
 create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.h

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 83cd7e6..c2d2810 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -42,6 +42,7 @@ arm64-obj-$(CONFIG_PCI)   += pci.o
 arm64-obj-$(CONFIG_ARMV8_DEPRECATED)   += armv8_deprecated.o
 arm64-obj-$(CONFIG_ACPI)   += acpi.o
 arm64-obj-$(CONFIG_PARAVIRT)   += paravirt.o
+arm64-obj-$(CONFIG_ARCH_THUNDER)   += uncore/
 
 obj-y  += $(arm64-obj-y) vdso/
 obj-m  += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/uncore/Makefile 
b/arch/arm64/kernel/uncore/Makefile
new file mode 100644
index 000..b9c72c2
--- /dev/null
+++ b/arch/arm64/kernel/uncore/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_ARCH_THUNDER) += uncore_cavium.o
diff --git a/arch/arm64/kernel/uncore/uncore_cavium.c 
b/arch/arm64/kernel/uncore/uncore_cavium.c
new file mode 100644
index 000..0cfcc83
--- /dev/null
+++ b/arch/arm64/kernel/uncore/uncore_cavium.c
@@ -0,0 +1,210 @@
+/*
+ * Cavium Thunder uncore PMU support. Derived from Intel and AMD uncore code.
+ *
+ * Copyright (C) 2015,2016 Cavium Inc.
+ * Author: Jan Glauber 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "uncore_cavium.h"
+
+int thunder_uncore_version;
+
+struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event)
+{
+   return NULL;
+}
+
+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   u64 prev, new = 0;
+   s64 delta;
+   int i;
+
+   /*
+* since we do not enable counter overflow interrupts,
+* we do not have to worry about prev_count changing on us
+*/
+
+   prev = local64_read(>prev_count);
+
+   /* read counter values from all units */
+   for (i = 0; i < uncore->nr_units; i++)
+   new += readq(map_offset(hwc->event_base, uncore, i));
+
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+void thunder_uncore_del(struct perf_event *event, int flags)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   int i;
+
+   event->pmu->stop(event, PERF_EF_UPDATE);
+
+   for (i = 0; i < uncore->num_counters; i++) {
+   if (cmpxchg(>events[i], event, NULL) == event)
+   break;
+   }
+   hwc->idx = -1;
+}
+
+int thunder_uncore_event_init(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore *uncore;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* we do not support sampling */
+   if (is_sampling_event(event))
+   return -EINVAL;
+
+   /* counters do not have these bits */
+   if (event->attr.exclude_user||
+   event->attr.exclude_kernel  ||
+   event->attr.exclude_host||
+   event->attr.exclude_guest   ||
+   event->attr.exclude_hv  ||
+   event->attr.exclude_idle)
+   return -EINVAL;
+
+   /* and we do not enable counter overflow interrupts */
+
+   uncore = event_to_thunder_uncore(event);
+   if (!uncore)
+   return -ENODEV;
+   if (!uncore->event_valid(event->attr.config))
+   return -EINVAL;
+
+   hwc->config = event->attr.config;
+   hwc->idx = -1;
+
+   /* and we don't care about 

[RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread Jan Glauber
Provide uncore facilities for non-CPU performance counter units.
Based on Intel/AMD uncore pmu support.

The uncore PMUs can be found under /sys/bus/event_source/devices.
All counters are exported via sysfs in the corresponding events
files under the PMU directory so the perf tool can list the event names.

There are 2 points that are special in this implementation:

1) The PMU detection solely relies on PCI device detection. If a
   matching PCI device is found the PMU is created. The code can deal
   with multiple units of the same type, e.g. more than one memory
   controller.

2) Counters are summarized across the different units of the same type,
   e.g. L2C TAD 0..7 is presented as a single counter (adding the
   values from TAD 0 to 7). Although losing the ability to read a
   single value the merged values are easier to use and yield
   enough information.

Signed-off-by: Jan Glauber 
---
 arch/arm64/kernel/Makefile   |   1 +
 arch/arm64/kernel/uncore/Makefile|   1 +
 arch/arm64/kernel/uncore/uncore_cavium.c | 210 +++
 arch/arm64/kernel/uncore/uncore_cavium.h |  73 +++
 4 files changed, 285 insertions(+)
 create mode 100644 arch/arm64/kernel/uncore/Makefile
 create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.c
 create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.h

diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile
index 83cd7e6..c2d2810 100644
--- a/arch/arm64/kernel/Makefile
+++ b/arch/arm64/kernel/Makefile
@@ -42,6 +42,7 @@ arm64-obj-$(CONFIG_PCI)   += pci.o
 arm64-obj-$(CONFIG_ARMV8_DEPRECATED)   += armv8_deprecated.o
 arm64-obj-$(CONFIG_ACPI)   += acpi.o
 arm64-obj-$(CONFIG_PARAVIRT)   += paravirt.o
+arm64-obj-$(CONFIG_ARCH_THUNDER)   += uncore/
 
 obj-y  += $(arm64-obj-y) vdso/
 obj-m  += $(arm64-obj-m)
diff --git a/arch/arm64/kernel/uncore/Makefile 
b/arch/arm64/kernel/uncore/Makefile
new file mode 100644
index 000..b9c72c2
--- /dev/null
+++ b/arch/arm64/kernel/uncore/Makefile
@@ -0,0 +1 @@
+obj-$(CONFIG_ARCH_THUNDER) += uncore_cavium.o
diff --git a/arch/arm64/kernel/uncore/uncore_cavium.c 
b/arch/arm64/kernel/uncore/uncore_cavium.c
new file mode 100644
index 000..0cfcc83
--- /dev/null
+++ b/arch/arm64/kernel/uncore/uncore_cavium.c
@@ -0,0 +1,210 @@
+/*
+ * Cavium Thunder uncore PMU support. Derived from Intel and AMD uncore code.
+ *
+ * Copyright (C) 2015,2016 Cavium Inc.
+ * Author: Jan Glauber 
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+
+#include "uncore_cavium.h"
+
+int thunder_uncore_version;
+
+struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event)
+{
+   return NULL;
+}
+
+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   u64 prev, new = 0;
+   s64 delta;
+   int i;
+
+   /*
+* since we do not enable counter overflow interrupts,
+* we do not have to worry about prev_count changing on us
+*/
+
+   prev = local64_read(>prev_count);
+
+   /* read counter values from all units */
+   for (i = 0; i < uncore->nr_units; i++)
+   new += readq(map_offset(hwc->event_base, uncore, i));
+
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+void thunder_uncore_del(struct perf_event *event, int flags)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   int i;
+
+   event->pmu->stop(event, PERF_EF_UPDATE);
+
+   for (i = 0; i < uncore->num_counters; i++) {
+   if (cmpxchg(>events[i], event, NULL) == event)
+   break;
+   }
+   hwc->idx = -1;
+}
+
+int thunder_uncore_event_init(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore *uncore;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* we do not support sampling */
+   if (is_sampling_event(event))
+   return -EINVAL;
+
+   /* counters do not have these bits */
+   if (event->attr.exclude_user||
+   event->attr.exclude_kernel  ||
+   event->attr.exclude_host||
+   event->attr.exclude_guest   ||
+   event->attr.exclude_hv  ||
+   event->attr.exclude_idle)
+   return -EINVAL;
+
+   /* and we do not enable counter overflow interrupts */
+
+   uncore = event_to_thunder_uncore(event);
+   if (!uncore)
+   return -ENODEV;
+   if (!uncore->event_valid(event->attr.config))
+   return -EINVAL;
+
+   hwc->config = event->attr.config;
+   

Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread Mark Rutland
On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:
> Provide uncore facilities for non-CPU performance counter units.
> Based on Intel/AMD uncore pmu support.
> 
> The uncore PMUs can be found under /sys/bus/event_source/devices.
> All counters are exported via sysfs in the corresponding events
> files under the PMU directory so the perf tool can list the event names.

It turns out that "uncore" covers quite a lot of things.

Where exactly do the see counters live? system, socket, cluster?

Are there potentially multiple instances of a given PMU in the system?
e.g. might each clutster have an instance of an L2 PMU?

If I turn off a set of CPUs, do any "uncore" PMUs lost context or become
inaccessible?

Otherwise, are they associated with some power domain?

> There are 2 points that are special in this implementation:
> 
> 1) The PMU detection solely relies on PCI device detection. If a
>matching PCI device is found the PMU is created. The code can deal
>with multiple units of the same type, e.g. more than one memory
>controller.

I see below that the driver has an initcall that runs regardless of
whether the PCI device exists, and looks at the MIDR. That's clearly not
string PCI device detection.

Why is this not a true PCI driver that only gets probed if the PCI
device exists? 

> 2) Counters are summarized across the different units of the same type,
>e.g. L2C TAD 0..7 is presented as a single counter (adding the
>values from TAD 0 to 7). Although losing the ability to read a
>single value the merged values are easier to use and yield
>enough information.

I'm not sure I follow this. What is easier? What are you doing, and what
are you comparing that with to say that your approach is easier?

It sounds like it should be possible to handle multiple counters like
this, so I don't follow why you want to amalgamate them in-kernel.

[...]

> +#include 
> +#include 

I don't see why you should need these two if this is truly an uncore
device probed solely from PCI.

> +void thunder_uncore_read(struct perf_event *event)
> +{
> + struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> + struct hw_perf_event *hwc = >hw;
> + u64 prev, new = 0;
> + s64 delta;
> + int i;
> +
> + /*
> +  * since we do not enable counter overflow interrupts,
> +  * we do not have to worry about prev_count changing on us
> +  */

Without overflow interrupts, how do you ensure that you account for
overflow in a reasonable time window (i.e. before the counter runs past
its initial value)?

> +
> + prev = local64_read(>prev_count);
> +
> + /* read counter values from all units */
> + for (i = 0; i < uncore->nr_units; i++)
> + new += readq(map_offset(hwc->event_base, uncore, i));

There's no bit to determine whether an overflow occurred?

> +
> + local64_set(>prev_count, new);
> + delta = new - prev;
> + local64_add(delta, >count);
> +}
> +
> +void thunder_uncore_del(struct perf_event *event, int flags)
> +{
> + struct thunder_uncore *uncore = event_to_thunder_uncore(event);
> + struct hw_perf_event *hwc = >hw;
> + int i;
> +
> + event->pmu->stop(event, PERF_EF_UPDATE);
> +
> + for (i = 0; i < uncore->num_counters; i++) {
> + if (cmpxchg(>events[i], event, NULL) == event)
> + break;
> + }
> + hwc->idx = -1;
> +}

Why not just place the event at uncode->events[hwc->idx] ?

Theat way removing the event is trivial.

> +int thunder_uncore_event_init(struct perf_event *event)
> +{
> + struct hw_perf_event *hwc = >hw;
> + struct thunder_uncore *uncore;
> +
> + if (event->attr.type != event->pmu->type)
> + return -ENOENT;
> +
> + /* we do not support sampling */
> + if (is_sampling_event(event))
> + return -EINVAL;
> +
> + /* counters do not have these bits */
> + if (event->attr.exclude_user||
> + event->attr.exclude_kernel  ||
> + event->attr.exclude_host||
> + event->attr.exclude_guest   ||
> + event->attr.exclude_hv  ||
> + event->attr.exclude_idle)
> + return -EINVAL;

We should _really_ make these features opt-in at the core level. It's
crazy that each and every PMU drivers has to explicitly test and reject
things it doesn't support.

> +
> + /* and we do not enable counter overflow interrupts */

That statement raises far more questions than it answers.

_why_ do we not user overflow interrupts?

> +
> + uncore = event_to_thunder_uncore(event);
> + if (!uncore)
> + return -ENODEV;
> + if (!uncore->event_valid(event->attr.config))
> + return -EINVAL;
> +
> + hwc->config = event->attr.config;
> + hwc->idx = -1;
> +
> + /* and we don't care about CPU */

Actually, you do. You want the perf core to serialize accesses via the
same CPU, so all events _must_ be targetted at the same CPU. Otherwise
there 

Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX

2016-02-12 Thread David Daney

On 02/12/2016 09:36 AM, Mark Rutland wrote:

On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote:

[...]

2) Counters are summarized across the different units of the same type,
e.g. L2C TAD 0..7 is presented as a single counter (adding the
values from TAD 0 to 7). Although losing the ability to read a
single value the merged values are easier to use and yield
enough information.


I'm not sure I follow this. What is easier? What are you doing, and what
are you comparing that with to say that your approach is easier?

It sounds like it should be possible to handle multiple counters like
this, so I don't follow why you want to amalgamate them in-kernel.



The values of the individual counters are close to meaningless.  The 
only thing that is meaningful to someone reading the counters is the 
aggregate sum of all the counts.




[...]


+#include 
+#include 


I don't see why you should need these two if this is truly an uncore
device probed solely from PCI.


+void thunder_uncore_read(struct perf_event *event)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   u64 prev, new = 0;
+   s64 delta;
+   int i;
+
+   /*
+* since we do not enable counter overflow interrupts,
+* we do not have to worry about prev_count changing on us
+*/


Without overflow interrupts, how do you ensure that you account for
overflow in a reasonable time window (i.e. before the counter runs past
its initial value)?


Two reasons:

  1) There are no interrupts.

  2) The counters are 64-bit, they never overflow.




+
+   prev = local64_read(>prev_count);
+
+   /* read counter values from all units */
+   for (i = 0; i < uncore->nr_units; i++)
+   new += readq(map_offset(hwc->event_base, uncore, i));


There's no bit to determine whether an overflow occurred?


No.





+
+   local64_set(>prev_count, new);
+   delta = new - prev;
+   local64_add(delta, >count);
+}
+
+void thunder_uncore_del(struct perf_event *event, int flags)
+{
+   struct thunder_uncore *uncore = event_to_thunder_uncore(event);
+   struct hw_perf_event *hwc = >hw;
+   int i;
+
+   event->pmu->stop(event, PERF_EF_UPDATE);
+
+   for (i = 0; i < uncore->num_counters; i++) {
+   if (cmpxchg(>events[i], event, NULL) == event)
+   break;
+   }
+   hwc->idx = -1;
+}


Why not just place the event at uncode->events[hwc->idx] ?

Theat way removing the event is trivial.


+int thunder_uncore_event_init(struct perf_event *event)
+{
+   struct hw_perf_event *hwc = >hw;
+   struct thunder_uncore *uncore;
+
+   if (event->attr.type != event->pmu->type)
+   return -ENOENT;
+
+   /* we do not support sampling */
+   if (is_sampling_event(event))
+   return -EINVAL;
+
+   /* counters do not have these bits */
+   if (event->attr.exclude_user ||
+   event->attr.exclude_kernel   ||
+   event->attr.exclude_host ||
+   event->attr.exclude_guest||
+   event->attr.exclude_hv   ||
+   event->attr.exclude_idle)
+   return -EINVAL;


We should _really_ make these features opt-in at the core level. It's
crazy that each and every PMU drivers has to explicitly test and reject
things it doesn't support.


+
+   /* and we do not enable counter overflow interrupts */


That statement raises far more questions than it answers.

_why_ do we not user overflow interrupts?


As stated above, there are *no* overflow interrupts.

The events we are counting cannot be attributed to any one (or even any) 
CPU, they run asynchronous to the CPU, so even if there were interrupts, 
they would be meaningless.



David Daney