Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: [...] > > +int thunder_uncore_event_init(struct perf_event *event) > > +{ > > + struct hw_perf_event *hwc = >hw; > > + struct thunder_uncore *uncore; > > + > > + if (event->attr.type != event->pmu->type) > > + return -ENOENT; > > + > > + /* we do not support sampling */ > > + if (is_sampling_event(event)) > > + return -EINVAL; > > + > > + /* counters do not have these bits */ > > + if (event->attr.exclude_user|| > > + event->attr.exclude_kernel || > > + event->attr.exclude_host|| > > + event->attr.exclude_guest || > > + event->attr.exclude_hv || > > + event->attr.exclude_idle) > > + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. > Looking at the exclude_* feature handling in pmu->event_init under arch/ shows lots of differences. Just as an example, exclude_idle returns sometimes -EINVAL, sometimes -EPERM while other archs ignore it and one silently deletes the flag. So it looks hard to me to move the exclude handling into perf core while keeping the per-arch differences. And if we don't and return an error on the perf_event_open syscall in a newer kernel for an exclude bit that was previously ignored it will be a user-space regression, right? Looking only at the uncore drivers (plus drivers/bus/arm-cc*) it looks like they all don't support any exclude bits but the handling here differs also. The table shows the current behaviour, X means the requested exclude flag is refused. user kernel hostguesthv idle - x86 uncore intel| X X X X x86 uncore intel snb| X X X X X X x86 uncore intel cqm| X X X X X X x86 uncore intel cstate | X X X X X X x86 uncore intel rapl | X X X X X X x86 uncore amd | X X X X x86 uncore amd iommu| X X X X x86 uncore amd ibs | X X X X X X arm bus cci | X X X X X X arm bus ccn | X X X X -- How about simply adding a helper function to include/linux/perf_event.h that checks if _any_ of the exclude bits is set? We could then simplify the check-for-any exclude flag to: if (any_exclude_set(event)) return -EINVAL; What's your opinion? Jan --- diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f5c5a3f..0eacdba0 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -849,6 +849,18 @@ static inline bool is_sampling_event(struct perf_event *event) return event->attr.sample_period != 0; } +static inline int any_exclude_set(struct perf_event *event) +{ + if (event->attr.exclude_user|| + event->attr.exclude_kernel || + event->attr.exclude_host|| + event->attr.exclude_guest || + event->attr.exclude_hv || + event->attr.exclude_idle) + return 1; + return 0; +} + /* * Return 1 for a software event, 0 for a hardware event */
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: [...] > > +int thunder_uncore_event_init(struct perf_event *event) > > +{ > > + struct hw_perf_event *hwc = >hw; > > + struct thunder_uncore *uncore; > > + > > + if (event->attr.type != event->pmu->type) > > + return -ENOENT; > > + > > + /* we do not support sampling */ > > + if (is_sampling_event(event)) > > + return -EINVAL; > > + > > + /* counters do not have these bits */ > > + if (event->attr.exclude_user|| > > + event->attr.exclude_kernel || > > + event->attr.exclude_host|| > > + event->attr.exclude_guest || > > + event->attr.exclude_hv || > > + event->attr.exclude_idle) > > + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. > Looking at the exclude_* feature handling in pmu->event_init under arch/ shows lots of differences. Just as an example, exclude_idle returns sometimes -EINVAL, sometimes -EPERM while other archs ignore it and one silently deletes the flag. So it looks hard to me to move the exclude handling into perf core while keeping the per-arch differences. And if we don't and return an error on the perf_event_open syscall in a newer kernel for an exclude bit that was previously ignored it will be a user-space regression, right? Looking only at the uncore drivers (plus drivers/bus/arm-cc*) it looks like they all don't support any exclude bits but the handling here differs also. The table shows the current behaviour, X means the requested exclude flag is refused. user kernel hostguesthv idle - x86 uncore intel| X X X X x86 uncore intel snb| X X X X X X x86 uncore intel cqm| X X X X X X x86 uncore intel cstate | X X X X X X x86 uncore intel rapl | X X X X X X x86 uncore amd | X X X X x86 uncore amd iommu| X X X X x86 uncore amd ibs | X X X X X X arm bus cci | X X X X X X arm bus ccn | X X X X -- How about simply adding a helper function to include/linux/perf_event.h that checks if _any_ of the exclude bits is set? We could then simplify the check-for-any exclude flag to: if (any_exclude_set(event)) return -EINVAL; What's your opinion? Jan --- diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h index f5c5a3f..0eacdba0 100644 --- a/include/linux/perf_event.h +++ b/include/linux/perf_event.h @@ -849,6 +849,18 @@ static inline bool is_sampling_event(struct perf_event *event) return event->attr.sample_period != 0; } +static inline int any_exclude_set(struct perf_event *event) +{ + if (event->attr.exclude_user|| + event->attr.exclude_kernel || + event->attr.exclude_host|| + event->attr.exclude_guest || + event->attr.exclude_hv || + event->attr.exclude_idle) + return 1; + return 0; +} + /* * Return 1 for a software event, 0 for a hardware event */
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? > > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Thinking twice about the multi-node systems I would like to change the implementation to not merge counters across nodes. There might be value in having the counters per node and also performance wise it would be better to merge counters only on the local node. I'll introduce a node sysfs attribute that can be combined with the event and fill the node with the numa id. [...] > > Otherwise, are they associated with some power domain? No, power domains are not used for the "uncore" related devices. These devices are currently always on. Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? > > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Thinking twice about the multi-node systems I would like to change the implementation to not merge counters across nodes. There might be value in having the counters per node and also performance wise it would be better to merge counters only on the local node. I'll introduce a node sysfs attribute that can be combined with the event and fill the node with the numa id. [...] > > Otherwise, are they associated with some power domain? No, power domains are not used for the "uncore" related devices. These devices are currently always on. Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote: > > > > 1) The PMU detection solely relies on PCI device detection. If a > > > >matching PCI device is found the PMU is created. The code can deal > > > >with multiple units of the same type, e.g. more than one memory > > > >controller. > > > > > > I see below that the driver has an initcall that runs regardless of > > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > > string PCI device detection. > > > > > > Why is this not a true PCI driver that only gets probed if the PCI > > > device exists? > > > > It is not a PCI driver because there are already drivers like edac that > > will access these PCI devices. The uncore driver only accesses the > > performance counters, which are not used by the other drivers. > > Several drivers are accessing the same device? > > That sounds somewhat scary. I've double checked that the edac drivers do not access the performance counters at all. It felt cleaner to me to keep the performance counter code seperated from edac. Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote: > > > > 1) The PMU detection solely relies on PCI device detection. If a > > > >matching PCI device is found the PMU is created. The code can deal > > > >with multiple units of the same type, e.g. more than one memory > > > >controller. > > > > > > I see below that the driver has an initcall that runs regardless of > > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > > string PCI device detection. > > > > > > Why is this not a true PCI driver that only gets probed if the PCI > > > device exists? > > > > It is not a PCI driver because there are already drivers like edac that > > will access these PCI devices. The uncore driver only accesses the > > performance counters, which are not used by the other drivers. > > Several drivers are accessing the same device? > > That sounds somewhat scary. I've double checked that the edac drivers do not access the performance counters at all. It felt cleaner to me to keep the performance counter code seperated from edac. Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote: > > > > + uncore = event_to_thunder_uncore(event); > > > > + if (!uncore) > > > > + return -ENODEV; > > > > + if (!uncore->event_valid(event->attr.config)) > > > > + return -EINVAL; > > > > + > > > > + hwc->config = event->attr.config; > > > > + hwc->idx = -1; > > > > + > > > > + /* and we don't care about CPU */ > > > > > > Actually, you do. You want the perf core to serialize accesses via the > > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > > there are a tonne of problems you don't even want to think about. > > > > I found that perf added the events on every CPU in the system. Because > > the uncore events are not CPU related I wanted to avoid this. Setting > > cpumask to -1 did not work. Therefore I added a single CPU in the > > cpumask, see thunder_uncore_attr_show_cpumask(). > > I understand that, which is why I wrote: > > > > You _must_ ensure this kernel-side, regardless of what the perf tool > > > happens to do. > > > > > > See the arm-cci and arm-ccn drivers for an example. > > Take a look at drivers/bus/arm-cci.c; specifically, what we do in > cci_pmu_event_init and cci_pmu_cpu_notifier. > > This is the same thing that's done for x86 system PMUs. Take a look at > uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. I note that we still have an open TODO rather than a call to perf_pmu_migrate_context. The better example is arm_ccn_pmu_cpu_notifier in drivers/bus/arm-ccn.c. Mark.
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 02:27:27PM +, Mark Rutland wrote: > > > > + uncore = event_to_thunder_uncore(event); > > > > + if (!uncore) > > > > + return -ENODEV; > > > > + if (!uncore->event_valid(event->attr.config)) > > > > + return -EINVAL; > > > > + > > > > + hwc->config = event->attr.config; > > > > + hwc->idx = -1; > > > > + > > > > + /* and we don't care about CPU */ > > > > > > Actually, you do. You want the perf core to serialize accesses via the > > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > > there are a tonne of problems you don't even want to think about. > > > > I found that perf added the events on every CPU in the system. Because > > the uncore events are not CPU related I wanted to avoid this. Setting > > cpumask to -1 did not work. Therefore I added a single CPU in the > > cpumask, see thunder_uncore_attr_show_cpumask(). > > I understand that, which is why I wrote: > > > > You _must_ ensure this kernel-side, regardless of what the perf tool > > > happens to do. > > > > > > See the arm-cci and arm-ccn drivers for an example. > > Take a look at drivers/bus/arm-cci.c; specifically, what we do in > cci_pmu_event_init and cci_pmu_cpu_notifier. > > This is the same thing that's done for x86 system PMUs. Take a look at > uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. I note that we still have an open TODO rather than a call to perf_pmu_migrate_context. The better example is arm_ccn_pmu_cpu_notifier in drivers/bus/arm-ccn.c. Mark.
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
Hi Mark, thanks for reviewing! I'll need several mails to address all questions. On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? Neither cluster nor socket, so I would say system. Where a system may consist of 2 nodes that mostly appear as one system. > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Yes. > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > inaccessible? No, they are not related to CPUs. [...] > > > > 1) The PMU detection solely relies on PCI device detection. If a > >matching PCI device is found the PMU is created. The code can deal > >with multiple units of the same type, e.g. more than one memory > >controller. > > I see below that the driver has an initcall that runs regardless of > whether the PCI device exists, and looks at the MIDR. That's clearly not > string PCI device detection. > > Why is this not a true PCI driver that only gets probed if the PCI > device exists? It is not a PCI driver because there are already drivers like edac that will access these PCI devices. The uncore driver only accesses the performance counters, which are not used by the other drivers. [...] > > +#include > > +#include > > I don't see why you should need these two if this is truly an uncore > device probed solely from PCI. There are several passes of the hardware that have the same PCI device ID. Therefore I need the CPU variant to distinguish them. This could be done _after_ the PCI device is found but I found it easier to implement the check once in the common setup function. [...] > > +int thunder_uncore_event_init(struct perf_event *event) > > +{ > > + struct hw_perf_event *hwc = >hw; > > + struct thunder_uncore *uncore; > > + > > + if (event->attr.type != event->pmu->type) > > + return -ENOENT; > > + > > + /* we do not support sampling */ > > + if (is_sampling_event(event)) > > + return -EINVAL; > > + > > + /* counters do not have these bits */ > > + if (event->attr.exclude_user|| > > + event->attr.exclude_kernel || > > + event->attr.exclude_host|| > > + event->attr.exclude_guest || > > + event->attr.exclude_hv || > > + event->attr.exclude_idle) > > + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. Completely agreed. Also, every sample code I looked at did check for other bits... [...] > > + > > + uncore = event_to_thunder_uncore(event); > > + if (!uncore) > > + return -ENODEV; > > + if (!uncore->event_valid(event->attr.config)) > > + return -EINVAL; > > + > > + hwc->config = event->attr.config; > > + hwc->idx = -1; > > + > > + /* and we don't care about CPU */ > > Actually, you do. You want the perf core to serialize accesses via the > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > there are a tonne of problems you don't even want to think about. I found that perf added the events on every CPU in the system. Because the uncore events are not CPU related I wanted to avoid this. Setting cpumask to -1 did not work. Therefore I added a single CPU in the cpumask, see thunder_uncore_attr_show_cpumask(). > You _must_ ensure this kernel-side, regardless of what the perf tool > happens to do. > > See the arm-cci and arm-ccn drivers for an example. > > You can also follow the migration approach used there to allow you to > retain counting across a hotplug. > > [...] > > > +static int __init thunder_uncore_init(void) > > +{ > > + unsigned long implementor = read_cpuid_implementor(); > > + unsigned long part_number = read_cpuid_part_number(); > > + u32 variant; > > + > > + if (implementor != ARM_CPU_IMP_CAVIUM || > > + part_number != CAVIUM_CPU_PART_THUNDERX) > > + return -ENODEV; > > + > > + /* detect pass2 which contains different counters */ > > + variant = MIDR_VARIANT(read_cpuid_id()); > > + if (variant == 1) > > + thunder_uncore_version = 1; > > + pr_info("PMU version: %d\n", thunder_uncore_version); > > + > > + return 0; > > +} > > You should call out these differences in the commmit message. OK > Mark. Thanks, Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
Hi Mark, thanks for reviewing! I'll need several mails to address all questions. On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > Provide uncore facilities for non-CPU performance counter units. > > Based on Intel/AMD uncore pmu support. > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > All counters are exported via sysfs in the corresponding events > > files under the PMU directory so the perf tool can list the event names. > > It turns out that "uncore" covers quite a lot of things. > > Where exactly do the see counters live? system, socket, cluster? Neither cluster nor socket, so I would say system. Where a system may consist of 2 nodes that mostly appear as one system. > Are there potentially multiple instances of a given PMU in the system? > e.g. might each clutster have an instance of an L2 PMU? Yes. > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > inaccessible? No, they are not related to CPUs. [...] > > > > 1) The PMU detection solely relies on PCI device detection. If a > >matching PCI device is found the PMU is created. The code can deal > >with multiple units of the same type, e.g. more than one memory > >controller. > > I see below that the driver has an initcall that runs regardless of > whether the PCI device exists, and looks at the MIDR. That's clearly not > string PCI device detection. > > Why is this not a true PCI driver that only gets probed if the PCI > device exists? It is not a PCI driver because there are already drivers like edac that will access these PCI devices. The uncore driver only accesses the performance counters, which are not used by the other drivers. [...] > > +#include > > +#include > > I don't see why you should need these two if this is truly an uncore > device probed solely from PCI. There are several passes of the hardware that have the same PCI device ID. Therefore I need the CPU variant to distinguish them. This could be done _after_ the PCI device is found but I found it easier to implement the check once in the common setup function. [...] > > +int thunder_uncore_event_init(struct perf_event *event) > > +{ > > + struct hw_perf_event *hwc = >hw; > > + struct thunder_uncore *uncore; > > + > > + if (event->attr.type != event->pmu->type) > > + return -ENOENT; > > + > > + /* we do not support sampling */ > > + if (is_sampling_event(event)) > > + return -EINVAL; > > + > > + /* counters do not have these bits */ > > + if (event->attr.exclude_user|| > > + event->attr.exclude_kernel || > > + event->attr.exclude_host|| > > + event->attr.exclude_guest || > > + event->attr.exclude_hv || > > + event->attr.exclude_idle) > > + return -EINVAL; > > We should _really_ make these features opt-in at the core level. It's > crazy that each and every PMU drivers has to explicitly test and reject > things it doesn't support. Completely agreed. Also, every sample code I looked at did check for other bits... [...] > > + > > + uncore = event_to_thunder_uncore(event); > > + if (!uncore) > > + return -ENODEV; > > + if (!uncore->event_valid(event->attr.config)) > > + return -EINVAL; > > + > > + hwc->config = event->attr.config; > > + hwc->idx = -1; > > + > > + /* and we don't care about CPU */ > > Actually, you do. You want the perf core to serialize accesses via the > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > there are a tonne of problems you don't even want to think about. I found that perf added the events on every CPU in the system. Because the uncore events are not CPU related I wanted to avoid this. Setting cpumask to -1 did not work. Therefore I added a single CPU in the cpumask, see thunder_uncore_attr_show_cpumask(). > You _must_ ensure this kernel-side, regardless of what the perf tool > happens to do. > > See the arm-cci and arm-ccn drivers for an example. > > You can also follow the migration approach used there to allow you to > retain counting across a hotplug. > > [...] > > > +static int __init thunder_uncore_init(void) > > +{ > > + unsigned long implementor = read_cpuid_implementor(); > > + unsigned long part_number = read_cpuid_part_number(); > > + u32 variant; > > + > > + if (implementor != ARM_CPU_IMP_CAVIUM || > > + part_number != CAVIUM_CPU_PART_THUNDERX) > > + return -ENODEV; > > + > > + /* detect pass2 which contains different counters */ > > + variant = MIDR_VARIANT(read_cpuid_id()); > > + if (variant == 1) > > + thunder_uncore_version = 1; > > + pr_info("PMU version: %d\n", thunder_uncore_version); > > + > > + return 0; > > +} > > You should call out these differences in the commmit message. OK > Mark. Thanks, Jan
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 03:07:20PM +0100, Jan Glauber wrote: > Hi Mark, > > thanks for reviewing! I'll need several mails to address all questions. > > On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > > Provide uncore facilities for non-CPU performance counter units. > > > Based on Intel/AMD uncore pmu support. > > > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > > All counters are exported via sysfs in the corresponding events > > > files under the PMU directory so the perf tool can list the event names. > > > > It turns out that "uncore" covers quite a lot of things. > > > > Where exactly do the see counters live? system, socket, cluster? > > Neither cluster nor socket, so I would say system. Where a system may > consist of 2 nodes that mostly appear as one system. > > > Are there potentially multiple instances of a given PMU in the system? > > e.g. might each clutster have an instance of an L2 PMU? > > Yes. > > > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > > inaccessible? > > No, they are not related to CPUs. Ok. So I should be able to concurrently hotplug random CPUs on/off while this driver is running, without issues? No registers might be clock-gated or similar? I appreciate that they are not "related" to particular CPUs as such. > > > 1) The PMU detection solely relies on PCI device detection. If a > > >matching PCI device is found the PMU is created. The code can deal > > >with multiple units of the same type, e.g. more than one memory > > >controller. > > > > I see below that the driver has an initcall that runs regardless of > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > string PCI device detection. > > > > Why is this not a true PCI driver that only gets probed if the PCI > > device exists? > > It is not a PCI driver because there are already drivers like edac that > will access these PCI devices. The uncore driver only accesses the > performance counters, which are not used by the other drivers. Several drivers are accessing the same device? That sounds somewhat scary. > > > +#include > > > +#include > > > > I don't see why you should need these two if this is truly an uncore > > device probed solely from PCI. > > There are several passes of the hardware that have the same PCI device > ID. Therefore I need the CPU variant to distinguish them. This could > be done _after_ the PCI device is found but I found it easier to > implement the check once in the common setup function. Ok. Please call that out in the commit message. > > > +int thunder_uncore_event_init(struct perf_event *event) > > > +{ > > > + struct hw_perf_event *hwc = >hw; > > > + struct thunder_uncore *uncore; > > > + > > > + if (event->attr.type != event->pmu->type) > > > + return -ENOENT; > > > + > > > + /* we do not support sampling */ > > > + if (is_sampling_event(event)) > > > + return -EINVAL; > > > + > > > + /* counters do not have these bits */ > > > + if (event->attr.exclude_user|| > > > + event->attr.exclude_kernel || > > > + event->attr.exclude_host|| > > > + event->attr.exclude_guest || > > > + event->attr.exclude_hv || > > > + event->attr.exclude_idle) > > > + return -EINVAL; > > > > We should _really_ make these features opt-in at the core level. It's > > crazy that each and every PMU drivers has to explicitly test and reject > > things it doesn't support. > > Completely agreed. Also, every sample code I looked at did > check for other bits... > > [...] > > > > + > > > + uncore = event_to_thunder_uncore(event); > > > + if (!uncore) > > > + return -ENODEV; > > > + if (!uncore->event_valid(event->attr.config)) > > > + return -EINVAL; > > > + > > > + hwc->config = event->attr.config; > > > + hwc->idx = -1; > > > + > > > + /* and we don't care about CPU */ > > > > Actually, you do. You want the perf core to serialize accesses via the > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > there are a tonne of problems you don't even want to think about. > > I found that perf added the events on every CPU in the system. Because > the uncore events are not CPU related I wanted to avoid this. Setting > cpumask to -1 did not work. Therefore I added a single CPU in the > cpumask, see thunder_uncore_attr_show_cpumask(). I understand that, which is why I wrote: > > You _must_ ensure this kernel-side, regardless of what the perf tool > > happens to do. > > > > See the arm-cci and arm-ccn drivers for an example. Take a look at drivers/bus/arm-cci.c; specifically, what we do in cci_pmu_event_init and cci_pmu_cpu_notifier. This is the same thing that's done for x86 system PMUs. Take a look at uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. Otherwise there are a number of situations where
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Mon, Feb 15, 2016 at 03:07:20PM +0100, Jan Glauber wrote: > Hi Mark, > > thanks for reviewing! I'll need several mails to address all questions. > > On Fri, Feb 12, 2016 at 05:36:59PM +, Mark Rutland wrote: > > On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > > > Provide uncore facilities for non-CPU performance counter units. > > > Based on Intel/AMD uncore pmu support. > > > > > > The uncore PMUs can be found under /sys/bus/event_source/devices. > > > All counters are exported via sysfs in the corresponding events > > > files under the PMU directory so the perf tool can list the event names. > > > > It turns out that "uncore" covers quite a lot of things. > > > > Where exactly do the see counters live? system, socket, cluster? > > Neither cluster nor socket, so I would say system. Where a system may > consist of 2 nodes that mostly appear as one system. > > > Are there potentially multiple instances of a given PMU in the system? > > e.g. might each clutster have an instance of an L2 PMU? > > Yes. > > > If I turn off a set of CPUs, do any "uncore" PMUs lost context or become > > inaccessible? > > No, they are not related to CPUs. Ok. So I should be able to concurrently hotplug random CPUs on/off while this driver is running, without issues? No registers might be clock-gated or similar? I appreciate that they are not "related" to particular CPUs as such. > > > 1) The PMU detection solely relies on PCI device detection. If a > > >matching PCI device is found the PMU is created. The code can deal > > >with multiple units of the same type, e.g. more than one memory > > >controller. > > > > I see below that the driver has an initcall that runs regardless of > > whether the PCI device exists, and looks at the MIDR. That's clearly not > > string PCI device detection. > > > > Why is this not a true PCI driver that only gets probed if the PCI > > device exists? > > It is not a PCI driver because there are already drivers like edac that > will access these PCI devices. The uncore driver only accesses the > performance counters, which are not used by the other drivers. Several drivers are accessing the same device? That sounds somewhat scary. > > > +#include > > > +#include > > > > I don't see why you should need these two if this is truly an uncore > > device probed solely from PCI. > > There are several passes of the hardware that have the same PCI device > ID. Therefore I need the CPU variant to distinguish them. This could > be done _after_ the PCI device is found but I found it easier to > implement the check once in the common setup function. Ok. Please call that out in the commit message. > > > +int thunder_uncore_event_init(struct perf_event *event) > > > +{ > > > + struct hw_perf_event *hwc = >hw; > > > + struct thunder_uncore *uncore; > > > + > > > + if (event->attr.type != event->pmu->type) > > > + return -ENOENT; > > > + > > > + /* we do not support sampling */ > > > + if (is_sampling_event(event)) > > > + return -EINVAL; > > > + > > > + /* counters do not have these bits */ > > > + if (event->attr.exclude_user|| > > > + event->attr.exclude_kernel || > > > + event->attr.exclude_host|| > > > + event->attr.exclude_guest || > > > + event->attr.exclude_hv || > > > + event->attr.exclude_idle) > > > + return -EINVAL; > > > > We should _really_ make these features opt-in at the core level. It's > > crazy that each and every PMU drivers has to explicitly test and reject > > things it doesn't support. > > Completely agreed. Also, every sample code I looked at did > check for other bits... > > [...] > > > > + > > > + uncore = event_to_thunder_uncore(event); > > > + if (!uncore) > > > + return -ENODEV; > > > + if (!uncore->event_valid(event->attr.config)) > > > + return -EINVAL; > > > + > > > + hwc->config = event->attr.config; > > > + hwc->idx = -1; > > > + > > > + /* and we don't care about CPU */ > > > > Actually, you do. You want the perf core to serialize accesses via the > > same CPU, so all events _must_ be targetted at the same CPU. Otherwise > > there are a tonne of problems you don't even want to think about. > > I found that perf added the events on every CPU in the system. Because > the uncore events are not CPU related I wanted to avoid this. Setting > cpumask to -1 did not work. Therefore I added a single CPU in the > cpumask, see thunder_uncore_attr_show_cpumask(). I understand that, which is why I wrote: > > You _must_ ensure this kernel-side, regardless of what the perf tool > > happens to do. > > > > See the arm-cci and arm-ccn drivers for an example. Take a look at drivers/bus/arm-cci.c; specifically, what we do in cci_pmu_event_init and cci_pmu_cpu_notifier. This is the same thing that's done for x86 system PMUs. Take a look at uncore_pmu_event_init in arch/x86/kernel/cpu/perf_event_intel_uncore.c. Otherwise there are a number of situations where
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:47:25PM -0800, David Daney wrote: > On 02/12/2016 09:36 AM, Mark Rutland wrote: > >On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > [...] > >>2) Counters are summarized across the different units of the same type, > >>e.g. L2C TAD 0..7 is presented as a single counter (adding the > >>values from TAD 0 to 7). Although losing the ability to read a > >>single value the merged values are easier to use and yield > >>enough information. > > > >I'm not sure I follow this. What is easier? What are you doing, and what > >are you comparing that with to say that your approach is easier? > > > >It sounds like it should be possible to handle multiple counters like > >this, so I don't follow why you want to amalgamate them in-kernel. > > > > The values of the individual counters are close to meaningless. The > only thing that is meaningful to someone reading the counters is the > aggregate sum of all the counts. I obviously know nowhere near enough about your system to say with certainty, but it may turn out that being able to track counters individually is useful for some profiling/debugging scenario. How meaningful the individual counts are really depends on what you're trying to figure out. If you believe that aggregate values are sufficient, then I'm happy to leave that as-is. > >>+void thunder_uncore_read(struct perf_event *event) > >>+{ > >>+ struct thunder_uncore *uncore = event_to_thunder_uncore(event); > >>+ struct hw_perf_event *hwc = >hw; > >>+ u64 prev, new = 0; > >>+ s64 delta; > >>+ int i; > >>+ > >>+ /* > >>+* since we do not enable counter overflow interrupts, > >>+* we do not have to worry about prev_count changing on us > >>+*/ > > > >Without overflow interrupts, how do you ensure that you account for > >overflow in a reasonable time window (i.e. before the counter runs past > >its initial value)? > > Two reasons: > > 1) There are no interrupts. > > 2) The counters are 64-bit, they never overflow. Ok. Please point this out in the comment so that reviewers aren't misled. Stating that we don't enable an interrupt implies that said interrupt exists. > >>+ /* and we do not enable counter overflow interrupts */ > > > >That statement raises far more questions than it answers. > > > >_why_ do we not user overflow interrupts? > > As stated above, there are *no* overflow interrupts. Ok. As stated above, please fix this comment to not mislead. > The events we are counting cannot be attributed to any one (or even > any) CPU, they run asynchronous to the CPU, so even if there were > interrupts, they would be meaningless. Yes, they are meaningless w.r.t. the state of an arbitrary CPU. Were they to exist you could use them to drive other snapshotting of the state of the uncore PMU, to get an idea of the frequency/stability of events over time, etc. Userspace might then decide to snapshot other whole system state based on events fed to it. That's moot if they don't exist. Mark.
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:47:25PM -0800, David Daney wrote: > On 02/12/2016 09:36 AM, Mark Rutland wrote: > >On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > [...] > >>2) Counters are summarized across the different units of the same type, > >>e.g. L2C TAD 0..7 is presented as a single counter (adding the > >>values from TAD 0 to 7). Although losing the ability to read a > >>single value the merged values are easier to use and yield > >>enough information. > > > >I'm not sure I follow this. What is easier? What are you doing, and what > >are you comparing that with to say that your approach is easier? > > > >It sounds like it should be possible to handle multiple counters like > >this, so I don't follow why you want to amalgamate them in-kernel. > > > > The values of the individual counters are close to meaningless. The > only thing that is meaningful to someone reading the counters is the > aggregate sum of all the counts. I obviously know nowhere near enough about your system to say with certainty, but it may turn out that being able to track counters individually is useful for some profiling/debugging scenario. How meaningful the individual counts are really depends on what you're trying to figure out. If you believe that aggregate values are sufficient, then I'm happy to leave that as-is. > >>+void thunder_uncore_read(struct perf_event *event) > >>+{ > >>+ struct thunder_uncore *uncore = event_to_thunder_uncore(event); > >>+ struct hw_perf_event *hwc = >hw; > >>+ u64 prev, new = 0; > >>+ s64 delta; > >>+ int i; > >>+ > >>+ /* > >>+* since we do not enable counter overflow interrupts, > >>+* we do not have to worry about prev_count changing on us > >>+*/ > > > >Without overflow interrupts, how do you ensure that you account for > >overflow in a reasonable time window (i.e. before the counter runs past > >its initial value)? > > Two reasons: > > 1) There are no interrupts. > > 2) The counters are 64-bit, they never overflow. Ok. Please point this out in the comment so that reviewers aren't misled. Stating that we don't enable an interrupt implies that said interrupt exists. > >>+ /* and we do not enable counter overflow interrupts */ > > > >That statement raises far more questions than it answers. > > > >_why_ do we not user overflow interrupts? > > As stated above, there are *no* overflow interrupts. Ok. As stated above, please fix this comment to not mislead. > The events we are counting cannot be attributed to any one (or even > any) CPU, they run asynchronous to the CPU, so even if there were > interrupts, they would be meaningless. Yes, they are meaningless w.r.t. the state of an arbitrary CPU. Were they to exist you could use them to drive other snapshotting of the state of the uncore PMU, to get an idea of the frequency/stability of events over time, etc. Userspace might then decide to snapshot other whole system state based on events fed to it. That's moot if they don't exist. Mark.
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On 02/12/2016 09:36 AM, Mark Rutland wrote: On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: [...] 2) Counters are summarized across the different units of the same type, e.g. L2C TAD 0..7 is presented as a single counter (adding the values from TAD 0 to 7). Although losing the ability to read a single value the merged values are easier to use and yield enough information. I'm not sure I follow this. What is easier? What are you doing, and what are you comparing that with to say that your approach is easier? It sounds like it should be possible to handle multiple counters like this, so I don't follow why you want to amalgamate them in-kernel. The values of the individual counters are close to meaningless. The only thing that is meaningful to someone reading the counters is the aggregate sum of all the counts. [...] +#include +#include I don't see why you should need these two if this is truly an uncore device probed solely from PCI. +void thunder_uncore_read(struct perf_event *event) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + u64 prev, new = 0; + s64 delta; + int i; + + /* +* since we do not enable counter overflow interrupts, +* we do not have to worry about prev_count changing on us +*/ Without overflow interrupts, how do you ensure that you account for overflow in a reasonable time window (i.e. before the counter runs past its initial value)? Two reasons: 1) There are no interrupts. 2) The counters are 64-bit, they never overflow. + + prev = local64_read(>prev_count); + + /* read counter values from all units */ + for (i = 0; i < uncore->nr_units; i++) + new += readq(map_offset(hwc->event_base, uncore, i)); There's no bit to determine whether an overflow occurred? No. + + local64_set(>prev_count, new); + delta = new - prev; + local64_add(delta, >count); +} + +void thunder_uncore_del(struct perf_event *event, int flags) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + int i; + + event->pmu->stop(event, PERF_EF_UPDATE); + + for (i = 0; i < uncore->num_counters; i++) { + if (cmpxchg(>events[i], event, NULL) == event) + break; + } + hwc->idx = -1; +} Why not just place the event at uncode->events[hwc->idx] ? Theat way removing the event is trivial. +int thunder_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = >hw; + struct thunder_uncore *uncore; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* we do not support sampling */ + if (is_sampling_event(event)) + return -EINVAL; + + /* counters do not have these bits */ + if (event->attr.exclude_user || + event->attr.exclude_kernel || + event->attr.exclude_host || + event->attr.exclude_guest|| + event->attr.exclude_hv || + event->attr.exclude_idle) + return -EINVAL; We should _really_ make these features opt-in at the core level. It's crazy that each and every PMU drivers has to explicitly test and reject things it doesn't support. + + /* and we do not enable counter overflow interrupts */ That statement raises far more questions than it answers. _why_ do we not user overflow interrupts? As stated above, there are *no* overflow interrupts. The events we are counting cannot be attributed to any one (or even any) CPU, they run asynchronous to the CPU, so even if there were interrupts, they would be meaningless. David Daney
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > Provide uncore facilities for non-CPU performance counter units. > Based on Intel/AMD uncore pmu support. > > The uncore PMUs can be found under /sys/bus/event_source/devices. > All counters are exported via sysfs in the corresponding events > files under the PMU directory so the perf tool can list the event names. It turns out that "uncore" covers quite a lot of things. Where exactly do the see counters live? system, socket, cluster? Are there potentially multiple instances of a given PMU in the system? e.g. might each clutster have an instance of an L2 PMU? If I turn off a set of CPUs, do any "uncore" PMUs lost context or become inaccessible? Otherwise, are they associated with some power domain? > There are 2 points that are special in this implementation: > > 1) The PMU detection solely relies on PCI device detection. If a >matching PCI device is found the PMU is created. The code can deal >with multiple units of the same type, e.g. more than one memory >controller. I see below that the driver has an initcall that runs regardless of whether the PCI device exists, and looks at the MIDR. That's clearly not string PCI device detection. Why is this not a true PCI driver that only gets probed if the PCI device exists? > 2) Counters are summarized across the different units of the same type, >e.g. L2C TAD 0..7 is presented as a single counter (adding the >values from TAD 0 to 7). Although losing the ability to read a >single value the merged values are easier to use and yield >enough information. I'm not sure I follow this. What is easier? What are you doing, and what are you comparing that with to say that your approach is easier? It sounds like it should be possible to handle multiple counters like this, so I don't follow why you want to amalgamate them in-kernel. [...] > +#include > +#include I don't see why you should need these two if this is truly an uncore device probed solely from PCI. > +void thunder_uncore_read(struct perf_event *event) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = >hw; > + u64 prev, new = 0; > + s64 delta; > + int i; > + > + /* > + * since we do not enable counter overflow interrupts, > + * we do not have to worry about prev_count changing on us > + */ Without overflow interrupts, how do you ensure that you account for overflow in a reasonable time window (i.e. before the counter runs past its initial value)? > + > + prev = local64_read(>prev_count); > + > + /* read counter values from all units */ > + for (i = 0; i < uncore->nr_units; i++) > + new += readq(map_offset(hwc->event_base, uncore, i)); There's no bit to determine whether an overflow occurred? > + > + local64_set(>prev_count, new); > + delta = new - prev; > + local64_add(delta, >count); > +} > + > +void thunder_uncore_del(struct perf_event *event, int flags) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = >hw; > + int i; > + > + event->pmu->stop(event, PERF_EF_UPDATE); > + > + for (i = 0; i < uncore->num_counters; i++) { > + if (cmpxchg(>events[i], event, NULL) == event) > + break; > + } > + hwc->idx = -1; > +} Why not just place the event at uncode->events[hwc->idx] ? Theat way removing the event is trivial. > +int thunder_uncore_event_init(struct perf_event *event) > +{ > + struct hw_perf_event *hwc = >hw; > + struct thunder_uncore *uncore; > + > + if (event->attr.type != event->pmu->type) > + return -ENOENT; > + > + /* we do not support sampling */ > + if (is_sampling_event(event)) > + return -EINVAL; > + > + /* counters do not have these bits */ > + if (event->attr.exclude_user|| > + event->attr.exclude_kernel || > + event->attr.exclude_host|| > + event->attr.exclude_guest || > + event->attr.exclude_hv || > + event->attr.exclude_idle) > + return -EINVAL; We should _really_ make these features opt-in at the core level. It's crazy that each and every PMU drivers has to explicitly test and reject things it doesn't support. > + > + /* and we do not enable counter overflow interrupts */ That statement raises far more questions than it answers. _why_ do we not user overflow interrupts? > + > + uncore = event_to_thunder_uncore(event); > + if (!uncore) > + return -ENODEV; > + if (!uncore->event_valid(event->attr.config)) > + return -EINVAL; > + > + hwc->config = event->attr.config; > + hwc->idx = -1; > + > + /* and we don't care about CPU */ Actually, you do. You want the perf core to serialize accesses via the same CPU, so all events _must_ be targetted at the same CPU. Otherwise there
[RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
Provide uncore facilities for non-CPU performance counter units. Based on Intel/AMD uncore pmu support. The uncore PMUs can be found under /sys/bus/event_source/devices. All counters are exported via sysfs in the corresponding events files under the PMU directory so the perf tool can list the event names. There are 2 points that are special in this implementation: 1) The PMU detection solely relies on PCI device detection. If a matching PCI device is found the PMU is created. The code can deal with multiple units of the same type, e.g. more than one memory controller. 2) Counters are summarized across the different units of the same type, e.g. L2C TAD 0..7 is presented as a single counter (adding the values from TAD 0 to 7). Although losing the ability to read a single value the merged values are easier to use and yield enough information. Signed-off-by: Jan Glauber --- arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/uncore/Makefile| 1 + arch/arm64/kernel/uncore/uncore_cavium.c | 210 +++ arch/arm64/kernel/uncore/uncore_cavium.h | 73 +++ 4 files changed, 285 insertions(+) create mode 100644 arch/arm64/kernel/uncore/Makefile create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.c create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.h diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 83cd7e6..c2d2810 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -42,6 +42,7 @@ arm64-obj-$(CONFIG_PCI) += pci.o arm64-obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o arm64-obj-$(CONFIG_ACPI) += acpi.o arm64-obj-$(CONFIG_PARAVIRT) += paravirt.o +arm64-obj-$(CONFIG_ARCH_THUNDER) += uncore/ obj-y += $(arm64-obj-y) vdso/ obj-m += $(arm64-obj-m) diff --git a/arch/arm64/kernel/uncore/Makefile b/arch/arm64/kernel/uncore/Makefile new file mode 100644 index 000..b9c72c2 --- /dev/null +++ b/arch/arm64/kernel/uncore/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARCH_THUNDER) += uncore_cavium.o diff --git a/arch/arm64/kernel/uncore/uncore_cavium.c b/arch/arm64/kernel/uncore/uncore_cavium.c new file mode 100644 index 000..0cfcc83 --- /dev/null +++ b/arch/arm64/kernel/uncore/uncore_cavium.c @@ -0,0 +1,210 @@ +/* + * Cavium Thunder uncore PMU support. Derived from Intel and AMD uncore code. + * + * Copyright (C) 2015,2016 Cavium Inc. + * Author: Jan Glauber + */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "uncore_cavium.h" + +int thunder_uncore_version; + +struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event) +{ + return NULL; +} + +void thunder_uncore_read(struct perf_event *event) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + u64 prev, new = 0; + s64 delta; + int i; + + /* +* since we do not enable counter overflow interrupts, +* we do not have to worry about prev_count changing on us +*/ + + prev = local64_read(>prev_count); + + /* read counter values from all units */ + for (i = 0; i < uncore->nr_units; i++) + new += readq(map_offset(hwc->event_base, uncore, i)); + + local64_set(>prev_count, new); + delta = new - prev; + local64_add(delta, >count); +} + +void thunder_uncore_del(struct perf_event *event, int flags) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + int i; + + event->pmu->stop(event, PERF_EF_UPDATE); + + for (i = 0; i < uncore->num_counters; i++) { + if (cmpxchg(>events[i], event, NULL) == event) + break; + } + hwc->idx = -1; +} + +int thunder_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = >hw; + struct thunder_uncore *uncore; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* we do not support sampling */ + if (is_sampling_event(event)) + return -EINVAL; + + /* counters do not have these bits */ + if (event->attr.exclude_user|| + event->attr.exclude_kernel || + event->attr.exclude_host|| + event->attr.exclude_guest || + event->attr.exclude_hv || + event->attr.exclude_idle) + return -EINVAL; + + /* and we do not enable counter overflow interrupts */ + + uncore = event_to_thunder_uncore(event); + if (!uncore) + return -ENODEV; + if (!uncore->event_valid(event->attr.config)) + return -EINVAL; + + hwc->config = event->attr.config; + hwc->idx = -1; + + /* and we don't care about
[RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
Provide uncore facilities for non-CPU performance counter units. Based on Intel/AMD uncore pmu support. The uncore PMUs can be found under /sys/bus/event_source/devices. All counters are exported via sysfs in the corresponding events files under the PMU directory so the perf tool can list the event names. There are 2 points that are special in this implementation: 1) The PMU detection solely relies on PCI device detection. If a matching PCI device is found the PMU is created. The code can deal with multiple units of the same type, e.g. more than one memory controller. 2) Counters are summarized across the different units of the same type, e.g. L2C TAD 0..7 is presented as a single counter (adding the values from TAD 0 to 7). Although losing the ability to read a single value the merged values are easier to use and yield enough information. Signed-off-by: Jan Glauber--- arch/arm64/kernel/Makefile | 1 + arch/arm64/kernel/uncore/Makefile| 1 + arch/arm64/kernel/uncore/uncore_cavium.c | 210 +++ arch/arm64/kernel/uncore/uncore_cavium.h | 73 +++ 4 files changed, 285 insertions(+) create mode 100644 arch/arm64/kernel/uncore/Makefile create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.c create mode 100644 arch/arm64/kernel/uncore/uncore_cavium.h diff --git a/arch/arm64/kernel/Makefile b/arch/arm64/kernel/Makefile index 83cd7e6..c2d2810 100644 --- a/arch/arm64/kernel/Makefile +++ b/arch/arm64/kernel/Makefile @@ -42,6 +42,7 @@ arm64-obj-$(CONFIG_PCI) += pci.o arm64-obj-$(CONFIG_ARMV8_DEPRECATED) += armv8_deprecated.o arm64-obj-$(CONFIG_ACPI) += acpi.o arm64-obj-$(CONFIG_PARAVIRT) += paravirt.o +arm64-obj-$(CONFIG_ARCH_THUNDER) += uncore/ obj-y += $(arm64-obj-y) vdso/ obj-m += $(arm64-obj-m) diff --git a/arch/arm64/kernel/uncore/Makefile b/arch/arm64/kernel/uncore/Makefile new file mode 100644 index 000..b9c72c2 --- /dev/null +++ b/arch/arm64/kernel/uncore/Makefile @@ -0,0 +1 @@ +obj-$(CONFIG_ARCH_THUNDER) += uncore_cavium.o diff --git a/arch/arm64/kernel/uncore/uncore_cavium.c b/arch/arm64/kernel/uncore/uncore_cavium.c new file mode 100644 index 000..0cfcc83 --- /dev/null +++ b/arch/arm64/kernel/uncore/uncore_cavium.c @@ -0,0 +1,210 @@ +/* + * Cavium Thunder uncore PMU support. Derived from Intel and AMD uncore code. + * + * Copyright (C) 2015,2016 Cavium Inc. + * Author: Jan Glauber + */ + +#include +#include +#include +#include +#include +#include +#include + +#include +#include + +#include "uncore_cavium.h" + +int thunder_uncore_version; + +struct thunder_uncore *event_to_thunder_uncore(struct perf_event *event) +{ + return NULL; +} + +void thunder_uncore_read(struct perf_event *event) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + u64 prev, new = 0; + s64 delta; + int i; + + /* +* since we do not enable counter overflow interrupts, +* we do not have to worry about prev_count changing on us +*/ + + prev = local64_read(>prev_count); + + /* read counter values from all units */ + for (i = 0; i < uncore->nr_units; i++) + new += readq(map_offset(hwc->event_base, uncore, i)); + + local64_set(>prev_count, new); + delta = new - prev; + local64_add(delta, >count); +} + +void thunder_uncore_del(struct perf_event *event, int flags) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + int i; + + event->pmu->stop(event, PERF_EF_UPDATE); + + for (i = 0; i < uncore->num_counters; i++) { + if (cmpxchg(>events[i], event, NULL) == event) + break; + } + hwc->idx = -1; +} + +int thunder_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = >hw; + struct thunder_uncore *uncore; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* we do not support sampling */ + if (is_sampling_event(event)) + return -EINVAL; + + /* counters do not have these bits */ + if (event->attr.exclude_user|| + event->attr.exclude_kernel || + event->attr.exclude_host|| + event->attr.exclude_guest || + event->attr.exclude_hv || + event->attr.exclude_idle) + return -EINVAL; + + /* and we do not enable counter overflow interrupts */ + + uncore = event_to_thunder_uncore(event); + if (!uncore) + return -ENODEV; + if (!uncore->event_valid(event->attr.config)) + return -EINVAL; + + hwc->config = event->attr.config; +
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: > Provide uncore facilities for non-CPU performance counter units. > Based on Intel/AMD uncore pmu support. > > The uncore PMUs can be found under /sys/bus/event_source/devices. > All counters are exported via sysfs in the corresponding events > files under the PMU directory so the perf tool can list the event names. It turns out that "uncore" covers quite a lot of things. Where exactly do the see counters live? system, socket, cluster? Are there potentially multiple instances of a given PMU in the system? e.g. might each clutster have an instance of an L2 PMU? If I turn off a set of CPUs, do any "uncore" PMUs lost context or become inaccessible? Otherwise, are they associated with some power domain? > There are 2 points that are special in this implementation: > > 1) The PMU detection solely relies on PCI device detection. If a >matching PCI device is found the PMU is created. The code can deal >with multiple units of the same type, e.g. more than one memory >controller. I see below that the driver has an initcall that runs regardless of whether the PCI device exists, and looks at the MIDR. That's clearly not string PCI device detection. Why is this not a true PCI driver that only gets probed if the PCI device exists? > 2) Counters are summarized across the different units of the same type, >e.g. L2C TAD 0..7 is presented as a single counter (adding the >values from TAD 0 to 7). Although losing the ability to read a >single value the merged values are easier to use and yield >enough information. I'm not sure I follow this. What is easier? What are you doing, and what are you comparing that with to say that your approach is easier? It sounds like it should be possible to handle multiple counters like this, so I don't follow why you want to amalgamate them in-kernel. [...] > +#include > +#include I don't see why you should need these two if this is truly an uncore device probed solely from PCI. > +void thunder_uncore_read(struct perf_event *event) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = >hw; > + u64 prev, new = 0; > + s64 delta; > + int i; > + > + /* > + * since we do not enable counter overflow interrupts, > + * we do not have to worry about prev_count changing on us > + */ Without overflow interrupts, how do you ensure that you account for overflow in a reasonable time window (i.e. before the counter runs past its initial value)? > + > + prev = local64_read(>prev_count); > + > + /* read counter values from all units */ > + for (i = 0; i < uncore->nr_units; i++) > + new += readq(map_offset(hwc->event_base, uncore, i)); There's no bit to determine whether an overflow occurred? > + > + local64_set(>prev_count, new); > + delta = new - prev; > + local64_add(delta, >count); > +} > + > +void thunder_uncore_del(struct perf_event *event, int flags) > +{ > + struct thunder_uncore *uncore = event_to_thunder_uncore(event); > + struct hw_perf_event *hwc = >hw; > + int i; > + > + event->pmu->stop(event, PERF_EF_UPDATE); > + > + for (i = 0; i < uncore->num_counters; i++) { > + if (cmpxchg(>events[i], event, NULL) == event) > + break; > + } > + hwc->idx = -1; > +} Why not just place the event at uncode->events[hwc->idx] ? Theat way removing the event is trivial. > +int thunder_uncore_event_init(struct perf_event *event) > +{ > + struct hw_perf_event *hwc = >hw; > + struct thunder_uncore *uncore; > + > + if (event->attr.type != event->pmu->type) > + return -ENOENT; > + > + /* we do not support sampling */ > + if (is_sampling_event(event)) > + return -EINVAL; > + > + /* counters do not have these bits */ > + if (event->attr.exclude_user|| > + event->attr.exclude_kernel || > + event->attr.exclude_host|| > + event->attr.exclude_guest || > + event->attr.exclude_hv || > + event->attr.exclude_idle) > + return -EINVAL; We should _really_ make these features opt-in at the core level. It's crazy that each and every PMU drivers has to explicitly test and reject things it doesn't support. > + > + /* and we do not enable counter overflow interrupts */ That statement raises far more questions than it answers. _why_ do we not user overflow interrupts? > + > + uncore = event_to_thunder_uncore(event); > + if (!uncore) > + return -ENODEV; > + if (!uncore->event_valid(event->attr.config)) > + return -EINVAL; > + > + hwc->config = event->attr.config; > + hwc->idx = -1; > + > + /* and we don't care about CPU */ Actually, you do. You want the perf core to serialize accesses via the same CPU, so all events _must_ be targetted at the same CPU. Otherwise there
Re: [RFC PATCH 1/7] arm64/perf: Basic uncore counter support for Cavium ThunderX
On 02/12/2016 09:36 AM, Mark Rutland wrote: On Fri, Feb 12, 2016 at 05:55:06PM +0100, Jan Glauber wrote: [...] 2) Counters are summarized across the different units of the same type, e.g. L2C TAD 0..7 is presented as a single counter (adding the values from TAD 0 to 7). Although losing the ability to read a single value the merged values are easier to use and yield enough information. I'm not sure I follow this. What is easier? What are you doing, and what are you comparing that with to say that your approach is easier? It sounds like it should be possible to handle multiple counters like this, so I don't follow why you want to amalgamate them in-kernel. The values of the individual counters are close to meaningless. The only thing that is meaningful to someone reading the counters is the aggregate sum of all the counts. [...] +#include +#include I don't see why you should need these two if this is truly an uncore device probed solely from PCI. +void thunder_uncore_read(struct perf_event *event) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + u64 prev, new = 0; + s64 delta; + int i; + + /* +* since we do not enable counter overflow interrupts, +* we do not have to worry about prev_count changing on us +*/ Without overflow interrupts, how do you ensure that you account for overflow in a reasonable time window (i.e. before the counter runs past its initial value)? Two reasons: 1) There are no interrupts. 2) The counters are 64-bit, they never overflow. + + prev = local64_read(>prev_count); + + /* read counter values from all units */ + for (i = 0; i < uncore->nr_units; i++) + new += readq(map_offset(hwc->event_base, uncore, i)); There's no bit to determine whether an overflow occurred? No. + + local64_set(>prev_count, new); + delta = new - prev; + local64_add(delta, >count); +} + +void thunder_uncore_del(struct perf_event *event, int flags) +{ + struct thunder_uncore *uncore = event_to_thunder_uncore(event); + struct hw_perf_event *hwc = >hw; + int i; + + event->pmu->stop(event, PERF_EF_UPDATE); + + for (i = 0; i < uncore->num_counters; i++) { + if (cmpxchg(>events[i], event, NULL) == event) + break; + } + hwc->idx = -1; +} Why not just place the event at uncode->events[hwc->idx] ? Theat way removing the event is trivial. +int thunder_uncore_event_init(struct perf_event *event) +{ + struct hw_perf_event *hwc = >hw; + struct thunder_uncore *uncore; + + if (event->attr.type != event->pmu->type) + return -ENOENT; + + /* we do not support sampling */ + if (is_sampling_event(event)) + return -EINVAL; + + /* counters do not have these bits */ + if (event->attr.exclude_user || + event->attr.exclude_kernel || + event->attr.exclude_host || + event->attr.exclude_guest|| + event->attr.exclude_hv || + event->attr.exclude_idle) + return -EINVAL; We should _really_ make these features opt-in at the core level. It's crazy that each and every PMU drivers has to explicitly test and reject things it doesn't support. + + /* and we do not enable counter overflow interrupts */ That statement raises far more questions than it answers. _why_ do we not user overflow interrupts? As stated above, there are *no* overflow interrupts. The events we are counting cannot be attributed to any one (or even any) CPU, they run asynchronous to the CPU, so even if there were interrupts, they would be meaningless. David Daney