I've routinely done a similar thing on Barcelonas. Given the lack of dedicated uncore counters on the AMD, a system-wide session started by a process pinned to one core measures 4 events in the shared resources (L3 cache, memory controller, HT). A second session constrained to threads that run on the other three cores measures a set of on-core events.
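To make that layout concrete, here is a minimal sketch of how the two sessions divide one socket. The function name, the dictionary keys, and the event names are placeholders invented for illustration. They are not perfmon or pfmon identifiers, and the limit of four events simply mirrors the four events mentioned above.

```python
def plan_sessions(socket_cores, shared_events, core_events, max_events=4):
    """Split one socket into a shared-resource session and an on-core session.

    The first core in the list hosts the pinned, system-wide session that
    counts shared-resource events; the remaining cores count on-core events.
    """
    if len(socket_cores) < 2:
        raise ValueError("need at least two cores on the socket")
    if len(shared_events) > max_events:
        raise ValueError("too many shared-resource events for one core")
    owner, others = socket_cores[0], socket_cores[1:]
    return {
        "shared": {"cpus": [owner], "events": list(shared_events)},
        "on_core": {"cpus": list(others), "events": list(core_events)},
    }


if __name__ == "__main__":
    # Hypothetical event names, used only to show the shape of the plan.
    plan = plan_sessions(
        socket_cores=[0, 1, 2, 3],
        shared_events=["L3_MISS", "DRAM_ACCESS", "HT_LINK_TX", "L3_EVICT"],
        core_events=["CYCLES", "INSTRUCTIONS"],
    )
    for name, session in plan.items():
        print(name, session["cpus"], session["events"])
```

The point of the split is that no core ever carries both kinds of events, so the two sessions never compete for the same counters.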
The semantics of uncore events are such that trying to associate these events with a specific process/thread is of limited use and generates more confusion than useful information. For example, suppose I count L3 misses in a particular region of one thread, either using calipers or sampling. The counter will be incremented by any miss across the chip, but only while that thread is running. The count (or number of samples) attributed to any section of the code thus will have no relation to the behavior of the code in that section. The count seen by any thread is also only a lower bound on the total number of events that occurred.

In the AMD implementation, Perfmon will let one look at L3 misses in all the threads/processes in an application. In that case, each time there is a miss, all of the counters in the active threads are incremented. Thus, the maximum number of misses seen by any thread is still a lower bound on the global number. If at least one thread in an application is active from start to finish, then the sum of the counts across all threads is a very weak upper bound that is also no more than T times the global count, where T is the number of threads. To reiterate, not every event is seen, many events are seen multiple times, and there's at best a very weak link between events and code elements.

If the number of events in the shared resources is high and your performance is bad, this is a problem only if the event rate is high enough to be a bottleneck, i.e., the resource is fully utilized, and the thread you are measuring is being delayed because it is waiting on the resource. Establishing that requires quantitative accuracy.

On the Intels, I'd like to see the option of creating a system-wide "uncore" session from one core (e.g., from a process pinned to a socket or a core on a socket), and independently creating other sessions that use on-core events/counters. It would be gravy if the "uncore" session could allow other threads on the socket to examine, but not write or control, the uncore counters.
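The counting argument above (the largest per-thread count is a lower bound on the chip-wide count, and the sum over T threads is only a weak upper bound) can be checked with a toy model. This is only a sketch under simplifying assumptions of my own: time is cut into slices, at most one miss happens per slice, and each thread is active in a slice with a fixed probability. None of the names or parameters come from perfmon.

```python
import random


def simulate(num_threads=4, num_slices=10_000, miss_prob=0.3,
             active_prob=0.5, always_on=False, seed=0):
    """Return (global_count, per_thread_counts) for a toy shared counter.

    A miss anywhere on the chip increments the counter of every thread
    that is active in that time slice, and of no other thread.
    """
    rng = random.Random(seed)
    per_thread = [0] * num_threads
    global_count = 0
    for _ in range(num_slices):
        active = [rng.random() < active_prob for _ in range(num_threads)]
        if always_on:
            active[0] = True  # thread 0 runs from start to finish
        if rng.random() < miss_prob:
            global_count += 1
            for t in range(num_threads):
                if active[t]:
                    per_thread[t] += 1
    return global_count, per_thread


if __name__ == "__main__":
    for always_on in (False, True):
        g, per = simulate(always_on=always_on)
        print(f"always_on={always_on} global={g} max={max(per)} "
              f"sum={sum(per)} T*global={len(per) * g}")
```

In this model the maximum never exceeds the global count. When one thread is active throughout, the sum lands somewhere between the global count and T times the global count, which is why it says so little.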

Dan Terpstra wrote:
> Interestingly, there's a guy at ZIH Dresden who implemented a PAPI-C
> component specifically to measure events on the uncore. He never tried to
> measure both per-thread and uncore at the same time, and I doubt that it
> would work, but found it intriguing that he was able to get reasonable data.
> Like the dancing dog, it's not how well he dances, but that he dances at
> all...
> - d
>
>> -----Original Message-----
>> From: stephane eranian [mailto:eran...@googlemail.com]
>> Sent: Wednesday, September 23, 2009 5:12 AM
>> To: gary.m...@bull.com
>> Cc: perfmon2-devel@lists.sourceforge.net
>> Subject: Re: [perfmon2] Monitoring core and uncore events in the same
>> testrun.
>>
>> Gary,
>>
>> Sorry for the delay.
>>
>> The reason there is a restriction with uncore PMU is because it is shared
>> by all cores on the socket. Given the model used by perfmon, i.e., event
>> are assigned to counters in user space, the kernel needs to enforce some
>> access control to ensure no two sessions try to use the same resource,
>> here uncore registers.
>>
>> The current implementation uses a coarse-grain access control policy:
>> - only system-wide sessions can access uncore PMU
>> - the first session to access uncore PMU, grabs it all
>>
>> The core and uncore PMU do not share any resource except the interrupt
>> vector. Theoretically we could allow distinct uncore and core sessions.
>>
>> Some people have also argued that allowing uncore access to per-thread
>> sessions may also be beneficial. The reason being that you'd want to know
>> what is going on around you. It could be hinting at what you are
>> experiencing in your core. I believe this is similar to what you are
>> trying to do with your measurement. I think this is a perfectly good
>> reason to do this.
>>
>> Going back to your example of a system-wide session, I think it would be
>> easier to add enough smart to the tool to suppress uncore events to all
>> but the first cpu of each socket given the list of monitored cpus (either
>> all or --cpu-list). I think adding this to pfmon may not so trivial
>> because of internal data structures, but it is doable.
>>
>> The alternative has some problems because you would not return an error
>> when the uncore registers are written. Thus applications would not be able
>> to tell apart whether a zero value on read is because no event occurred or
>> because the event was suppressed.
>>
>> Another alternative would be to consider uncore session as a third kind of
>> sessions distinct from system-wide. We would allow uncore sessions when
>> there are per-thread and system-wide sessions. uncore sessions would only
>> support uncore events, of course. You would need a distinct pfmon session
>> for them.
>>
>>
>> On Fri, Sep 18, 2009 at 8:41 PM, <gary.m...@bull.com> wrote:
>>> Stephane
>>>
>>> We would like to be able to collect both core and uncore counters with
>>> pfmon during the same test run. This works (if you are careful) as shown
>>> below:
>>>
>>> [kirk] (hpctk) test_cases> pfmon --system-wide -u -k --cpu-list 0,1 -e UNC_LLC_MISS:READ,UNHALTED_CORE_CYCLES,INSTRUCTIONS_RETIRED,FP_COMP_OPS_EXE:SSE_FP ./LoopTest
>>>
>>> .... application dribble ....
>>>
>>> CPU0    12080 UNC_LLC_MISS:READ
>>> CPU0    26709 UNHALTED_CORE_CYCLES
>>> CPU0     9766 INSTRUCTIONS_RETIRED
>>> CPU0        0 FP_COMP_OPS_EXE:SSE_FP
>>> CPU1      197 UNC_LLC_MISS:READ
>>> CPU1    29020 UNHALTED_CORE_CYCLES
>>> CPU1    10715 INSTRUCTIONS_RETIRED
>>> CPU1        0 FP_COMP_OPS_EXE:SSE_FP
>>>
>>> But our system also has cpu cores 2-15 which can not be included in the
>>> cpu list because they share the same cpu socket as 0 or 1 so the uncore
>>> event causes a problem creating the perfmon session on behalf of those
>>> cpu cores.
>>>
>>> Would it be possible for pfmon to detect when multiple cpu cores on the
>>> same socket are included in the cpu list then only put the uncore events
>>> in the event list used when creating a session to the first cpu core on
>>> that socket. Then sessions to other cpu cores that share the same socket
>>> would contain only the core events so that perfmon would allow sessions
>>> to all the cores.
>>>
>>> One other possible approach I considered is to leave pfmon alone and
>>> change perfmon to just remove the uncore event from the event list when
>>> the session is created to the second cpu core on the same socket. This
>>> could possibly be done where the error is currently being detected and
>>> then allow the session to be created with a subset of the events (minus
>>> all uncore events) requested by the caller.
>>>
>>> If either of these approaches could be implemented it would make it
>>> possible for us to get all the data we need in a single test run (and
>>> that makes sure the data is consistent and complete).
>>>
>>> Just interested in your thoughts.
>>> Gary
>>>
>>>
>>> ------------------------------------------------------------------------------
>>> Come build with us! The BlackBerry® Developer Conference in SF, CA
>>> is the only developer event you need to attend this year. Jumpstart your
>>> developing skills, take BlackBerry mobile applications to market and stay
>>> ahead of the curve. Join us from November 9-12, 2009. Register now!
>>> http://p.sf.net/sfu/devconf
>>> _______________________________________________
>>> perfmon2-devel mailing list
>>> perfmon2-devel@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/perfmon2-devel

--
Robert J. Fowler
Chief Domain Scientist, HPC
Renaissance Computing Institute
The University of North Carolina at Chapel Hill
100 Europa Dr, Suite 540
Chapel Hill, NC 27517
V: 919.445.9670
F: 919 445.9669
r...@renci.org