johansen wrote:
>
> On Wed, Dec 16, 2009 at 09:17:43PM -0800, Krishnendu Sadhukhan wrote:
>> I'd like to request sponsorship from the performance community to host a
>> project for NUMAtop for OpenSolaris.
Many thanks to Krish for the sponsorship.

>> NUMAtop is a tool developed by Intel to help developers identify
>> memory locality in NUMA systems. It's a top-like utility that shows
>> the top N processes in the system and their memory locality, with
>> those processes that have the worst memory locality at the top of
>> the list. Developers can use that data to make their applications
>> NUMA-aware and thus improve their application performance.
>
> Will this just focus on memory locality? Scheduling is an integral part
> of NUMA performance. A process may have all of its memory allocated
> from one lgrp, but if it's never scheduled to run in the same lgrp as
> its memory, you're obviously not going to see the desired performance.
> It would be great if this tool could give the user a sense of how often
> their lwps were run in any particular lgrp, as well as information about
> where the address space has allocated its memory.
>
> -j

We have a prototype draft ready and would like to contribute it to the
community once Krish sets up a source gate for us. What we have done is
really just a start, and we would welcome more suggestions and feedback
from the community. Here is the prototype feature list; Yao and Sam can
give more details. We really appreciate any comments.

Thanks,
-Aubrey

numatop feature proposal
=============================

1.1. Identify the top N processes sorted by RMA#. LMA, IR, CPI and
SYSLOAD are also reported for these processes.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
No  pid     name  rma(k)   lma(k)   ir(m)    cpi  sysload
1   101365  java  44774.3  47250.4  95087.6  1.0  99.1%
2   100764  java  2.0      2.3      1.3      2.8  0.0%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in process context,
sysload = (all cycles used by one process on all nodes) /
          (all cycles on all cores).

1.2. Attach to one process (PID) and report detailed NUMA migration
info. RMA, LMA, IR and SYSLOAD are also reported for this process.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365****
Nid  rma(k)   lma(k)   ir(m)    sysload
0    22705.5  23194.4  47448.4  49.4%
1    22133.6  24182.1  47046.1  49.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in process context,
sysload = (all cycles used by one process on the specific node) /
          (all cycles on all cores of this specific node).

1.3. Attach to one process and report the top N LWPs sorted by RMA#.
LMA, IR, CPI and SYSLOAD are also reported for these LWPs.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365****
No  lwpid  rma(k)  lma(k)  ir(m)    cpi  sysload
0   23     6425.2  5462.4  11235.7  1.0  12.3%
1   26     5749.4  5517.4  11706.9  1.0  12.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in LWP context,
sysload = (all cycles used by the thread on all nodes) /
          (all cycles on all cores).

1.4. Dig into one LWP of a process and report detailed NUMA migration
info. RMA, LMA, IR, CPI and SYSLOAD are also reported for this LWP.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365, dig into lwpid 23****
Nid  rma(k)  lma(k)  ir(m)    cpi  sysload
0    0.0     0.0     0.0      0.0  0.0%
1    6405.9  5479.1  11215.8  1.0  12.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in LWP context,
sysload = (all cycles used by this thread on the specific node) /
          (all cycles on all cores of this specific node).
A small sketch of these sysload/CPI formulas follows below.
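To make the sysload and CPI definitions in 1.1-1.4 concrete, here is a
minimal C sketch (hypothetical helper names, not part of the numatop
prototype) of how the two ratios could be computed from raw cycle and
instruction counts:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdint.h>

/*
 * Hypothetical helpers illustrating the definitions in 1.1-1.4;
 * they are not numatop code.
 */

/*
 * sysload as a percentage. cycles_used: cycles consumed by the
 * process/LWP in the sampling window (on all nodes, or on one node,
 * depending on the view). cycles_total: all cycles elapsed on the
 * cores being considered (all cores, or the cores of one node).
 */
static double
sysload_pct(uint64_t cycles_used, uint64_t cycles_total)
{
	if (cycles_total == 0)
		return (0.0);
	return ((double)cycles_used * 100.0 / (double)cycles_total);
}

/* CPI: cycles per retired instruction. */
static double
cpi(uint64_t cycles_used, uint64_t insts_retired)
{
	if (insts_retired == 0)
		return (0.0);
	return ((double)cycles_used / (double)insts_retired);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
For the per-node views in 1.2 and 1.4, cycles_total would be the cycles
elapsed on the cores of the specific node during the interval, which
gives the narrower denominator described above.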
1.5. Report node traffic. Here, RMA, LMA and IR are reported for all
tracked processes and LWPs.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Nid  rma(k)   lma(k)   ir(m)
0    23020.5  23453.3  48310.1
1    22132.8  23971.8  46869.3
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1.6. Report node info.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
node 0: cpus(0, 2, 4, 6); mem = 6.0G, free = 4.7G
node 1: cpus(1, 3, 5, 7); mem = 6.0G, free = 4.5G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Here,
RMA: remote memory access
LMA: local memory access
IR:  instructions retired
CPI: cycles per instruction

Solaris kernel and lib enhancement for numatop
===============================================
2.1. libcpc
2.2. solaris kernel cpc module

The current cpc kernel support doesn't separate a thread's performance
counter values among the logical CPUs that the thread migrates to/from.
We implement this functionality and offer a new flag so the user can
request it via the cpc syscall.
-------------------------------------------------------
An example of how the current kcpc works:

3.1. When thread A is switched onto cpu0, the kernel programs the
hardware performance counters on cpu0 to enable the RMA counter.

3.2. When thread A is switched off cpu0, the kernel samples the value
from the performance counter and stops it on cpu0, adding the sampled
value to a per-thread structure of thread A.

3.3. If thread A later migrates to cpu1, the same steps apply, with
cpu0 replaced by cpu1.

The total RMA of thread A is stored in a private structure of thread A.
The caller gets the RMA value of thread A via the libcpc interface;
libcpc reads it from thread A's private structure and returns it to the
caller.

Our enhancement (a rough sketch of the slot array follows below):

4.1. kcpc allocates a space (a slot array) big enough to store the
values of all counters on all logical CPUs. Each slot stores one
counter's value on one CPU.

4.2. When thread A is switched onto cpuN, the kernel programs the
counters on cpuN to start counting. When thread A is switched off cpuN,
the kernel samples and stops the counters on cpuN, adding each sampled
value to the slot corresponding to that counter on cpuN.

4.3. The kernel copies the content of the slot array out.
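As a rough illustration of the slot array in 4.1-4.3 (a simplified
user-level model, not the actual kcpc code; NCPU, NCOUNTER and the
function names are assumptions for this sketch):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <string.h>
#include <stdint.h>

#define	NCPU		8	/* logical CPUs (assumed for the sketch) */
#define	NCOUNTER	4	/* hardware counters tracked per CPU */

/* 4.1: per-thread slot array, one accumulator per counter per CPU. */
typedef struct slot_array {
	uint64_t	slot[NCPU][NCOUNTER];
} slot_array_t;

/*
 * 4.2: called when the thread is switched off cpun; 'sample' holds
 * the values just read from cpun's hardware counters before they
 * were stopped.
 */
static void
slots_accumulate(slot_array_t *sa, int cpun,
    const uint64_t sample[NCOUNTER])
{
	int i;

	for (i = 0; i < NCOUNTER; i++)
		sa->slot[cpun][i] += sample[i];
}

/*
 * 4.3: copy the whole slot array out, so the caller sees each
 * counter's value broken down by the CPUs the thread ran on rather
 * than a single per-thread total.
 */
static void
slots_copyout(const slot_array_t *sa, uint64_t out[NCPU][NCOUNTER])
{
	(void) memcpy(out, sa->slot, sizeof (sa->slot));
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Because each slot is tied to a specific logical CPU, a consumer such as
numatop can map CPUs to nodes and report RMA/LMA per node, which the
single per-thread total kept by the current kcpc cannot provide.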