johansen wrote:
>
>On Wed, Dec 16, 2009 at 09:17:43PM -0800, Krishnendu Sadhukhan wrote:
>> I'd like to request sponsorship from the performance community to host a
>> project for NUMAtop for OpenSolaris.

Many thanks to Krish for the sponsorship.
>>
>> NUMAtop is a tool developed by Intel to help developers identify
>> memory locality in NUMA systems. It's a top-like utility that shows
>> the top N processes in the system and their memory locality, with
>> the processes that have the worst memory locality at the top of the
>> list. Developers can use that data to make their applications
>> NUMA-aware and thus improve their applications' performance.
>
>Will this just focus on memory locality?  Scheduling is an integral part
>of NUMA performance.  A process may have all of its memory allocated
>from one lgrp, but if it's never scheduled to run in the same lgrp as
>its memory, you're obviously not going to see the desired performance.
>It would be great if this tool could give the user a sense of how often
>their lwps were run in any particular lgrp, as well as information about
>where the address space has allocated its memory.
>
>-j
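
For reference, memory placement and scheduling can be correlated
through liblgrp. Below is a minimal sketch (compile with -llgrp) that
prints a process's home lgroup, i.e. the lgroup where the scheduler
prefers to run it and allocate its memory. A real tool would need
per-LWP data as well, which lgrp_home() only exposes for LWPs of the
calling process.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
/* print the home lgroup of a process (sketch, not numatop code) */
#include <stdio.h>
#include <stdlib.h>
#include <sys/lgrp_user.h>

int
main(int argc, char **argv)
{
        pid_t pid;
        lgrp_id_t home;

        if (argc != 2) {
                (void) fprintf(stderr, "usage: %s pid\n", argv[0]);
                return (2);
        }
        pid = (pid_t)atol(argv[1]);
        home = lgrp_home(P_PID, pid);
        if (home == -1) {
                perror("lgrp_home");
                return (1);
        }
        (void) printf("pid %ld home lgroup: %ld\n",
            (long)pid, (long)home);
        return (0);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -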

We have a prototype draft ready and would like to contribute it to the
community once Krish sets up a source gate for us. What we have done is
really just a start, and we would of course welcome suggestions and
feedback from the community. Below is the prototype feature list; Yao
and Sam can give more details.

We really appreciate any comments.

Thanks,
-Aubrey

numatop feature proposal
=============================

1.1. Identify the top N processes, sorted by RMA count;
LMA, IR, CPI and SYSLOAD are also reported for these processes.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
No    pid      name    rma(k)     lma(k)     ir(m)      cpi    sysload
1     101365   java    44774.3    47250.4    95087.6    1.0    99.1%
2     100764   java    2.0        2.3        1.3        2.8    0.0%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in process context,
sysload = (cycles used by the process on all nodes) / (total cycles
on all cores).
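
The same ratio recurs in sections 1.2-1.4 below, with the numerator and
denominator narrowed to a specific LWP and/or node. A minimal sketch of
the computation (illustrative helper, not numatop code):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdio.h>
#include <stdint.h>

/*
 * sysload as a percentage: cycles consumed by the tracked entity
 * (process or LWP, on all nodes or on one node) over the cycles that
 * elapsed on the cores being considered.
 */
static double
sysload_pct(uint64_t used_cycles, uint64_t elapsed_cycles)
{
        if (elapsed_cycles == 0)
                return (0.0);
        return (100.0 * (double)used_cycles / (double)elapsed_cycles);
}

int
main(void)
{
        /* e.g. a process using 991 of every 1000 core cycles: 99.1% */
        (void) printf("%.1f%%\n", sysload_pct(991, 1000));
        return (0);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -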

1.2. Attach to one process (PID) and report detailed NUMA migration info.
RMA, LMA, IR and SYSLOAD are also reported per node for this process.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365****
Nid   rma(k)     lma(k)     ir(m)      sysload
0     22705.5    23194.4    47448.4    49.4%
1     22133.6    24182.1    47046.1    49.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in process context,
sysload = (cycles used by the process on the specific node) / (total
cycles on all cores of that node).

1.3. Attach to one process and report the top N LWPs, sorted by RMA
count; LMA, IR, CPI and SYSLOAD are also reported for these LWPs.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365****
No    lwpid   rma(k)    lma(k)    ir(m)      cpi    sysload
0     23      6425.2    5462.4    11235.7    1.0    12.3%
1     26      5749.4    5517.4    11706.9    1.0    12.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in LWP context,
sysload = (cycles used by the LWP on all nodes) / (total cycles on all
cores).

1.4. Drill into one LWP of a process and report detailed NUMA migration
info. RMA, LMA, IR, CPI and SYSLOAD are also reported per node for this
LWP.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
****attach to pid 101365, dig into lwpid 23****
Nid   rma(k)    lma(k)    ir(m)      cpi    sysload
0     0.0       0.0       0.0        0.0    0.0%
1     6405.9    5479.1    11215.8    1.0    12.2%
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Here, in LWP context,
sysload = (cycles used by the LWP on the specific node) / (total cycles
on all cores of that node).

1.5. Report node traffic. Here, RMA, LMA and IR are reported per node,
aggregated across all tracked processes and LWPs.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Nid   rma(k)     lma(k)     ir(m)
0     23020.5    23453.3    48310.1
1     22132.8    23971.8    46869.3
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

1.6. Report node info.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
node 0: cpus(0, 2, 4, 6); mem = 6.0G, free = 4.7G
node 1: cpus(1, 3, 5, 7); mem = 6.0G, free = 4.5G
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
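
A sketch of how this node info can be gathered from liblgrp (compile
with -llgrp; assumes a flat topology where the root lgroup's children
are the memory nodes, which holds on a typical two-node box):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdio.h>
#include <sys/lgrp_user.h>

int
main(void)
{
        lgrp_cookie_t c = lgrp_init(LGRP_VIEW_OS);
        lgrp_id_t kids[64];
        processorid_t cpus[256];
        int nkids, ncpus, i, j;

        if (c == LGRP_COOKIE_NONE) {
                perror("lgrp_init");
                return (1);
        }
        nkids = lgrp_children(c, lgrp_root(c), kids, 64);
        for (i = 0; i < nkids; i++) {
                lgrp_mem_size_t mem = lgrp_mem_size(c, kids[i],
                    LGRP_MEM_SZ_INSTALLED, LGRP_CONTENT_DIRECT);
                lgrp_mem_size_t freemem = lgrp_mem_size(c, kids[i],
                    LGRP_MEM_SZ_FREE, LGRP_CONTENT_DIRECT);

                ncpus = lgrp_cpus(c, kids[i], cpus, 256,
                    LGRP_CONTENT_DIRECT);
                (void) printf("node %ld: cpus(", (long)kids[i]);
                for (j = 0; j < ncpus; j++)
                        (void) printf("%s%d", j ? ", " : "", cpus[j]);
                (void) printf("); mem = %.1fG, free = %.1fG\n",
                    (double)mem / (1 << 30), (double)freemem / (1 << 30));
        }
        lgrp_fini(c);
        return (0);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -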

Here,
  RMA: remote memory accesses
  LMA: local memory accesses
  IR: instructions retired
  CPI: cycles per instruction
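
Two derived quantities follow directly from these columns (the
locality ratio is an illustrative derived metric, not a numatop
column):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdio.h>

int
main(void)
{
        /* row 1 of table 1.1: rma/lma are in thousands (k), ir in
         * millions (m), so scale before mixing them */
        double rma = 44774.3e3, lma = 47250.4e3, ir = 95087.6e6;
        double cpi = 1.0;

        /* fraction of memory accesses that stayed local */
        (void) printf("locality = %.1f%%\n", 100.0 * lma / (lma + rma));
        /* CPI ties cycles to instructions: cycles = IR * CPI */
        (void) printf("approx cycles = %.1fm\n", ir * cpi / 1e6);
        return (0);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -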

Solaris kernel and library enhancements for numatop
====================================================
2.1. libcpc

2.2. Solaris kernel cpc module

The current kernel cpc support does not break a thread's performance
counter values down by the logical CPUs the thread migrates across. We
implement this functionality and expose it to the user through a new
flag on the cpc system call.
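
For context, a minimal sketch of how a consumer drives today's libcpc
interface (compile with -lcpc; the event name is platform-specific and
illustrative, and the proposed per-CPU flag is not yet defined, so only
existing flags appear here):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdio.h>
#include <libcpc.h>

int
main(void)
{
        cpc_t *cpc;
        cpc_set_t *set;
        cpc_buf_t *buf;
        int idx;
        uint64_t val;

        if ((cpc = cpc_open(CPC_VER_CURRENT)) == NULL) {
                perror("cpc_open");
                return (1);
        }
        set = cpc_set_create(cpc);
        /* a real numatop would request the platform's RMA/LMA events
         * here instead of this generic instruction-count event */
        idx = cpc_set_add_request(cpc, set, "PAPI_tot_ins", 0,
            CPC_COUNT_USER | CPC_COUNT_SYSTEM, 0, NULL);
        buf = cpc_buf_create(cpc, set);

        (void) cpc_bind_curlwp(cpc, set, 0);
        /* ... workload of interest runs here ... */
        (void) cpc_set_sample(cpc, set, buf);
        (void) cpc_buf_get(cpc, buf, idx, &val);
        (void) printf("count: %llu\n", (unsigned long long)val);

        (void) cpc_unbind(cpc, set);
        cpc_close(cpc);
        return (0);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -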

-------------------------------------------------------
An example of how the current kcpc works:

3.1. When thread A is switched onto cpu0, the kernel programs the
   hardware performance counter on cpu0 to enable the RMA counter.

3.2. When thread A is switched off cpu0, the kernel samples the value
   from the performance counter and stops it on cpu0, then adds the
   sampled value to a structure belonging to thread A.

3.3. If thread A later migrates to cpu1, the same steps apply, with
   cpu0 replaced by cpu1.

The total RMA of thread A is thus stored in a single private structure
of thread A. The caller gets thread A's RMA value via the libcpc
interface; libcpc reads it from thread A's private structure and
returns it to the caller.
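
A sketch of the accumulation described above (hypothetical helper
names, not actual Solaris kcpc code):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdint.h>

#define NCOUNTERS       4

/* hypothetical hardware helpers standing in for the real kcpc ones */
extern void hw_program_and_start(int cpu);
extern uint64_t hw_sample_and_stop(int cpu, int counter);

typedef struct thread_cpc_ctx {
        uint64_t total[NCOUNTERS];      /* one running total per counter */
} thread_cpc_ctx_t;

/* 3.1/3.3: thread switched onto a cpu */
void
ctx_switch_on(int cpu)
{
        hw_program_and_start(cpu);
}

/* 3.2: thread switched off a cpu; fold the samples into the totals,
 * which discards the per-CPU breakdown */
void
ctx_switch_off(thread_cpc_ctx_t *ctx, int cpu)
{
        int i;

        for (i = 0; i < NCOUNTERS; i++)
                ctx->total[i] += hw_sample_and_stop(cpu, i);
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -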

Our enhancement:

4.1. kcpc allocates a slot array big enough to store the values of all
   counters on all logical CPUs. Each slot stores one counter's value
   on one CPU.

4.2. When thread A is switched onto cpuN, the kernel programs the
   counters on cpuN to start counting.

   When thread A is switched off cpuN, the kernel samples and stops
   the counters on cpuN, and adds the sampled values to the slots
   corresponding to those counters on cpuN.

4.3. The kernel copies the contents of the slot array out to the
   caller.
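
A sketch of the proposed per-CPU slot array (again with hypothetical
names, not actual kcpc code), which preserves the per-CPU breakdown
that the current scheme in 3.1-3.3 loses:
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
#include <stdint.h>
#include <string.h>

#define NCPUS           8
#define NCOUNTERS       4

/* hypothetical hardware helper standing in for the real kcpc one */
extern uint64_t hw_sample_and_stop(int cpu, int counter);

typedef struct thread_cpc_percpu_ctx {
        /* 4.1: one slot per (cpu, counter) pair */
        uint64_t slots[NCPUS][NCOUNTERS];
} thread_cpc_percpu_ctx_t;

/* 4.2: on switch-off, fold the samples into this cpu's slots only */
void
ctx_switch_off(thread_cpc_percpu_ctx_t *ctx, int cpu)
{
        int i;

        for (i = 0; i < NCOUNTERS; i++)
                ctx->slots[cpu][i] += hw_sample_and_stop(cpu, i);
}

/* 4.3: copy the whole slot array out to the caller */
void
ctx_read(const thread_cpc_percpu_ctx_t *ctx, uint64_t *out)
{
        (void) memcpy(out, ctx->slots, sizeof (ctx->slots));
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -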