> Hi All, 
> 
> When I use cputrack to track one process on a NUMA system
> (like NHM-EX), I want to see performance events such as
> "RMA" (Remote Memory Access) incurred by the process.
> 
> cputrack can tell me the RMA count that the process incurs
> on the whole system, e.g.: in the last 5s, it incurred
> 1,000 RMAs across all 4 nodes (4 sockets).
> 
> But sometimes I want to know the RMA cost per node,
> e.g., how many RMAs the process incurs on node1, and how
> many on node2?
> 
> cputrack can't give me the above result because cpc doesn't
> support separating performance counter values per CPU for a
> thread/process. So I want to provide a patch to enhance cpc
> to support this feature.
> 
> Does anybody think it's valuable? 
> 
> Thanks
> Jin Yao

Oh, it looks like the patch idea has attracted little interest so far.
Please allow me to give another example to show its value.

We ran the STREAM benchmark twice on a 4-socket system. STREAM uses OpenMP
for parallelism and creates the specified number of threads to do the
computation; all computing threads must synchronize at a barrier, as
sketched below.
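
Here is a minimal sketch of that pattern (not the real STREAM code; the
array size is arbitrary). Each OpenMP thread computes a slice of a triad
loop, and the implicit barrier at the end of the loop makes every thread
wait for the slowest one:

#include <omp.h>

#define	N	(1 << 20)

static double a[N], b[N], c[N];

int
main(void)
{
	double scalar = 3.0;
	int i;

	/* triad kernel: each thread works on its own chunk of the arrays */
#pragma omp parallel for
	for (i = 0; i < N; i++)
		a[i] = b[i] + scalar * c[i];
	/*
	 * Implicit barrier here: every thread waits for the slowest one,
	 * so a single slow (e.g. migrated-off) thread delays the team.
	 */

	return (0);
}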

We see a big performance variation (10%) between the two runs:
1. In the better run, all the computing threads run on their own home
   lgroup.
2. In the worse run, some of the threads are migrated from their home
   lgroup to other lgroups. A DTrace script confirms thread migration
   between lgroups in the worse run (a user-level cross-check is
   sketched below).
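
As a user-level cross-check of the migration (a sketch; this is not the
DTrace script we used), each OpenMP thread can periodically sample
getcpuid(3C) and report whenever it moves to a different CPU. Mapping the
CPU ids back to lgroups can then be done with liblgrp (lgrp_cpus(3LGRP)):

#include <sys/processor.h>	/* getcpuid() */
#include <omp.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
#pragma omp parallel
	{
		processorid_t last = getcpuid();
		int i;

		for (i = 0; i < 1000; i++) {
			processorid_t now = getcpuid();

			if (now != last) {
				(void) printf("thread %d: cpu %d -> cpu %d\n",
				    omp_get_thread_num(), (int)last,
				    (int)now);
				last = now;
			}
			(void) usleep(1000);	/* sample every 1 ms */
		}
	}
	return (0);
}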
   
We guess the root cause is that the threads running on their home lgroup
run faster than the migrated-off threads, but the fast threads have to
wait for the slow threads to complete their jobs during the barrier phase.

If the above guess is true, the migrated-off threads should incur a lot of
RMAs (Remote Memory Accesses) on the other nodes when accessing the memory
on their home lgroup.

Unfortunately, we don't have such data to support this, because in the
current cpc implementation we can only get the total RMA count of a thread
across all CPUs/nodes. There is no way to separate the counts per CPU/node
for a thread, so we don't know how many RMAs the migrated-off threads
incur on the other lgroups. That's why I want to provide a small patch to
enhance cpc; a purely hypothetical sketch of the kind of interface I have
in mind follows below.
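
To make the proposal concrete, here is a purely hypothetical sketch of the
kind of interface the enhanced cpc could expose; none of the names below
exist in libcpc(3LIB) today, they only illustrate the idea of sampling a
per-CPU breakdown instead of a single aggregate value:

#include <sys/processor.h>
#include <libcpc.h>
#include <stdio.h>

/* HYPOTHETICAL: one counter value per CPU the LWP ran on */
typedef struct cpc_percpu_val {
	processorid_t	cpu;	/* CPU the events were counted on */
	uint64_t	val;	/* events counted while running there */
} cpc_percpu_val_t;

/*
 * HYPOTHETICAL: like cpc_buf_get(), but fills 'vals' with up to 'n'
 * per-CPU values and returns how many entries were filled in.
 */
extern int cpc_buf_get_percpu(cpc_t *cpc, cpc_buf_t *buf, int index,
    cpc_percpu_val_t *vals, uint_t n);

void
print_rma_per_cpu(cpc_t *cpc, cpc_buf_t *buf, int idx)
{
	cpc_percpu_val_t vals[256];
	int i, n;

	n = cpc_buf_get_percpu(cpc, buf, idx, vals, 256);
	for (i = 0; i < n; i++)
		(void) printf("cpu %3d: %llu RMAs\n", (int)vals[i].cpu,
		    (unsigned long long)vals[i].val);
}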

BTW, this raises another question: why does the scheduler migrate these
threads from their home lgroup to other lgroups? That may be an
interesting topic worth digging into later.

Thanks
Jin Yao