Jonathan Chew wrote:
>
> Can you please explain what you mean by CPU, memory, and I/O sensitive?
- A CPU sensitive application can be identified by CPU utilization: high
  CPU utilization means the application is CPU sensitive.
- A memory sensitive application must be CPU sensitive. Besides that, it
  should have a high memory access rate (the DMA case is excluded). A good
  counter-example is a simple loop-forever application: it is CPU sensitive
  but not memory sensitive.
- An I/O sensitive application has a high CPI (cycles per instruction)
  number, and since it waits for I/O requests, its CPU utilization is low.

> What do these have to do with memory locality?

These help to figure out whether a performance issue is caused by memory
locality. If an application is memory sensitive and has RMA, there should be
a relationship between memory locality and the reduced performance.

>>> So we have the following metrics:
>>>
>>> 1) sysload - cpu sensitive
>
> What do you mean by "sysload"?

It's CPU utilization.

>>> 2) LLC Miss per Instruction - memory sensitive
>
> So, is a memory sensitive thread one that has low or high LLC miss per
> instruction?

A memory sensitive thread has a high LLC miss per instruction.

>>> After we figure out the application is memory-sensitive, we'll check
>>> memory locality metrics to see what is the performance regression cause.
>
> How will you do that? Do you mean that you will try to use the four
> metrics that you have listed here to determine the cause?

Right, the four metrics are able to determine whether the regression is
caused by high RMA.

>>> 3) LLC Latency Ratio (Average Latency for LLC Miss / Local Memory Access
>>> Latency)
>
> Will the latency for each LLC miss be measured then?

No, we don't measure the latency of each LLC miss. Instead, we'll give a
report of the memory latency distribution. This is based on the PEBS work;
Krish, Yao, or Sam can give more details. It is sort of like quantize in
DTrace. We can give the distribution per application or per thread.

> Is the local memory latency the *ideal* local memory latency when the system
> is unloaded or the *current* local memory latency which may be higher than
> the ideal because of load?

If the memory traffic is under the bandwidth threshold, the local memory
latency stays within a range. It is not an exact number; it can be an
average or a range.

>>> 4) Source distribution for LLC miss:
>>>    4.1) LMA/(Total LLC Miss Retired)%
>>>    4.2) RMA/(Total LLC Miss Retired)%
>
> Will these ratios be given for each NUMA node, the whole system, or both?

We can give both.

>>> Here, 4.2) could be separated into different % onto different NUMA node
>>> hop.
>
> Do you mean that the total RMA will be broken down into percentage of
> remote memory accesses to each NUMA node from a given NUMA node?

Breaking it down per node is not necessary if two nodes are the same number
of hops away from the home node. The total RMA can be broken down into
percentages per NUMA node hop distance.

>>> NUMAtop should have a useful report to show how effective the
>>> application is using the local memory.
>
> I think that someone already pointed out that you don't seem to mention
> anything about where the thread runs as part of your proposal even
> though that is pretty important in figuring out how effective a thread
> is using local memory. The thread won't be very effective using local
> memory if it never runs on CPUs where its local memory lives.

Yeah, this is important and can easily be put into the report.
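By the way, to make the metric arithmetic above concrete, here is a minimal
sketch; the counter values and thresholds are made-up placeholders for
whatever the PEBS/libcpc sampling ends up providing, and only show how the
four metrics are derived from the raw counts:

    /*
     * Minimal sketch of the metric arithmetic (not NUMAtop code).
     * All counter values and thresholds are made-up placeholders for one
     * thread over one sampling interval.
     */
    #include <stdio.h>

    int
    main(void)
    {
            double instr_retired = 4.0e9;   /* instructions retired */
            double llc_miss      = 2.0e7;   /* LLC misses retired */
            double lma           = 1.2e7;   /* local memory accesses */
            double rma           = 0.8e7;   /* remote memory accesses */
            double avg_miss_lat  = 240.0;   /* ns, avg latency of LLC misses */
            double local_lat     = 90.0;    /* ns, expected local latency */

            double mpi       = llc_miss / instr_retired;    /* metric 2 */
            double lat_ratio = avg_miss_lat / local_lat;    /* metric 3 */
            double lma_pct   = lma / llc_miss * 100.0;      /* metric 4.1 */
            double rma_pct   = rma / llc_miss * 100.0;      /* metric 4.2 */

            printf("LLC miss per instruction: %.6f\n", mpi);
            printf("LLC latency ratio:        %.2f\n", lat_ratio);
            printf("LMA%%: %.1f   RMA%%: %.1f\n", lma_pct, rma_pct);

            /* made-up thresholds, only to show how a thread would get flagged */
            if (mpi > 0.001 && rma_pct > 30.0 && lat_ratio > 2.0)
                    printf("memory sensitive, likely hurt by remote accesses\n");

            return (0);
    }

In the real tool the inputs would of course come from per-thread sampling
over an interval rather than hard-coded numbers.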
> Also, the memory allocation policy may matter too. For example, a
> thread may access remote memory a lot if it is accessing shared memory
> because the default memory allocation policy for shared memory is to
> spread it out by allocating it randomly across lgroups.

Right, that's exactly the part we need to figure out.

>>> We need PEBS framework to implement the metrics of NUMATOP. We need MPO
>>> sponsor and libcpc dtrace provider sponsor to figure out where is not
>>> effective and why.
>
> Ok.

>>> A better memory placement strategy suggestion is also a valuable goal of
>>> NUMATOP.
>
> How are you proposing to do that?
>
> Jonathan

As you mentioned above, we'll analyze all the NUMA-related CPU and memory
behavior in the kernel. Scheduling and memory allocation policy should be
the keys to the suggestion.

For example, as for scheduling: when a thread is dispatched, is it always
scheduled onto its home lgroup? When an idle CPU is chosen, does the
scheduler look within the same lgroup first? Does cmt_balance() migrate the
thread within the home lgroup first? If the home lgroup is not available,
does it choose a CPU from the nearest node?

As for memory allocation: is the default policy NUMA-friendly enough for
private memory and for shared memory?

There will be another bunch of questions once we dig into the kernel.
DTrace will be a good friend to help us find the answers and give
suggestions to the customer.

Thanks,
-Aubrey
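P.S. On Jonathan's point about where the thread runs versus where its memory
lives: a rough userland illustration of the kind of per-thread check the
report needs can be done with lgrp_home(3LGRP) and meminfo(2). This is only
a sketch with an arbitrary buffer, not NUMAtop code:

    /*
     * Rough illustration only: compare the calling thread's home lgroup
     * with the lgroup a buffer's pages actually live on, using
     * lgrp_home(3LGRP) and meminfo(2). Compile with -llgrp.
     */
    #include <sys/types.h>
    #include <sys/mman.h>
    #include <sys/procset.h>
    #include <sys/lgrp_user.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int
    main(void)
    {
            size_t sz = 1024 * 1024;
            char *buf = malloc(sz);
            uint64_t addr, out;
            uint_t req = MEMINFO_VLGRP;  /* which lgroup backs the address */
            uint_t valid;
            lgrp_id_t home;

            if (buf == NULL)
                    return (1);
            (void) memset(buf, 0, sz);   /* touch the pages so they are allocated */

            home = lgrp_home(P_LWPID, P_MYID);
            addr = (uint64_t)(uintptr_t)buf;

            if (meminfo(&addr, 1, &req, 1, &out, &valid) == 0 &&
                (valid & 0x3) == 0x3) {
                    (void) printf("thread home lgroup %d, buffer lgroup %d %s\n",
                        (int)home, (int)out,
                        (lgrp_id_t)out == home ? "(local)" : "(remote)");
            }

            free(buf);
            return (0);
    }

The real report would do this per mapping rather than for one buffer, and
correlate it with the LMA/RMA counters above.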