Some supplements:

On Wed, Feb 24, 2010 at 2:43 PM, Li, Aubrey <aubrey...@intel.com> wrote:
> Jonathan Chew wrote:
>>
>> Can you please explain what you mean by CPU, memory, and I/O sensitive?
>
> - A CPU-sensitive application can be identified by CPU utilization. High CPU
>   utilization means the application is CPU sensitive.
> - A memory-sensitive application must be CPU sensitive. Besides that, it
>   should have a high rate of memory accesses (the DMA case is excluded). A
>   good example of the distinction is a simple loop-forever application: it is
>   CPU sensitive but not memory sensitive.
> - An I/O-sensitive application has a high CPI (cycles per instruction), and
>   since it waits on I/O requests, its CPU utilization is low.

We divided applications into these three categories because only a
memory-sensitive application will be impacted by remote memory access on a
NUMA platform, and NUMAtop will tell the user whether his application is
memory-intensive through sysload and the LLC miss ratio.
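To make the classification concrete, here is a minimal sketch of how the
per-application counter numbers could be turned into one of the three
categories. The thresholds and input values are only illustrative assumptions,
not what NUMAtop will necessarily use.

/*
 * Rough sketch of the CPU/memory/I/O-sensitivity classification described
 * above. The inputs would come from the cpc/PEBS counters; the thresholds
 * are illustrative only.
 */
#include <stdio.h>
#include <stdint.h>

struct app_sample {
	double		cpu_util;	/* CPU utilization, 0.0 - 1.0 ("sysload") */
	uint64_t	instr_retired;	/* instructions retired in the interval */
	uint64_t	cycles;		/* unhalted core cycles in the interval */
	uint64_t	llc_miss;	/* LLC misses retired in the interval */
};

static const char *
classify(const struct app_sample *s)
{
	double cpi = (double)s->cycles / s->instr_retired;
	double mpi = (double)s->llc_miss / s->instr_retired; /* LLC miss/instr */

	if (s->cpu_util < 0.30 && cpi > 3.0)
		return ("I/O sensitive");	/* low utilization, high CPI */
	if (s->cpu_util >= 0.70 && mpi > 0.005)
		return ("memory sensitive");	/* CPU sensitive + high LLC MPI */
	if (s->cpu_util >= 0.70)
		return ("CPU sensitive");	/* e.g. a loop-forever app */
	return ("not obviously sensitive");
}

int
main(void)
{
	struct app_sample loop = { 0.99, 1000000000ULL, 1200000000ULL, 50000ULL };
	struct app_sample chase = { 0.95, 200000000ULL, 1800000000ULL, 12000000ULL };

	(void) printf("loop-forever app:    %s\n", classify(&loop));
	(void) printf("pointer-chasing app: %s\n", classify(&chase));
	return (0);
}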
>> What do these have to do with memory locality?
>
> These help to figure out whether the performance issue is caused by memory
> locality. If an application is memory sensitive and has RMA, there should be
> a relationship between memory locality and reduced performance.
>
>>>> So we have the following metrics:
>>>>
>>>> 1) sysload - CPU sensitive
>>
>> What do you mean by "sysload"?
>
> It's CPU utilization.
>
>>>> 2) LLC Miss per Instruction - memory sensitive
>>
>> So, is a memory sensitive thread one that has low or high LLC miss per
>> instruction?
>
> A memory sensitive thread has a high LLC miss per instruction.
>
>>>> After we figure out that the application is memory-sensitive, we'll check
>>>> the memory locality metrics to see what the cause of the performance
>>>> regression is.
>>
>> How will you do that? Do you mean that you will try to use the four
>> metrics that you have listed here to determine the cause?
>
> Right, the four metrics are able to determine whether the regression is
> caused by high RMA.
>
>>>> 3) LLC Latency Ratio (Average Latency for LLC Miss / Local Memory Access
>>>>    Latency)
>>
>> Will the latency for each LLC miss be measured then?
>
> No, we don't measure the latency for each LLC miss. Instead, we'll give a
> report of the memory latency distribution. This is based on PEBS work. Krish
> or Yao or Sam can give more details. This is sort of like quantize in dtrace;
> we can give the distribution per application or per thread.

The Intel processor has a load latency facility that can tag memory load
operations; a tagged load carries latency information. The facility tags load
operations at random, so this is a profiling mechanism and cannot measure the
latency of every LLC miss. However, this profiling method, built on the PEBS
feature, is enough to provide the latency distribution for the application's
memory accesses.

>> Is the local memory latency the *ideal* local memory latency when the system
>> is unloaded or the *current* local memory latency which may be higher than
>> the ideal because of load?
>
> If the memory traffic is under the bandwidth threshold, the local memory
> latency must be within a range. It is not an exact number; it can be an
> average or a range.

We plan to use the "current" local memory latency, which is easily obtained
with the load latency facility.

>>>> 4) Source distribution for LLC miss:
>>>>    4.1) LMA / (Total LLC Miss Retired) %
>>>>    4.2) RMA / (Total LLC Miss Retired) %
>>
>> Will these ratios be given for each NUMA node, the whole system, or both?
>
> We can give both.
>
>>>> Here, 4.2) could be separated into different percentages per NUMA node
>>>> hop.
>>
>> Do you mean that the total RMA will be broken down into the percentage of
>> remote memory accesses to each NUMA node from a given NUMA node?
>
> Breaking it down per individual NUMA node is not necessary if two nodes are
> the same number of hops from the home node. The total RMA can be broken down
> into percentages per NUMA node hop.

If we have the memory latency distribution for the application, we don't need
the percentage per NUMA node or per distance (hop count). The user will be
interested in why there is a high percentage of memory accesses whose latency
is 2x or 3x the local memory latency.

>>>> NUMAtop should have a useful report to show how effectively the
>>>> application is using local memory.
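As a quick worked example, the four metrics are just ratios of raw event
counts. All the numbers below are made up; on real hardware they would come
from the cpc counters and the PEBS load latency samples.

/*
 * Toy computation of the four NUMAtop metrics from made-up raw counts.
 */
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	double		cpu_util = 0.92;		/* 1) sysload */
	uint64_t	instr_retired = 500000000ULL;
	uint64_t	llc_miss = 6000000ULL;		/* total LLC misses retired */
	uint64_t	lma = 2400000ULL;		/* local memory accesses */
	uint64_t	rma = 3600000ULL;		/* remote memory accesses */
	double		avg_miss_lat = 310.0;		/* avg sampled LLC miss latency (ns) */
	double		local_lat = 100.0;		/* "current" local memory latency (ns) */

	double mpi = (double)llc_miss / instr_retired;	/* 2) LLC miss per instruction */
	double lat_ratio = avg_miss_lat / local_lat;	/* 3) LLC latency ratio */
	double lma_pct = 100.0 * lma / llc_miss;	/* 4.1) LMA% */
	double rma_pct = 100.0 * rma / llc_miss;	/* 4.2) RMA% */

	(void) printf("sysload           : %.0f%%\n", cpu_util * 100.0);
	(void) printf("LLC miss/instr    : %.6f\n", mpi);
	(void) printf("LLC latency ratio : %.2f\n", lat_ratio);
	(void) printf("LMA%% / RMA%%       : %.1f%% / %.1f%%\n", lma_pct, rma_pct);
	return (0);
}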
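And to make the "distribution like quantize in dtrace" idea concrete, here is
a minimal sketch that buckets sampled load latencies into power-of-two buckets,
in the spirit of DTrace quantize(). The sample values are invented; a real tool
would feed in the PEBS load latency samples instead.

/*
 * Power-of-two histogram of sampled load latencies (in core clocks).
 * Row b counts samples whose latency falls in [2^b, 2^(b+1)).
 */
#include <stdio.h>
#include <stdint.h>

#define	NBUCKETS	16

int
main(void)
{
	uint32_t samples[] = { 45, 60, 72, 90, 110, 130, 250, 260,
	    270, 300, 310, 480, 500, 520, 900, 1800 };
	uint64_t bucket[NBUCKETS] = { 0 };
	size_t i, nsamples = sizeof (samples) / sizeof (samples[0]);

	for (i = 0; i < nsamples; i++) {
		uint32_t v = samples[i];
		int b = 0;

		while (v > 1 && b < NBUCKETS - 1) {	/* floor(log2(latency)) */
			v >>= 1;
			b++;
		}
		bucket[b]++;
	}

	(void) printf("%12s  %s\n", "clocks >=", "count");
	for (i = 0; i < NBUCKETS; i++) {
		if (bucket[i] != 0)
			(void) printf("%12u  %llu\n", 1U << i,
			    (unsigned long long)bucket[i]);
	}
	return (0);
}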
>> I think that someone already pointed out that you don't seem to mention
>> anything about where the thread runs as part of your proposal, even
>> though that is pretty important in figuring out how effectively a thread
>> is using local memory. The thread won't be very effective at using local
>> memory if it never runs on CPUs where its local memory lives.
>
> Yeah, this is important and can easily be put into the report.
>
>> Also, the memory allocation policy may matter too. For example, a
>> thread may access remote memory a lot if it is accessing shared memory,
>> because the default memory allocation policy for shared memory is to
>> spread it out by allocating it randomly across lgroups.
>
> Right, that's exactly the part we need to figure out.
>
>>>> We need the PEBS framework to implement the metrics of NUMAtop. We need
>>>> the MPO sponsor and the libcpc dtrace provider sponsor to figure out
>>>> where it is not effective and why.
>>
>> Ok.
>
>>>> Suggesting a better memory placement strategy is also a valuable goal of
>>>> NUMAtop.
>>
>> How are you proposing to do that?
>>
>> Jonathan
>
> As you mentioned above, we'll analyze all the NUMA-related CPU and memory
> behavior in the kernel. Scheduling and memory allocation policy should be
> the key to the suggestion. For example:
>
> As for scheduling:
>
> - When dispatching a thread, is it always scheduled onto its home lgroup?
> - When choosing an idle CPU, does the scheduler look within the same lgroup
>   first?
> - Does cmt_balance() migrate threads within the home lgroup first?
> - If the home lgroup is not available, does it choose a CPU from the nearest
>   node?
>
> As for memory allocation:
>
> - Is the default policy NUMA-friendly enough for private memory and shared
>   memory?
>
> There will be a bunch of other questions when we dig into the kernel.
> DTrace will be a good friend in helping us find the answers and give
> suggestions to the customer.
>
> Thanks,
> -Aubrey

Thanks,
Zhihui
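P.S. As a rough user-level illustration of the placement question, the sketch
below checks whether a buffer's pages ended up on the calling thread's home
lgroup. It assumes the Solaris liblgrp and meminfo(2) interfaces (compile with
-llgrp); it is only a sketch with minimal error handling, not part of the
NUMAtop design.

/*
 * Count how many pages of a freshly touched buffer live on the calling
 * thread's home lgroup versus a remote lgroup.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/procset.h>
#include <sys/lgrp_user.h>

int
main(void)
{
	size_t pgsz = (size_t)sysconf(_SC_PAGESIZE);
	size_t npages = 16;
	char *buf = malloc(npages * pgsz);
	lgrp_id_t home = lgrp_home(P_LWPID, P_MYID);
	size_t i, local = 0, remote = 0;

	(void) memset(buf, 0, npages * pgsz);	/* touch so pages get allocated */

	for (i = 0; i < npages; i++) {
		uint64_t addr = (uint64_t)(uintptr_t)(buf + i * pgsz);
		uint_t req = MEMINFO_VLGRP;	/* lgroup owning this virtual page */
		uint64_t out = 0;
		uint_t valid = 0;

		if (meminfo(&addr, 1, &req, 1, &out, &valid) == 0 &&
		    (valid & 0x2)) {
			if ((lgrp_id_t)out == home)
				local++;
			else
				remote++;
		}
	}

	(void) printf("home lgroup %d: %u pages local, %u remote\n",
	    (int)home, (unsigned)local, (unsigned)remote);
	free(buf);
	return (0);
}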