http://developers.sun.com/solaris/articles/optimizing_apps.html
Introduction

Applications that operate on large chunks of memory and are performance sensitive, such as speech recognition and portfolio management, can benefit from certain types of optimizations and yield performance gains of as much as 800 percent. The Solaris Operating Environment (OE) provides various facilities to help developers profile, optimize, and tune their applications. This paper introduces techniques and sample code showing how to profile your application using the CPU Performance Counter Library (CPC) and how to use an alternative memory facility called Intimate Shared Memory (ISM) to take advantage of your application's memory characteristics.

Problem Explanation

Memory-intensive applications spend most of their time within the CPU, working on memory that has either been fetched by an I/O operation or loaded at startup. At a lower level, the CPU spends its time doing one of the following: working on data and instructions already in its caches, translating virtual addresses to physical addresses, performing loads and stores, or stalling in the pipeline.
Ideally, you want your CPU to spend its time working on data and instructions from the L1 cache, completing each operation in the least amount of time. Excessive address translations, loads/stores, and pipeline stalls make the CPU less efficient, hurting the performance of your application.

During data access, each virtual memory address must be translated into a physical memory address. Because this translation costs a significant number of CPU cycles, a hardware Translation Lookaside Buffer (TLB) exists to cache virtual-to-physical mappings. If a virtual address cannot be located within the TLB, the result is a "TLB miss," which incurs a performance hit. Since the TLB is hardware based, it can hold only a limited number of entries at any one time; however, there is also an in-memory Translation Storage Buffer (TSB), which can cache many more entries than the TLB. Applications can suffer very high TLB miss rates, depending on their data access pattern and the size of the working set. Each entry in the TLB refers to one page in memory, which is 8 KB by default. On the UltraSPARC III processor, which has a TLB with 512 entries, the reach of the TLB would be 512 * 8 KB = 4 MB. A second TLB with 16 entries is also available on the UltraSPARC III processor.

How can you tell if your application is executing efficiently, without excessive TLB misses?

Profiling Your Application

The Solaris 8 OE is bundled with tools and libraries to measure key metrics such as instruction misses, cache misses, and user and system events via hardware counters available on both the UltraSPARC and x86 processor families. To illustrate our profiling techniques, we've included a sample application that fetches data from random regions of memory and performs a simple computation. Here is a code snippet from the sample application:

// declare three float arrays (random regions of which will be used)
float * a;
float * b;
float * c;
// allocate the arrays
a = malloc(A_ROWS*A_SIZE*sizeof(float));
b = malloc(B_ROWS*B_SIZE*sizeof(float));
c = malloc(C_ROWS*C_SIZE*sizeof(float));
// operate on the selected region of data.
for (i = 0; i < IDX_SIZE; i++) {
    f1 = (float *)&a[ARRAYPOS(idx[i])];
    f2 = (float *)&b[ARRAYPOS(idx[i])];
    f3 = (float *)&c[ARRAYPOS(idx[i])];
    for (j = 0; j < A_SIZE; j++) {
        tmp -= (f1[j] * f1[j] + f2[j] * f2[j] - f3[j] * f3[j]);
    }
}
The complete program is available for download, with instructions on how to build the application, at the end of this article. As can be seen above, we chose three float arrays and performed float multiplication on random regions within these arrays. There are several methods to obtain the runtime statistics of an application. The UltraSPARC platform offers hardware counters as a non-intrusive, low-cost way of obtaining performance statistics, and the Solaris 8 OE provides the cputrack(1) utility and the libcpc library to access them.
We can obtain the data TLB misses for our sample program using cputrack:

bash-2.03# cputrack -T 5 -c "pic0=Cycle_cnt,pic1=DTLB_miss" ./problem_noism
   time lwp      event       pic0        pic1
  5.013   1       tick  860024090    16312792
 10.020   1       tick 1037678971   259355115
 15.015   1       tick 1935944281   281515849
 20.013   1       tick 1919678451   279156713
 25.015   1       tick 1939066774   282090670
 30.018   1       tick 1939571639   282109224
 35.022   1       tick 1934818601   281518791
 40.013   1       tick 1931992898   280934029
 45.018   1       tick 1942837811   282686173
 50.022   1       tick 1920288530   279174285
 55.020   1       tick 1922443465   279760498
We obtain the counter values programmatically with the libcpc interfaces, sampling before and after the method to be instrumented:

// before the method to be instrumented
cpc_strtoevent(cpuver, "pic1=DTLB_miss,pic0=Cycle_cnt", &event);
cpc_bind_event(&event, 0);

// take a sample
cpc_take_sample(&before);

// call the method
doit();

// take a sample after
cpc_take_sample(&after);

pic0 = (after.ce_pic[0] - before.ce_pic[0]);
pic1 = (after.ce_pic[1] - before.ce_pic[1]);
dtlbmiss = pic1;

We now run our sample application to print several runtime statistics:

bash-2.03# ./problem_noism
CPI = 6.885505                <- cycles per instruction
D-Cache Miss Rate = 0.000058  <- L1 cache miss rate
L2 miss rate = 0.024860       <- L2 cache miss rate
FP Stall rate = 0.286592      <- float pipeline stalls
DTLB miss rate = 0.299275     <- TLB miss rate (30%)
FLOPS = 2157.783478           <- float operations per second
elapsed time = 85.420990 secs <- total time to run

Notice that about 30 percent of the address translations were not found in the TLB. The total cost of TLB misses in a program can amount to a large percentage of the elapsed time. For example, our program had an absolute DTLB miss count of 259355115 (taken from the cputrack output above), and approximately 90 cycles are required to handle a TLB miss on an UltraSPARC III processor. Assuming a 750 MHz clock:

cost of TLB misses (in secs) = (misses * 90) / (750 * 10^6)
cost for 259355115 misses    = (259355115 * 90) / (750 * 10^6) = 31.12 secs

For a total run time of 85 seconds, the program spends 31 seconds handling TLB misses, which is very inefficient.

Intimate Shared Memory (ISM)
The Solaris ISM facility allows applications to use 4 MB pages instead of the default 8 KB pages, increasing the reach of the TLB and allowing applications to access a larger working set without incurring the cost of TLB misses. With 4 MB pages, the UltraSPARC III processor mentioned earlier, with its 512-entry TLB, would have a reach of 512 * 4 MB = 2 GB. To use ISM, you will need to make modifications to the /etc/system file (shown in the Sample Program section below) and to your application.
To demonstrate the improvement in performance, we modify our sample program to use ISM memory, replacing the calls to malloc() with shmget() and shmat():
// allocate the a, b, and c float arrays from ISM memory
aid = shmget(getpid()+0, A_ROWS*A_SIZE*sizeof(float), IPC_CREAT | 0600);
a = shmat(aid, (void *)0, SHM_SHARE_MMU);
bid = shmget(getpid()+1, B_ROWS*B_SIZE*sizeof(float), IPC_CREAT | 0600);
b = shmat(bid, (void *)0, SHM_SHARE_MMU);
cid = shmget(getpid()+2, C_ROWS*C_SIZE*sizeof(float), IPC_CREAT | 0600);
c = shmat(cid, (void *)0, SHM_SHARE_MMU);
We now recompile using the -DISM flag and rerun the program, now modified to use ISM memory for the float arrays:

bash-2.03# ./problem_ism
hardware identifier 1002
CPI = 2.592287
D-Cache Miss Rate = 0.000022
L2 miss rate = 0.012475
FP Stall rate = 0.413361
DTLB miss rate = 0.000314
FLOPS = 12213.132359
elapsed time = 15.091951 secs

Notice the dramatic change in the performance of the application: the DTLB miss rate has been almost eliminated. Because of the near-zero misses, we also accomplish more float operations per second. The FLOPS rate is almost 6 times higher, and the program finishes in about 18 percent of the original elapsed time (15.1 seconds versus 85.4), a speedup of more than 5x.

Fine Print

Since ISM memory is locked in physical RAM, allocating more ISM than the physical RAM size will fail. Unlike regular shared memory, which can participate in swapping, ISM memory is never swapped out. Allocating larger pages can also result in higher fragmentation.

Conclusion

The Solaris OE provides utilities, libraries, and system facilities
to allow Solaris software developers to profile and tune their applications. Accessing hardware counters via cputrack and libcpc is a low-cost, non-intrusive way to find bottlenecks such as excessive TLB misses, and facilities such as ISM can then be used to remove them.

Sample Program

You can download the sample program and build it as described below.
To use this facility, you first have to add the following shared memory settings to the /etc/system file and reboot:

set shmsys:shminfo_shmmax=4294967295
set shmsys:shminfo_shmmin=1
set shmsys:shminfo_shmmni=512
set shmsys:shminfo_shmseg=512
To compile with regular malloc:
To compile using ISM:
About the Authors

Ezhilan Narasimhan is a member of the Market Development Engineering group at Sun Microsystems, Inc. He works with ISVs on performance engineering, database design, and application architecture.

Nagendra Nagarajayya is also a member of the Market Development Engineering group at Sun Microsystems, Inc. He works with ISVs in the telco and retail segments on issues relating to architecture, sizing, performance tuning, and benchmarking.

