http://developers.sun.com/solaris/articles/optimizing_apps.html

Optimizing Applications with Large Working Sets in the Solaris 8 and 9 OS


 
By Ezhilan Narasimhan and Nagendra Nagarajayya, February 2002  

Introduction

Applications that operate on large chunks of memory and are performance sensitive, such as speech recognition and portfolio management, can benefit from certain optimizations, with performance gains of as much as 800 percent. The Solaris Operating Environment (OE) provides various facilities to help developers profile, optimize, and tune their applications. This paper presents techniques and sample code showing how to profile your application using the CPU Performance Counter library (CPC), and how to use an alternative memory allocation facility, Intimate Shared Memory (ISM), to take advantage of your application's memory access characteristics.

Problem Explanation

Memory-intensive applications spend most of their time within the CPU working on memory that has either been fetched by an I/O operation or loaded at startup. At a lower level, the CPU spends its time doing one of the following:

  1. Address translation
  2. Load/Store from on-chip L1 cache
  3. Load/Store from off-chip L2 cache
  4. Integer/Float operations
  5. Branch operations
  6. Pipeline stalls

Ideally, you want your CPU to spend its time working on data and instructions from the L1 cache to complete an operation in the least amount of time. Excessive address translations, load/stores, and pipeline stalls cause the CPU to be less efficient, thus affecting the performance of your application.

During data access, each virtual memory address must be translated into a physical memory address. Because this translation is expensive in CPU cycles, a hardware Translation Lookaside Buffer (TLB) caches virtual-to-physical mappings. If a virtual address cannot be located in the TLB, a "TLB miss" occurs and incurs a performance hit. Because the TLB is implemented in hardware, it can hold only a limited number of entries at any one time; however, a software-managed Translation Storage Buffer (TSB) can cache many more entries than the TLB.

Applications can suffer a very high TLB miss rate, depending on their data access pattern and the size of the working set. Each TLB entry maps one page of memory, which is 8 KB by default. On the UltraSPARC III processor, whose TLB has 512 entries, the reach of the TLB is 512 * 8 KB = 4 MB. A second, 16-entry TLB is also available on the UltraSPARC III processor.

How can you tell if your application is executing efficiently, without excessive TLB misses?

Profiling Your Application

The Solaris 8 OE is bundled with tools and libraries to measure key metrics such as instruction misses, cache misses, and user and system events via hardware counters available on both the UltraSPARC and x86 processor families. To illustrate our profiling techniques, we've included a sample application that fetches data from random regions of memory and performs a simple computation. Here is a code snippet from the sample application:

// declare three float arrays (random regions of which will be used)
float *a;
float *b;
float *c;

// allocate the arrays
a = malloc(A_ROWS * A_SIZE * sizeof(float));
b = malloc(B_ROWS * B_SIZE * sizeof(float));
c = malloc(C_ROWS * C_SIZE * sizeof(float));

// operate on the selected region of data
for (i = 0; i < IDX_SIZE; i++) {
        f1 = &a[ARRAYPOS(idx[i])];
        f2 = &b[ARRAYPOS(idx[i])];
        f3 = &c[ARRAYPOS(idx[i])];
        for (j = 0; j < A_SIZE; j++) {
                tmp -= (f1[j] * f1[j] + f2[j] * f2[j] - f3[j] * f3[j]);
        }
}

The complete program is available for download; instructions for building it appear at the end of this article. As shown above, we chose three float arrays and performed floating-point multiplication on a random region within each array. There are several ways to obtain the runtime statistics of an application.

The UltraSPARC platform offers hardware counters as a non-intrusive, low-cost way of obtaining performance statistics for an application. The Solaris 8 OE provides the cputrack and cpustat utilities, which can be used to read these counters. The CPU Performance Counter (CPC) library, libcpc.so, comes bundled with the Solaris 8 platform and provides programmatic access to these counters via a C API. The Analyzer in Forte Developer software makes use of this API and provides a graphical front end for ease of use. Our sample program uses this API to demonstrate how to measure performance and decide whether the application needs optimizing and tuning.

Figure 1. Data Access Process

We can obtain the data TLB misses for our sample program using cputrack as follows.

bash-2.03# cputrack -T 5 -c "pic0=Cycle_cnt,pic1=DTLB_miss" ./problem_noism

   time lwp      event      pic0      pic1
  5.013   1       tick 860024090  16312792
 10.020   1       tick 1037678971 259355115
 15.015   1       tick 1935944281 281515849
 20.013   1       tick 1919678451 279156713
 25.015   1       tick 1939066774 282090670
 30.018   1       tick 1939571639 282109224
 35.022   1       tick 1934818601 281518791
 40.013   1       tick 1931992898 280934029
 45.018   1       tick 1942837811 282686173
 50.022   1       tick 1920288530 279174285
 55.020   1       tick 1922443465 279760498

Notice how we obtain the DTLB_miss counter (pic1) every 5 seconds. We can read at most two counters during one run of cputrack; to obtain readings from more than two counters, we can use the CPC API, as illustrated in the sample program. To read multiple counters, we simply rebind the counters to different events and take samples multiple times. Here is a snippet showing how to assign and read a single event (data TLB misses):

// before the method to be instrumented:
// bind pic0/pic1 to the events of interest
cpc_strtoevent(cpuver, "pic1=DTLB_miss,pic0=Cycle_cnt", &event);
cpc_bind_event(&event, 0);
// take a sample before
cpc_take_sample(&before);

// call the method to be instrumented
doit();

// take a sample after
cpc_take_sample(&after);

// counter deltas over the call to doit()
pic0 = after.ce_pic[0] - before.ce_pic[0];
pic1 = after.ce_pic[1] - before.ce_pic[1];
dtlbmiss = pic1;

We now run our sample application to print several runtime statistics as shown:

bash-2.03# ./problem_noism
CPI = 6.885505      <- cycles per instruction
D-Cache Miss Rate = 0.000058   <- L1 cache miss
L2 miss rate = 0.024860   <- L2 cache miss
FP Stall rate = 0.286592    <- float pipeline stalls
DTLB miss rate = 0.299275    <- TLB misses rate (30%)
FLOPS = 2157.783478  <-  float operations per second
elapsed time = 85.420990 secs  <- total time to run

Notice that about 30 percent of the address translations were not found in the TLB. The total cost of TLB misses can amount to a large percentage of a program's elapsed time. For example, if our program had 259355115 TLB misses in total (taken from the cputrack example above for this same program), the cost to handle this many misses on a 750 MHz UltraSPARC III processor would be:

(Approximately 90 cycles are required to handle a TLB miss on an UltraSPARC III processor.)

cost of TLB misses (in secs) = (misses * 90) / (750 * 10^6)
cost for 259355115 misses = (259355115 * 90) / (750 * 10^6) = 31.12 secs

For a total run time of 85 seconds, the program spends 31 seconds handling TLB misses, making this situation very inefficient.

Intimate Shared Memory (ISM)

The Solaris ISM facility allows applications to use 4 MB pages instead of the default 8 KB pages, increasing the reach of the TLB and allowing applications to access a larger working set without incurring the cost of TLB misses. With 4 MB pages, the UltraSPARC III processor mentioned earlier, with its 512-entry TLB, would have a reach of 512 * 4 MB = 2 GB. You will need to modify the /etc/system file to support shared memory; see the "Sample Program" section for an example.

To demonstrate the improvement in performance, we modify our sample program to use ISM, replacing malloc() with shmget() and shmat().

// replace calls to malloc() with shmget()/shmat()
// allocate the a, b, and c float arrays from ISM
// (SHM_SHARE_MMU requests an ISM mapping)

        aid = shmget(getpid() + 0, A_ROWS*A_SIZE*sizeof(float), IPC_CREAT | 0600);
        a = shmat(aid, (void *)0, SHM_SHARE_MMU);
        bid = shmget(getpid() + 1, B_ROWS*B_SIZE*sizeof(float), IPC_CREAT | 0600);
        b = shmat(bid, (void *)0, SHM_SHARE_MMU);
        cid = shmget(getpid() + 2, C_ROWS*C_SIZE*sizeof(float), IPC_CREAT | 0600);
        c = shmat(cid, (void *)0, SHM_SHARE_MMU);

We now recompile with the -DISM flag and rerun the program, modified to use ISM for the float arrays:

bash-2.03# ./problem_ism
hardware identifier 1002
CPI = 2.592287 
D-Cache Miss Rate = 0.000022 
L2 miss rate = 0.012475 
FP Stall rate = 0.413361 
DTLB miss rate = 0.000314 
FLOPS = 12213.132359 
elapsed time = 15.091951 secs

Notice the dramatic improvement in the performance of the application, with the DTLB miss rate almost eliminated. Because of the near-zero miss rate, we are also able to accomplish more float operations: the program achieves almost 6 times the floating-point throughput and completes in about 17 percent of the original time, a performance increase by a factor of almost 6.

Fine Print

Because ISM memory is locked in physical RAM, requesting an ISM segment larger than physical RAM fails. Unlike regular shared memory, which can be swapped, ISM memory is never swapped out. Allocating larger page sizes can also result in higher memory fragmentation.

Conclusion

The Solaris OE provides utilities, libraries, and system facilities that allow software developers to profile and tune their applications. Accessing hardware counters via cputrack, cpustat, or programmatically through the CPU Performance Counter library (libcpc.so) can help you profile your application, while Intimate Shared Memory (ISM) can speed up applications that spend a large fraction of their total runtime in TLB miss handlers. The Solaris 9 platform supports multiple page sizes by default, eliminating the need to use ISM and to modify /etc/system to support shared memory.

Sample Program

The sample program, optimizing_apps_sample.c, is available for download.

To use this facility, you first have to modify the /etc/system file and add the following kernel parameters indicating shared memory values. You will need to reboot after this modification.

	set shmsys:shminfo_shmmax=4294967295
	set shmsys:shminfo_shmmin=1
	set shmsys:shminfo_shmmni=512
	set shmsys:shminfo_shmseg=512

To compile with regular malloc:
cc -fast -o problem_noism optimizing_apps_sample.c -lcpc

To compile using ISM:
cc -fast -DISM -o problem_ism optimizing_apps_sample.c -lcpc

References

The SPARC Architecture Manual, Version 9

Techniques for Optimizing Applications: High Performance Computing by Rajat P. Garg and Ilya Sharapov

Delivering Performance on Sun: Optimization for Solaris

Forte Developer Performance Tools by Marty Itzkowitz

Performance Analysis and Monitoring Using Hardware Counters by Frederic Pariente

About the Authors

Ezhilan Narasimhan is a member of the Market Development Engineering group at Sun Microsystems, Inc. He works with ISVs on performance engineering, database design, and application architecture.

Nagendra Nagarajayya is also a member of the Market Development Engineering group at Sun Microsystems, Inc. He works with ISVs in the telco and retail segments on issues relating to architecture, sizing, performance tuning, and benchmarking.

