
20. Measuring CPU Scheduling Latency

Once latency due to loss repair has been removed from a messaging system, the next largest source is often CPU scheduling latency. This happens when a process is ready to run but cannot get a CPU on which to run. CPU scheduling latency usually has both a fixed and a variable component. The remainder of this section describes tests we've run to better understand this latency in messaging systems.
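The effect is easy to observe in isolation with a simple wake-up test: sleep for a fixed interval in a loop and record how much later than requested the process actually wakes up. The overshoot approximates the time a runnable process spends waiting for a CPU. The sketch below is not the harness used in our tests; it is a minimal illustration that assumes POSIX clock_gettime() and nanosleep() (on Solaris, gethrtime() could be used instead).

    /* sched_lat.c - minimal sketch: measure CPU scheduling latency as the
     * overshoot of a requested 1 ms sleep.  Illustrative only.
     */
    #include <stdio.h>
    #include <time.h>

    #define NSEC_PER_SEC 1000000000LL
    #define SLEEP_NSEC   1000000LL      /* request a 1 ms sleep */
    #define ITERATIONS   10000

    static long long now_ns(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return (long long)ts.tv_sec * NSEC_PER_SEC + ts.tv_nsec;
    }

    int main(void)
    {
        struct timespec req = { 0, SLEEP_NSEC };
        long long worst = 0, total = 0;

        for (int i = 0; i < ITERATIONS; i++) {
            long long before = now_ns();
            nanosleep(&req, NULL);
            /* anything beyond the requested sleep is wake-up delay */
            long long overshoot = (now_ns() - before) - SLEEP_NSEC;

            total += overshoot;
            if (overshoot > worst)
                worst = overshoot;
        }

        printf("avg wake-up overshoot: %lld us, worst: %lld us\n",
               total / ITERATIONS / 1000, worst / 1000);
        return 0;
    }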

We ran a series of tests in our Latency Busters® Lab using Solaris 2.8 on a pair of SPARC machines. One machine ran 5 LBM sources, each producing a stream of 2,000-byte messages at a rate of 10 messages per second. The other machine ran 10 LBM receivers, each consuming the messages from all 5 sources. Hence the aggregate payload rate across all sources was 100,000 bytes per second, and the aggregate payload rate across all receivers was 1,000,000 bytes per second. Our LBT-RM reliable multicast protocol was used, so the network bandwidth between the machines was only 100,000 bytes per second plus protocol overhead.

We chose a relatively low message rate with a modest payload size for these tests to emphasize latency over throughput. Note that the 2,000-byte message payload, together with the 1,500-byte MTU of Ethernet, forces LBM to break every message into two Ethernet packets. This magnifies the effect of network loss on latency because the loss of either of the two packets delays delivery of the whole message.
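The fragmentation arithmetic is sketched below. The IP and UDP header sizes are standard; the transport header size is an assumed value used only for illustration, since the exact LBT-RM header length is not given here.

    /* frag_count.c - sketch of why a 2,000-byte payload needs two Ethernet
     * frames.  The protocol header size is an illustrative assumption.
     */
    #include <stdio.h>

    int main(void)
    {
        int mtu        = 1500;      /* Ethernet MTU                       */
        int ip_udp_hdr = 20 + 8;    /* IPv4 + UDP headers                 */
        int proto_hdr  = 20;        /* assumed transport header size      */
        int payload    = 2000;      /* message payload                    */

        int per_frame  = mtu - ip_udp_hdr - proto_hdr;
        int frames     = (payload + per_frame - 1) / per_frame;  /* ceiling */

        printf("usable bytes per frame: %d, frames per message: %d\n",
               per_frame, frames);  /* prints 1452 and 2 */
        return 0;
    }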

Our first test run was meant to establish a baseline for further tests. We were particularly interested in the maximum latency measured over a long-running (20-minute) test. We measured a maximum latency of about 86 ms even though the average was only 1.8 ms. The plot below shows the latency measured for each of the 600,000 messages received during this test.

Figure 4. Baseline Latency Measurement

Only 1 in 1,000 messages had latency over 5.6 ms; however, the plot above makes it apparent that many messages had much higher latencies.
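The figures quoted above (average, maximum, and the 1-in-1,000 point, i.e. the 99.9th percentile) can be computed from the per-message latency samples with something like the sketch below; the sample values in main() are placeholders, not measured data.

    /* lat_stats.c - sketch of the statistics quoted in the text: mean,
     * 99.9th percentile, and maximum over per-message latency samples.
     */
    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_dbl(const void *a, const void *b)
    {
        double x = *(const double *)a, y = *(const double *)b;
        return (x > y) - (x < y);
    }

    static void report(double *lat_ms, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++)
            sum += lat_ms[i];

        qsort(lat_ms, n, sizeof(double), cmp_dbl);      /* ascending order */

        size_t p999 = (size_t)(0.999 * (double)(n - 1)); /* 99.9%ile index */
        printf("mean %.2f ms, 99.9%%ile %.2f ms, max %.2f ms\n",
               sum / (double)n, lat_ms[p999], lat_ms[n - 1]);
    }

    int main(void)
    {
        /* placeholder samples, not data from the tests described above */
        double sample[] = { 1.2, 1.8, 2.1, 1.5, 86.0, 1.7, 5.6, 1.9 };
        report(sample, sizeof(sample) / sizeof(sample[0]));
        return 0;
    }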

Working on the assumption that CPU scheduling latency might be a significant factor in the large latency variance observed in the baseline, we tried increasing the CPU scheduling priority of the LBM processes. We used the command prefix nice --19 to elevate the priority of all LBM source and receiver processes to the maximum allowable under the Solaris "time sharing" scheduling class. Surprisingly, this had little effect on latency. The maximum measured latency remained unchanged. The plot below shows this.

Figure 5. Latency with Maximum Time Sharing Priority

The lack of a significant change led us to conclude that if CPU scheduling latency was the cause of the large maximum latency, it wasn't due to CPU contention with other user time-sharing processes on the machine.
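For reference, the same priority elevation can be requested from inside a process rather than with a command prefix. The sketch below uses the POSIX setpriority() call with a nice value of -19; like the nice --19 prefix, it requires sufficient privilege and only adjusts the nice value used within the time sharing class.

    /* raise_ts_prio.c - sketch of raising a process's time-sharing priority
     * from inside the program, roughly what the "nice --19" prefix does.
     */
    #include <stdio.h>
    #include <sys/resource.h>

    int main(void)
    {
        /* Nice value -19 is near the most favorable setting; 0 is default. */
        if (setpriority(PRIO_PROCESS, 0, -19) != 0) {
            perror("setpriority");
            return 1;
        }
        printf("nice value now %d\n", getpriority(PRIO_PROCESS, 0));
        /* ... messaging work would run here at elevated priority ... */
        return 0;
    }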

This led us to try a CPU scheduling class offered by Solaris called "real time." We used the command prefix priocntl -c RT -e to request real-time scheduling for all LBM processes. This had a dramatic effect on the maximum latency, reducing it more than 16-fold to just 5.2 ms. The plot below shows this.

Figure 6. Latency with Real Time Priority

The seemingly random incidence of high-latency messages disappeared completely when LBM ran with real-time priority. We conclude that CPU scheduling latency was the primary cause of the large maximum latency we saw in the earlier tests. However, the source of CPU contention seems to have been something within the Solaris kernel, since even the maximum user-level time-sharing priority produced no reduction in latency.
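For completeness, real-time scheduling can also be requested from inside a process. The sketch below uses the portable sched_setscheduler() interface; on Solaris, the SCHED_FIFO and SCHED_RR policies map onto the RT scheduling class that priocntl manipulates. Root privilege (or equivalent) is required, and a spinning real-time process can lock out the rest of the system, so this should be used with care.

    /* rt_prio.c - sketch of requesting real-time scheduling from within a
     * process, the programmatic counterpart of the "priocntl -c RT -e"
     * command prefix.
     */
    #include <stdio.h>
    #include <sched.h>

    int main(void)
    {
        struct sched_param sp;

        /* lowest real-time priority is still above all time-sharing work */
        sp.sched_priority = sched_get_priority_min(SCHED_FIFO);
        if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        printf("now running in a real-time scheduling class\n");
        /* ... latency-sensitive messaging work would run here ... */
        return 0;
    }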

Copyright 2004 - 2009 29West, Inc.

