Mon, 08 May 2000 Jeff Ridgway wrote:
> > Hello RTLinux community,
> 
> I am sure that many of you have computed the amount of time necessary to
> accomplish various tasks in real-time operations.  I am trying to factor in
> memory access. I have from a personal conversation that the typical memory
> bus runs at 100 MHz.

Well, yes, on newer PCs the FSB and the RAM can handle 100 MHz, at least for
sequential accesses. (Not 100% sure about random, single word at a time
access, though...) The bus is (still!?!) 64 bits, as opposed to PowerPC, Alpha
and others, which switched to 128 bit memory busses quite a while ago. I haven't
kept up with the latest developments, so I'm not sure about the bus width of
the new 133 MHz modules and the mainboards that use them.

For the average Pentium mainboard, the figure is more like 1/60 ns = ~17 MHz,
limited by the 60 ns SIMMs. Fortunately, the bus is 64 bits, not 32.

> Thus I am assuming that one can access a 32-bit word in 10 nano-seconds (i.e.
> access time about 1 byte per 2.5 nano-seconds).   Thus, to read/write from/to
> an array of 1 MB in RAM should take 2.5 milli-seconds. Does this figure stack
> up with your experience or am I way off the mark?

Yes, as long as you get the right figures for the bus, these calculations give
quite realistic results. (A 64-bit bus at 100 MHz has a theoretical peak of
about 800 MB/s, so a sequential pass over 1 MB takes on the order of 1.3 ms;
your 10 ns per 32-bit word works out to 400 MB/s, and thus roughly 2.5 ms for
1 MB.) However, there are some things to keep in mind that complicate matters
a great deal on some architectures:

  * Cache write behavior (write-through vs. write-back)

    Most Pentium systems don't handle write operations very well.
    If you write one byte at a time, you will end up with just one
    byte written every memory cycle! Actually, not even that, as
    the cache occasionally will have to fetch a line, so that it
    can assemble the memory words without having to do *two*
    accesses (read-modify-write) for every byte. You have to make
    sure the code writes 32 bit or 64 bit (MMX) words. If you're
    doing some byte level processing, you can use registers for the
    operations, and then write one full word after 4 or 8 "loops"
    (see the first sketch after this list).

    Reads are handled nicely, though, and you can usually get
    almost full bandwidth even with byte accesses.

  * CPU pipeline efficiency (can all cycles be utilized?)

    If you're doing more than copying data on a slow CPU, it's
    possible that the task changes from memory bound to CPU bound.
    The latter was practically *always* the case with 8 bit and
    older 16 bit CPUs, even when copying data, but these days,
    it's usually the other way around. You can do quite a few
    operations during a single memory cycle.

  * Alignment

    Unaligned reads and, in particular, unaligned writes should
    be avoided, as they cause a performance hit even on the latest
    CPUs. As with single byte/word accesses on longword/quadword
    busses, the problem is less severe for reads (as it's handled
    nicely by the cache read-ahead).

  * Limited cache associativity

    If you dereference a lot of pointers to different memory
    blocks during the "inner loop" of your program, you may hit
    the limit of how many hot areas the cache controller can keep
    track of, and thus limit its ability to read ahead and cache
    data efficiently. This may cause the memory random access
    penalty (as opposed to sequential access, as done when fetching
    a cache line) to severely impact the bandwidth. In such cases,
    it's often better to process fewer channels, objects, (...) at
    a time, even if it means that you have to store intermediate
    data in extra buffers (see the second sketch after this list).

    (Another problem that can cause similar effects under the same
    circumstances is the CPU running out of registers, forcing the
    compiler [or asm coder] to use temporary variables on the
    stack. Happens all the time with x86...)

  * Memory access modes

    Many memory technologies used today utilize various kinds of
    burst transfer modes to improve the bandwidth. This means that
    you can tell the RAM where to start reading or writing, and
    then transfer multiple memory words sequentially without
    additional address cycles. It's important to keep in mind that
    many modern systems have *very* different timings for random
    access and sequential transfers (the little test at the end of
    this post is one way to see the difference on a given machine).

  * Multiple busses, bridges...

    If you're going to access a video card, you have to take (at
    least) 4 things into account:

    1) DON'T read from the video RAM! Nearly all cards are very
       slow here, as they're optimized only for writing.

    2) The bus (ISA, PCI, AGP,...) may be a problematic
       bottleneck when doing direct access. Older machines with
       built-in ISA cards may even have 8 bit busses that *slow
       down* if you try to do 16 or 32 bit accesses! (I've seen
       this on some HP machines. Amazing...)

    3) The video RAM is usually faster than the main memory, at
       least on older machines, but this doesn't matter on
       anything but 286 class computers and older because of the
       bus bottleneck. And...

    4) ...the RAMDAC might use up quite a few of the memory
       cycles, and may also interfere with the video RAM/bus
       controller's ability to use burst transfers for greater
       bandwidth. If you use the hardware blitter together with
       CPU direct access, the blitter will also be in the fight
       for video memory cycles.

  * DMA

    PCI and AGP bus master DMA cards may steal significant amounts
    of memory bandwidth on some systems, such as the ones with those
    terrible integrated video chipsets, using system RAM for frame
    buffers...
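
Here's a minimal sketch of the "assemble in registers, write full
words" idea from the cache write behavior point above. The names and
the transform() operation are just placeholders for whatever per-byte
processing you're actually doing:

  /* transform() stands in for any per byte operation; the point is
   * the difference between the two write loops. "unsigned int" is
   * assumed to be 32 bits, as on x86. */
  static unsigned char transform(unsigned char b)
  {
      return (unsigned char)(b ^ 0x55);         /* dummy operation */
  }

  /* Slow: one byte written per iteration, which means partial word
   * (read-modify-write) traffic when the destination isn't cached. */
  void process_bytewise(const unsigned char *src, unsigned char *dst,
                        unsigned int n)
  {
      unsigned int i;
      for (i = 0; i < n; i++)
          dst[i] = transform(src[i]);
  }

  /* Better: build a full 32 bit word in a register and write it out
   * once every 4 bytes. Assumes n is a multiple of 4, dst is 32 bit
   * aligned, and a little endian CPU (x86). */
  void process_wordwise(const unsigned char *src, unsigned int *dst,
                        unsigned int n)
  {
      unsigned int i, w;
      for (i = 0; i < n; i += 4) {
          w  = (unsigned int)transform(src[i]);
          w |= (unsigned int)transform(src[i + 1]) << 8;
          w |= (unsigned int)transform(src[i + 2]) << 16;
          w |= (unsigned int)transform(src[i + 3]) << 24;
          *dst++ = w;           /* one aligned 32 bit write */
      }
  }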
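
And a rough sketch of the "process fewer channels at a time" idea
from the associativity point; the channel count, block size and the
mixing operation are arbitrary example values:

  #define CHANNELS  32
  #define BLOCK     256         /* samples per block; keep it small */

  /* Mix CHANNELS input buffers into one output buffer, BLOCK samples
   * at a time, so the inner loops only ever touch two streams (one
   * input channel plus the small accumulator) instead of CHANNELS + 1
   * scattered buffers at once. */
  void mix_all(float *out, float *in[CHANNELS], int frames)
  {
      float acc[BLOCK];         /* small hot buffer, stays in cache */
      int start, c, i, len;

      for (start = 0; start < frames; start += BLOCK) {
          len = frames - start;
          if (len > BLOCK)
              len = BLOCK;

          for (i = 0; i < len; i++)
              acc[i] = 0.0f;

          for (c = 0; c < CHANNELS; c++)
              for (i = 0; i < len; i++)
                  acc[i] += in[c][start + i];

          for (i = 0; i < len; i++)
              out[start + i] = acc[i];
      }
  }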


Uhm, long post again... Hopefully not too much misinformation, though. ;-)

Anyway, in many cases you can get a rough idea about the performance if you
know the hardware well, and the information above could be a starting point
when optimizing code, but the only practical way to get reliable figures is to
test the code on the target platform.
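
For example, something along these lines (a plain user space sketch,
not an RTLinux thread; buffer size, pass count and stride are
arbitrary values) gives ballpark figures for sequential vs. scattered
reads on a particular machine:

  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/time.h>

  #define WORDS   (1024 * 1024)         /* 4 MB of 32 bit words */
  #define PASSES  16
  #define STRIDE  4001                  /* odd, so every word is hit */

  static double now(void)
  {
      struct timeval tv;
      gettimeofday(&tv, NULL);
      return tv.tv_sec + tv.tv_usec / 1000000.0;
  }

  int main(void)
  {
      unsigned int *buf = malloc(WORDS * sizeof(unsigned int));
      unsigned int sum = 0;
      unsigned int i, p, idx;
      double t;
      double mb = (double)WORDS * sizeof(unsigned int) * PASSES
                  / (1024.0 * 1024.0);

      if (!buf)
          return 1;
      for (i = 0; i < WORDS; i++)       /* touch all pages first */
          buf[i] = i;

      t = now();
      for (p = 0; p < PASSES; p++)
          for (i = 0; i < WORDS; i++)
              sum += buf[i];
      t = now() - t;
      printf("sequential: %.1f MB/s\n", mb / t);

      t = now();
      for (p = 0; p < PASSES; p++) {
          idx = 0;
          for (i = 0; i < WORDS; i++) {
              sum += buf[idx];
              idx = (idx + STRIDE) & (WORDS - 1);
          }
      }
      t = now() - t;
      printf("scattered:  %.1f MB/s\n", mb / t);

      printf("(checksum %u)\n", sum);   /* keep the compiler honest */
      free(buf);
      return 0;
  }

On most machines the scattered figure comes out far below the
sequential one; that's the random vs. burst access difference
mentioned above. gettimeofday() resolution and other activity on the
system limit the accuracy, so don't trust the last digit.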


David Olofson
 Programmer
 Reologica Instruments AB
 [EMAIL PROTECTED]

.- M u C o S --------------------------------. .- David Olofson ------.
|           A Free/Open Multimedia           | |     Audio Hacker     |
|      Plugin and Integration Standard       | |    Linux Advocate    |
`------------> http://www.linuxdj.com/mucos -' | Open Source Advocate |
.- A u d i a l i t y ------------------------. |        Singer        |
|  Rock Solid Low Latency Signal Processing  | |      Songwriter      |
`---> http://www.angelfire.com/or/audiality -' `-> [EMAIL PROTECTED] -'

