For the benefit of those who, like myself, still find the whole question
of cache-DMA interactions a bit confusing, here is a summary of what I've
learned.  Implicit is that we are writing for a least-common-denominator
architecture.  On the x86 none of these restrictions apply and most of
the considerations are unnecessary.  (Does anybody still worry about ISA
DMA and its restriction to the first 16 MB of memory?)

Also implicit is that we are discussing I/O of kernel data.  Data that
will be passed directly to a user process should be treated differently.  
Finally, I will use the term "region" to refer to a block of memory
returned by kmalloc().

The first restriction is that some architectures (which ones? -- I don't
know) are unable to perform DMA to addresses on the stack.  So all I/O
buffers _must_ be allocated using kmalloc() or something similar; they
_cannot_ be automatic local variables.  (What about static allocation?)
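For concreteness, here's the allocation pattern in miniature.  In the
kernel the call would be kmalloc(size, GFP_KERNEL); malloc() stands in
below only so the sketch compiles outside the kernel, and the names are
mine:

```c
#include <stdlib.h>
#include <string.h>

#define BUF_SIZE 64

/* WRONG on the affected architectures: an automatic local lives on
 * the stack, which the DMA engine may not be able to reach.
 *
 *     char buf[BUF_SIZE];            // stack buffer -- no DMA here
 *     start_input(buf, BUF_SIZE);
 */

/* RIGHT: allocate the buffer from the heap.  In the kernel this
 * would be kmalloc(BUF_SIZE, GFP_KERNEL); malloc() is a user-space
 * stand-in. */
char *alloc_dma_buffer(void)
{
    char *buf = malloc(BUF_SIZE);
    if (buf)
        memset(buf, 0, BUF_SIZE);
    return buf;
}
```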

The second restriction is that some architectures don't maintain their
cache coherently when DMA takes place.  This means that input buffers
_must_ be allocated in a region containing no data the CPU will touch
during the input operation.  For if the CPU touches data D lying in the
same cacheline as input buffer B while the input operation is underway,
the old contents of B will get loaded into the cache along with D.  
Then after the input completes, when the CPU tries to read the new data in
B it will see the stale contents in the cache instead.
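A minimal sketch of the hazard (the member names and the 64-byte line
size are my assumptions): offsetof() shows that the status flag lands
in the same cacheline as the input buffer, which is exactly the layout
to avoid.

```c
#include <stddef.h>

#define CACHELINE 64            /* assumed cacheline size */

/* Hazardous layout: 'status' shares a cacheline with 'in_buf'.  If
 * the CPU reads 'status' while DMA is filling 'in_buf', the stale
 * buffer contents get pulled into the cache alongside it. */
struct bad_layout {
    char in_buf[16];            /* DMA input target (B in the text) */
    int  status;                /* data the CPU touches (D) */
};

/* Nonzero if the two members fall in the same cacheline, assuming
 * the structure itself starts on a line boundary. */
static int members_share_line(void)
{
    return offsetof(struct bad_layout, in_buf) / CACHELINE ==
           offsetof(struct bad_layout, status) / CACHELINE;
}
```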

If these two restrictions are obeyed then things will work correctly.  But 
there are additional considerations involving efficiency.  Each time an 
input operation completes, a cache miss necessarily occurs the first time 
the CPU tries to read the data.  On some architectures, the cache contents 
for output buffers are invalidated when the output takes place, so the 
next time the buffer is accessed will also incur a cache miss.  These 
misses are expensive and to be avoided if possible.

With input buffers, there's not much you can do.  If you have several
input buffers, and the data and input operations for them have
non-overlapping lifetimes, then there's no reason not to allocate them in
the same region organized like a C union -- starting at the beginning of
the region and using the same storage addresses.  In fact, this is to be
encouraged because it minimizes cache utilization, although it might
tend to make the code a little less clear.
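Sketched as an actual C union (the buffer names and sizes are invented
for illustration -- the point is just that both members reuse the same
storage and hence the same cachelines):

```c
/* Input buffers with non-overlapping lifetimes sharing one region.
 * Only one member is "live" at any moment, so both start at the
 * beginning of the region and occupy the same cachelines. */
union input_bufs {
    unsigned char setup_reply[8];   /* live only during device setup */
    unsigned char bulk_data[64];    /* live only after setup is done */
};
```

The union's size is that of its largest member, so the region is no
bigger than the single largest buffer would have been on its own.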

The same is true for output buffers, of course.  But here there's more
flexibility.  Even if the buffers have overlapping lifetimes, you could
still allocate them in the same region organized like a C struct (using
different addresses).  However there's another efficiency concern.

Let's assume that each output buffer is smaller than a cacheline.  If each 
were given its own region then it would be aligned with the start of the 
region and hence occupy a single cacheline.  When data is stored in the 
buffer there would only be a single cache miss.  But if multiple buffers 
are allocated in the same region then only one of them will be aligned 
with the start.  The others will be unaligned, and some of them might very 
well straddle a cacheline boundary.  When data is stored in one of those 
buffers, there would be two cache misses, not just one.
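The miss count can be computed directly.  Here's a small helper
(assuming a 16-byte line; the function name is mine): a buffer touches
one cacheline if it fits within a line, two if it straddles a boundary.

```c
#include <stddef.h>

#define CACHELINE 16            /* assumed cacheline size */

/* Number of cachelines touched by a buffer of 'len' bytes placed at
 * byte offset 'off' within a line-aligned region: 1 if it fits in a
 * single line, 2 if it straddles a boundary. */
static unsigned lines_touched(size_t off, size_t len)
{
    return (off / CACHELINE == (off + len - 1) / CACHELINE) ? 1 : 2;
}
```

An 8-byte buffer at offset 0 touches one line; the same buffer at
offset 12 straddles the boundary and touches two, hence the extra miss.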

If you know that the total size of the output buffers is no larger than
one cacheline, I don't see any reason not to allocate them like a struct
in a single region.  16 bytes is probably a good minimum size to assume
for cachelines; I don't think any modern systems use a smaller value.
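On that assumption, a packed layout like this one (the field names and
sizes are invented) stays within a single 16-byte line, so filling any
of the buffers costs at most one miss:

```c
#define CACHELINE 16            /* conservative minimum line size */

/* Several small output buffers allocated struct-style in one region.
 * Total size is <= one cacheline, so none of them can straddle a
 * line boundary. */
struct out_bufs {
    unsigned char cmd[4];
    unsigned char arg[8];
    unsigned char crc[4];
};
```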

There's a secondary concern here.  For systems that _don't_ invalidate the
cache for a write (such as x86), the reasoning above doesn't apply.  The
cache misses will occur the first time the buffers are filled but not
subsequently.  Furthermore, putting multiple output buffers in the same
region can save cache space overall because the buffers can share
cachelines rather than each having its own.  So the decision here is a
tradeoff, and which alternative is better will depend on the system
architecture.

What about allocating an output buffer along with other data (not I/O
buffers)?  As long as the output buffer is allocated first, at the start
of the region, the cacheline alignment will be okay.  But there's another,
more subtle efficiency consideration having to do with temporal locality
of reference.

Output buffers need to use the cache only for a limited period -- the time
during which the buffer is being filled.  Once the output starts, there is
no need for the buffer to remain in the cache, and some architectures do
indeed remove it.  But other data, not I/O buffers, in the same region
will presumably be accessed more often.  Some of this data will share a
cacheline with the buffer, and when the other data is accessed it will
drag the buffer into the cache along with it.  The total number of cache
misses doesn't change; it's still one per output operation at most.  But
the output buffer ends up occupying cache space unnecessarily.  Since the
cache is a limited resource, this again is something to be avoided.

To sum up, DMA I/O buffers should be allocated separately from regular
data structures, using kmalloc().  Multiple buffers with non-overlapping
lifetimes can share a region like a C union.  Multiple output buffers with
overlapping lifetimes can share a region like a C struct.  On some
architectures this can be done without penalty; on others there's no
penalty provided that the total size of the buffers is <= 16 bytes.  
Although output buffers may share a region with other data, they probably 
shouldn't.

If there's anything wrong or incomplete about this, please let me know.

Alan Stern


