For the benefit of those who, like myself, still find the whole question of cache-DMA interactions a bit confusing, here is a summary of what I've learned. Implicit is that we are writing for a least-common-denominator architecture. On the x86 none of these restrictions apply and most of the considerations are unnecessary. (Does anybody still worry about ISA DMA and its restriction to the first 16 MB of memory?)
Also implicit is that we are discussing I/O of kernel data. Data that will be passed directly to a user process should be treated differently. Finally, I will use the term "region" to refer to a block of memory returned by kmalloc().

The first restriction is that some architectures (which ones? -- I don't know) are unable to perform DMA to addresses on the stack. So all I/O buffers _must_ be allocated using kmalloc() or something similar; they _cannot_ be automatic local variables. (What about static allocation?)

The second restriction is that some architectures don't maintain their cache coherently when DMA takes place. This means that input buffers _must_ be allocated in a region containing no data the CPU will touch during the input operation. For if the CPU touches data D lying in the same cacheline as input buffer B while the input operation is underway, the old contents of B will get loaded into the cache along with D. Then after the input completes, when the CPU tries to read the new data in B it will see the stale contents in the cache instead.

If these two restrictions are obeyed then things will work correctly. But there are additional considerations involving efficiency. Each time an input operation completes, a cache miss necessarily occurs the first time the CPU tries to read the data. On some architectures, the cache contents for output buffers are invalidated when the output takes place, so the next time the buffer is accessed will also incur a cache miss. These misses are expensive and to be avoided if possible.

With input buffers, there's not much you can do. If you have several input buffers, and the data and input operations for them have non-overlapping lifetimes, then there's no reason not to allocate them in the same region organized like a C union -- starting at the beginning of the region and using the same storage addresses.
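To make the union-style sharing concrete, here is a minimal sketch. The device, buffer names, and sizes are all hypothetical, and malloc() stands in for kmalloc() so the example can be tried in userspace; the point is only the layout -- two input buffers with non-overlapping lifetimes occupying the same storage at the start of one region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical device with two input transfers whose lifetimes never
 * overlap: an 8-byte status report and a 64-byte bulk-in packet.
 * Rather than two separate regions, one region holds both, union-style:
 * each buffer starts at the beginning of the region and they reuse the
 * same storage addresses.  (malloc() here stands in for kmalloc().) */
union input_buffers {
	unsigned char status[8];    /* used only during status polls   */
	unsigned char bulk_in[64];  /* used only during bulk transfers */
};

/* One allocation covers whichever transfer is currently active. */
static union input_buffers *alloc_input_region(void)
{
	return malloc(sizeof(union input_buffers));
}
```

Because both members start at the beginning of the region, the region is only as large as the biggest buffer, which minimizes cache utilization exactly as described above.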
In fact, this is to be encouraged because it minimizes cache utilization, although it might tend to make the code a little less clear.

The same is true for output buffers, of course. But here there's more flexibility. Even if the buffers have overlapping lifetimes, you could still allocate them in the same region organized like a C struct (using different addresses). However there's another efficiency concern. Let's assume that each output buffer is smaller than a cacheline. If each were given its own region then it would be aligned with the start of the region and hence occupy a single cacheline. When data is stored in the buffer there would only be a single cache miss. But if multiple buffers are allocated in the same region then only one of them will be aligned with the start. The others will be unaligned, and some of them might very well straddle a cacheline boundary. When data is stored in one of those buffers, there would be two cache misses, not just one.

If you know that the total size of the output buffers is no larger than one cacheline, I don't see any reason not to allocate them like a struct in a single region. 16 bytes is probably a good minimum size to assume for cachelines; I don't think any modern systems use a smaller value.

There's a secondary concern here. For systems that _don't_ invalidate the cache for a write (such as x86), the reasoning above doesn't apply. The cache misses will occur the first time the buffers are filled but not subsequently. Furthermore, putting multiple output buffers in the same region can save cache space overall because the buffers can share cachelines rather than each having its own. So the decision here is a tradeoff, and which alternative is better will depend on the system architecture.

What about allocating an output buffer along with other data (not I/O buffers)? As long as the output buffer is allocated first, at the start of the region, the cacheline alignment will be okay.
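The struct-style packing and the straddling concern can be sketched as follows. The buffer names and sizes are hypothetical, and the 16-byte cacheline is the conservative minimum assumed above; the helper just checks whether a buffer at a given offset would cross a cacheline boundary, assuming the region itself starts cacheline-aligned.

```c
#include <assert.h>
#include <stddef.h>

#define CACHELINE 16	/* conservative minimum cacheline size */

/* Two hypothetical small output buffers with overlapping lifetimes,
 * packed struct-style into one region.  Their total size is 16 bytes,
 * so with a cacheline-aligned region neither one straddles a
 * cacheline boundary on a 16-byte-cacheline machine. */
struct output_buffers {
	unsigned char setup[8];  /* e.g. a control-request header */
	unsigned char ack[8];    /* e.g. a short handshake reply  */
};

/* Would a buffer of 'len' bytes at offset 'off' cross a cacheline
 * boundary?  (Assumes the enclosing region is cacheline-aligned.) */
static int straddles_cacheline(size_t off, size_t len)
{
	return off / CACHELINE != (off + len - 1) / CACHELINE;
}
```

With the total size kept within one cacheline, storing into either buffer costs at most one cache miss; a larger second buffer at offset 8 would straddle the boundary and cost two.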
But there's another, more subtle efficiency consideration having to do with temporal locality of reference. Output buffers need to use the cache only for a limited period -- the time during which the buffer is being filled. Once the output starts, there is no need for the buffer to remain in the cache, and some architectures do indeed remove it. But other data (not I/O buffers) in the same region will presumably be accessed more often. Some of this data will share a cacheline with the buffer, and when the other data is accessed it will drag the buffer into the cache along with it. The total number of cache misses doesn't change; it's still at most one per output operation. But the output buffer ends up occupying cache space unnecessarily. Since the cache is a limited resource, this again is something to be avoided.

To sum up: DMA I/O buffers should be allocated separately from regular data structures, using kmalloc(). Multiple buffers with non-overlapping lifetimes can share a region like a C union. Multiple output buffers with overlapping lifetimes can share a region like a C struct. On some architectures this can be done without penalty; on others there's no penalty provided that the total size of the buffers is <= 16 bytes. Although output buffers may share a region with other data, they probably shouldn't.

If there's anything wrong or incomplete about this, please let me know.

Alan Stern
