> both handling pixel graphics and transferring to graphic card are special
> cases.
> speedup may be due to better prefetch during sequential memory access, but
> larger data size should not help much here.
> more data causes FSB and PCIe contention, and cache thrashing. oops?
pci "memory" is not prefetched.  if you're stuffing bytes in you can use the
write-combining memory type to get pretty good performance for writes (there's
no similar trick for reads).  but generally dma is used to move large chunks
where performance matters.

regardless of dma, larger data sizes *do* help.  like any other network
protocol, there's a header and whatnot.  the minimum tlp for a write is 4
bytes.  ignoring other overhead, that's 25% data for 4-byte integers and ~6%
data for byte writes.  since pcie-3 is 128/130 encoded, the minimum is now 4
bytes.  (quiz: why could this make keeping the plls synced difficult?)

all the 10gbe vendors crank it up to 11 and use 4kb transfers when possible.
all that i've seen can't hit their theoretical maximum frame rate with 60-byte
frames.  too much overhead.

then there's the latency.  in the kernel i use, i keep the cumulative time
spent in irq handlers.  this is useful to see if changes help or hurt irq
latency.  in one case, i found that going from 1 to 2 pcie 4-byte register
reads doubled the time in that irq handler.

- erik
