Jason, I just got an email saying that Mellanox does in fact use an ordering for reads and writes. So I think we can blame the CPU or the PCIe subsystem for the unordered reads.
On Thu, Nov 14, 2013 at 3:05 PM, Jason Gunthorpe <[email protected]> wrote:
> On Thu, Nov 14, 2013 at 01:12:55AM -0400, Anuj Kalia wrote:
>
>> So, another question: why are the reads unordered while the writes are
>> ordered? I think by now we can assume write ordering (my experiments +
>> MVAPICH uses it). Can the PCI reorder the reads issued by the HCA?
>
> Without fencing there is no guarantee in what order things are made
> visible, and the CPU will flush its write buffers however it likes.

I'm using fencing in the read experiment. The code at the server looks like this:

while (1) {
        for (i = 0; i < EXTENT_CAPACITY; i++) {
                ptr[EXTENT_CAPACITY - i - 1] = iter;
                asm volatile("" : : : "memory");     /* compiler barrier */
                asm volatile("mfence" ::: "memory"); /* CPU store fence */
        }
        iter++;
        usleep(2000 + (rand() % 200));
}

> The PCI subsystem can also re-order reads however it likes, that is
> part of the PCI spec. In a 2 socket system don't be surprised if cache
> lines on different sockets complete out of order.
>
> Think of this as a classic multi-threaded race condition, and not
> related to PCI. If you do the same test using 2 threads you probably
> get the same results.

The PCI explanation sounds good. However, with a fence after every
update, I don't think multiple sockets will be a problem.

>> > Intel hardware is very good at hiding ordering issues 99% of the time,
>> > but in many cases there can be a stress'd condition that will show a
>> > different result.
>
>> Hmm.. I'm willing to run billions of iterations of the test. That
>> should give some confidence.
>
> Not really, repeating the same test billions of times is not
> comprehensive. You need to stress the system in all sorts of
> different ways to see different behavior.

Hmm.. it's not really the same test each time: my server sleeps for a
randomly chosen large duration between updates, so if the test passes
for many iterations we can assume that many different interleavings
have been exercised. But yes, that doesn't give 100% confidence.
> For instance, in a 2 socket system there are likely all sorts of crazy
> sensitivities that depend on which socket the memory lives, which
> socket holds the newest cacheline, which socket has an old line, which
> socket is connected directly to the HCA, etc.

Again, does that matter with fences? With a fence after every update,
there is a real-time ordering for when the updates appear in the cache
hierarchy, regardless of the socket.

> Jason

Regards,
Anuj
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
