As folks may have noticed, I've been re-working my old 2015 dispatch patches that eliminate the network input-side queues in Ganesha.
Matt had wanted fully async non-blocking I/O. I've been poking at it for a week, and I'm now sure that's the wrong way to go. It might still be good for FSALs; that remains to be seen, and DanG and Soumya are looking at it now.

The devil in userland network I/O is system calls. Each epoll_wait is a system call. Each read or write is a system call. Each thread switch is a system call.

My code in Ganesha v2.5 (NTIRPC v1.5) gets the network output down to one system call per request on a very hot thread. We cannot do better; trying harder would just push the data into kernel buffers, possibly slowing our own output (for various reasons).

Re-working that for async non-blocking calls instead means many more system calls. Instead of one clean writev with the TCP fragment header and all ready buffers in a single call (first sketch at the end of this note), we'd at minimum have a write call, an epoll_wait, a spawn of another work thread, then another write call and/or a buffer release, rinse and repeat. For a long buffer chain (exactly the case where we want more performance), we'd get much less performance -- roughly 2 + (3 * number of buffers) additional system calls. For common short response chains, we'd still pay the extra overhead of the epoll system call, doubling our calls.

Also, using writev minimizes buffer copies, and eliminating data copying usually gives far better performance. The only thing async output saves is waiting threads, but I've already got the output threads down to the minimum (per interface). No gain here!

On the input side, the truly optimal reduction in system calls is one read to get the TCP fragment header and up to 1500 bytes of data, followed (only when needed) by another read to get the entire rest of a long fragment in one fell swoop (second sketch below).

With async input I've tried level-triggered epoll, and I'm getting spurious read-ready events. Googling shows that has been a known problem since at least 2014, but it is possible to program around. Still, this might have been workable had async not already been terrible on the output side. Changing to edge-triggered means that every good read must be followed by another read to make sure we've gotten all the data (third sketch below); that is, common small reads turn into two (2) reads. Doubling our system calls in the common case is not the way to go.

In conclusion: with epoll we already know when input data is available, so input threads aren't sitting around waiting anyway, and trying to minimize threads just results in more system calls and poorer performance. NTIRPC already defaults to 200 worker threads; if we need more, we should allocate more. Memory should not be an issue.
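For reference, here's a minimal sketch of the single-writev output path described above: the 4-byte RPC record-marking header (last-fragment bit plus fragment length) and the whole ready buffer chain go out in one system call. This is not the actual NTIRPC code; send_reply() and its arguments are hypothetical, and a real implementation also has to loop on partial writes.

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/uio.h>
    #include <arpa/inet.h>              /* htonl */

    #define LAST_FRAG 0x80000000u       /* record-marking "last fragment" bit */

    /* bufs[0..nbufs-1] hold the already-encoded reply; total_len is
     * their combined length. */
    ssize_t send_reply(int fd, const struct iovec *bufs, int nbufs,
                       size_t total_len)
    {
        struct iovec iov[1 + nbufs];    /* header slot + data buffers */
        uint32_t rm = htonl(LAST_FRAG | (uint32_t)total_len);

        iov[0].iov_base = &rm;
        iov[0].iov_len  = sizeof(rm);
        memcpy(&iov[1], bufs, nbufs * sizeof(struct iovec));

        /* One writev covers the fragment header and every buffer in the
         * chain -- one system call per request, no extra data copies. */
        return writev(fd, iov, 1 + nbufs);
    }

The async alternative replaces that single call with at least an initial write, an epoll_wait for write-ready, a thread handoff, and further writes per buffer, which is where the 2 + (3 * number of buffers) figure above comes from.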
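Second, a sketch of the two-read input pattern described above, again only illustrative (read_fragment() and FIRST_READ are made-up names, and real code must cope with short reads of the header itself): the first recv picks up the fragment header plus up to roughly one MTU of data, and only a long fragment costs a second recv, which MSG_WAITALL turns into a single call for the whole remainder.

    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <arpa/inet.h>              /* ntohl */

    #define LAST_FRAG  0x80000000u
    #define FIRST_READ 1500             /* ~one Ethernet MTU of payload */

    /* malloc()s the fragment data into *out and returns its length,
     * or returns -1 on error. */
    ssize_t read_fragment(int fd, uint8_t **out)
    {
        uint8_t first[4 + FIRST_READ];
        uint32_t rm;

        /* Read #1: fragment header plus up to FIRST_READ bytes of data. */
        ssize_t n = recv(fd, first, sizeof(first), 0);
        if (n < 4)
            return -1;                  /* (real code retries short headers) */

        memcpy(&rm, first, sizeof(rm));
        size_t len = ntohl(rm) & ~LAST_FRAG;    /* fragment length */
        size_t got = (size_t)n - 4;
        if (got > len)
            got = len;                  /* sketch ignores a pipelined next request */

        uint8_t *buf = malloc(len);
        if (buf == NULL)
            return -1;
        memcpy(buf, first + 4, got);

        /* Read #2, only for long fragments: MSG_WAITALL pulls in the
         * entire remainder with one system call. */
        if (got < len &&
            recv(fd, buf + got, len - got, MSG_WAITALL)
                != (ssize_t)(len - got)) {
            free(buf);
            return -1;
        }

        *out = buf;
        return (ssize_t)len;
    }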
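Third, the edge-triggered point: with EPOLLET, a reader has to keep calling read() until it returns EAGAIN, or it risks missing data that arrived while it was working, so even a request that fits entirely in the first read pays for a second read whose only job is to hit EAGAIN. A sketch of that drain loop (the consume callback is hypothetical):

    #include <errno.h>
    #include <unistd.h>

    /* Drain a non-blocking socket after an edge-triggered epoll wakeup. */
    static void drain_socket(int fd,
                             void (*consume)(const char *, ssize_t))
    {
        char buf[65536];

        for (;;) {
            ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0) {
                consume(buf, n);    /* read #1: the actual request data */
                continue;           /* must loop: more data may be queued */
            }
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;              /* read #2: returns EAGAIN, exists only
                                     * so no queued data is missed */
            break;                  /* n == 0 is EOF; other errors are
                                     * handled elsewhere in real code */
        }
    }

For a common small request the second read buys nothing, which is the doubling of system calls mentioned above.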