Soumya and I have been working on-and-off for a couple of months on a design for both async callback and zero-copy, based upon APIs already implemented for Gluster. Once we have something comprehensive and well-written, I'd like to get feedback from other FSALs.
And of course, zero-copy is the whole point of RDMA. Earlier Gluster testing with Samba showed that zero-copy gave a better performance improvement than async IO. The underlying Linux OS calls only allow one or the other. For example, for TCP output in Ganesha V2.5/ntirpc v1.5, I've eliminated one task switch, but still use writev() in a semi-dedicated thread, as there is no async writev() variant. We should have a measurable performance improvement (though it might be masked by all the MDCACHE changes). For FSALs we have the opportunity to design a combined system.

Here's the current state of the design introduction:

NFS-Ganesha direct data placement with reduced task switching and zero-copy

Currently/Previously

(Task switch 1.) Upon signalling (epoll), a master polling thread launches other worker threads, one for each signal.

(Task switch 2.) If there is more than one concurrent request on the same transport tuple (IP source, IP destination, source port, destination port), the request is added to a stall queue.

(Task switch 3.) While parsing the NFS input, each thread can wait for more data.

(Task switch 4.) After parsing the NFS input, the thread queues the request according to several (4) priorities for handling by another worker thread. Requests are not handled in order.

(Task switch 5.) While executing the NFS request, the thread can stall waiting for FSAL data.

(Task switch 6.) After retrieving the resulting data, the thread hands off the output to another thread to handle the system write. [Eliminated in Ganesha V2.5/ntirpc v1.5]

Ideally

(Task switch 1.) Upon signalling (epoll), the worker thread will make only one system call to accept the incoming connection. If there is more than one signal at a time, that same worker will queue the additional signals, queue another work request to handle the next signal, then continue hot, processing the first signal. Note that this replaces the stall queue, as the latter threads utilize a worker pool and are executed sequentially in a fair-queuing fashion. To remain hot, the thread checks for additional work before returning to the idle pool.

(Task switch 2.) Instead of waiting for a read system call to complete, use a callback to schedule another worker thread, parse the NFS request, and call the appropriate FSAL. If more [TCP, RDMA] data is needed for the request, the thread will save the state for the subsequent signal.

(Task switch 3.) While executing the NFS request, the thread can stall waiting for FSAL data. The FSAL will return its result and make a second system call to send the output. In the case that the FSAL result does not require a stall, no task switch is needed. To remain hot, the thread checks for additional output data before returning to the idle pool. Other threads will queue their output data. (As of Ganesha V2.5/ntirpc v1.5, this is implemented for TCP.)

Input signal changes

Currently, the (epoll) signal is blocked per fd after each fd signal. The input signal thread does not reinstate the fd signal until after input processing is complete. This causes a data backlog in the underlying OS, until data is dropped for lack of signal processing. There is evidence of sawtooth patterns in TCP, as the OS will acknowledge (ack) data until no more can be held, causing a TCP stall and slow start.

Ideally, the signal should never be blocked. Until the entire task scheme is upgraded according to this plan, that is not possible, so the block should be reinstated as soon as practicable, allowing new signals to be queued quickly.
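To make the re-arming idea concrete, here is a minimal C sketch, assuming an EPOLLONESHOT-armed fd per transport; struct xprt_state, rearm(), and handle_signal() are made-up names for illustration, not ntirpc code:

    /*
     * Minimal sketch, not ntirpc code: struct xprt_state, rearm(), and
     * handle_signal() are placeholder names.
     */
    #include <errno.h>
    #include <stddef.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    struct xprt_state {
        int fd;
        char buf[64 * 1024];
        size_t used;                /* partial request saved between signals */
    };

    /*
     * With EPOLLONESHOT the fd signal is "blocked" after each delivery;
     * reinstating it early lets new signals queue instead of backing up
     * data in the kernel.
     */
    static int rearm(int epfd, struct xprt_state *x)
    {
        struct epoll_event ev = {
            .events = EPOLLIN | EPOLLONESHOT,
            .data.ptr = x,
        };
        return epoll_ctl(epfd, EPOLL_CTL_MOD, x->fd, &ev);
    }

    /* Worker-thread handler for one (epoll) signal on one transport. */
    static void handle_signal(int epfd, struct xprt_state *x)
    {
        /* Stay hot: pull everything currently available in one pass. */
        while (x->used < sizeof(x->buf)) {
            ssize_t n = recv(x->fd, x->buf + x->used,
                             sizeof(x->buf) - x->used, MSG_DONTWAIT);
            if (n > 0) {
                x->used += (size_t)n;
                continue;
            }
            if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
                break;              /* no more data for now */
            close(x->fd);           /* EOF or hard error */
            return;
        }

        /* Reinstate the signal before decoding or executing the request,
         * so further signals are queued quickly. */
        rearm(epfd, x);

        /* Decoding/dispatch of x->buf would happen here; if the request is
         * incomplete, x->used carries the saved state to the next signal. */
    }

In this sketch the fd signal is reinstated after the buffered read but before any decoding, which is the "as soon as practicable" point described above.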
The signal queue(s) implemented for RDMA should be used for all signals. Preliminary testing by CEA demonstrated that up to 3,000 client connections could be handled during cold startup. However, this cannot be implemented until better asynchrony and parallelism are available.

Transport parallelism

Currently, on SVC_RECV() a new transport (SVCXPRT) is spawned for each incoming TCP and RDMA connection, but not for UDP connections. This requires extensive locking around UDP receive and send, as each incoming request uses the same buffers for input and output, and stores common data fields used by both input and output. There is a UDP multi-threading window between SVC_RECV() and SVC_SEND(); that is, the long-standing code is !MT-safe.

Instead, spawn a new UDP transport for each incoming request. Rather than allocating a separate buffer for each UDP transport, append an IOQ buffer, replacing the rpc_buffer() pointer. This keeps the number of memory allocation calls and the contention exactly the same as before, and permits use of the significantly faster duplex IOQ formatting calls.

Receive asynchrony

Currently, SVC_RECV() and SVC_GETARGS() require locking to prevent multi-threading, as they share common data fields. As of Ganesha V2.5/ntirpc v1.5, the locks were simplified, but one lock remains; it is currently cleared by calling SVC_STAT(). Instead, merge them into a replacement SVC_RECVARGS(), leaving SVC_RECV() for transport parallelism. The existing callback parameter can be repurposed to handle asynchronous I/O. It will not be called until all data for a request is present; that is, there will always be at least one task switch before the callback is handled.
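To illustrate one possible shape for the merged call, here is a header-style C sketch; SVC_RECVARGS() does not exist yet, and the callback typedef, parameter names, and dispatch stub below are assumptions for discussion, not ntirpc API:

    /* Hypothetical sketch only; none of these declarations exist in ntirpc. */
    #include <stdbool.h>

    typedef struct svcxprt SVCXPRT;     /* transport handle (opaque here) */
    struct svc_req;                     /* decoded request (opaque here) */

    /*
     * Invoked only once all data for the request is present, i.e. after at
     * least one task switch from the signal that delivered the last bytes.
     */
    typedef void (*svc_recvargs_cb)(SVCXPRT *xprt, struct svc_req *req,
                                    void *arg);

    /*
     * Proposed replacement for the SVC_RECV()/SVC_GETARGS() pair: receive
     * and decode in one call, so no shared fields are exposed between the
     * two steps, leaving SVC_RECV() free for transport parallelism.
     */
    bool SVC_RECVARGS(SVCXPRT *xprt, svc_recvargs_cb done, void *arg);

    /* Example callback: dispatch to protocol handling, then send the reply. */
    static void recvargs_done(SVCXPRT *xprt, struct svc_req *req, void *arg)
    {
        (void)xprt; (void)req; (void)arg;
        /* execute the NFS request and queue the reply -- not shown */
    }

In this sketch the return value only says whether the work could be scheduled; the outcome of the receive itself would be delivered through the callback.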