Soumya and I have been working on-and-off for a couple of months on a
design for both async callback and zero-copy, based upon APIs already
implemented for Gluster.  Once we have something comprehensive and
well-written, I'd like to get feedback from other FSALs.

And of course, zero-copy is the whole point of RDMA.  Earlier Gluster
testing with Samba showed that zero-copy gave a better performance
improvement than async IO.

The underlying Linux OS calls only allow one or the other.  For example,
for TCP output in Ganesha V2.5/ntirpc 1.5, I've eliminated one task
switch, but still use writev() in a semi-dedicated thread, as there
is no async writev() variant.  We should have a measurable performance
improvement (but it might be masked by all the MDCACHE changes).
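
To make the current TCP output path concrete, here's a rough sketch of the
semi-dedicated writer-thread pattern (the queue type and names are
hypothetical, not the actual ntirpc IOQ code):

/* Minimal sketch (hypothetical names): a semi-dedicated writer thread
 * drains queued output with writev(), since Linux has no async vectored
 * write for sockets.  Not the actual ntirpc IOQ code. */
#include <pthread.h>
#include <stdbool.h>
#include <sys/uio.h>
#include <unistd.h>

struct out_entry {
    int fd;                     /* destination socket */
    struct iovec iov[8];        /* gathered reply buffers */
    int iovcnt;
    struct out_entry *next;
};

struct out_queue {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    struct out_entry *head;
    bool shutdown;
};

static void *writer_thread(void *arg)
{
    struct out_queue *q = arg;

    pthread_mutex_lock(&q->lock);
    while (!q->shutdown) {
        while (q->head == NULL && !q->shutdown)
            pthread_cond_wait(&q->cond, &q->lock);
        while (q->head != NULL) {
            struct out_entry *e = q->head;

            q->head = e->next;
            pthread_mutex_unlock(&q->lock);

            /* Blocking gather write: the call that cannot currently be
             * made asynchronous.  (Freeing of e is elided here.) */
            (void)writev(e->fd, e->iov, e->iovcnt);

            pthread_mutex_lock(&q->lock);
        }
    }
    pthread_mutex_unlock(&q->lock);
    return NULL;
}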

For FSALs we have the opportunity to design a combined system.

Here's the current state of the design introduction:

NFS-Ganesha direct data placement
with reduced task switching and zero-copy

Currently/Previously

(Task switch 1.)  Upon signalling (epoll), a master polling thread launches 
other worker threads, one for each signal.

(Task switch 2.)  If there is more than one concurrent request on the same 
transport tuple (IP source, IP destination, source port, destination port), the 
request is added to a stall queue.

(Task switch 3.)  While parsing the NFS input, each thread can wait for more 
data.

(Task switch 4.)  After parsing the NFS input, the thread queues the request 
according to several (4) priorities for handling by another worker thread.  
Requests are not handled in order.

(Task switch 5.)  While executing the NFS request, the thread can stall waiting 
for FSAL data.

(Task switch 6.)  After retrieving the resulting data, the thread hands off the 
output to another thread to handle the system write. [Eliminated in Ganesha 
V2.5/ntirpc v1.5]
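
For orientation, here's a compressed sketch of the hand-off in task switch 1
(names are hypothetical, not the actual Ganesha dispatcher code): every ready
transport is passed to another thread rather than processed by the poller
itself.

/* Hedged sketch of the legacy pattern (hypothetical names): a master
 * thread blocks in epoll_wait() and hands every ready fd to a worker. */
#include <sys/epoll.h>

#define MAX_EVENTS 64

extern void thrdpool_submit(void (*fn)(void *), void *arg);  /* assumed pool API */
extern void process_transport(void *xprt_arg);               /* per-fd work */

void master_poll_loop(int epfd)
{
    struct epoll_event ev[MAX_EVENTS];

    for (;;) {
        int n = epoll_wait(epfd, ev, MAX_EVENTS, -1);

        /* Task switch 1: each ready transport is handed to another
         * worker thread instead of being processed here. */
        for (int i = 0; i < n; i++)
            thrdpool_submit(process_transport, ev[i].data.ptr);
    }
}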

Ideally

(Task switch 1.)  Upon signalling (epoll), the worker thread will make only one 
system call to accept the incoming connection.

If there is more than one signal at a time, that same worker will queue the 
additional signals, queue another work request to handle the next signal, then 
continue hot processing the first signal.  Note that this replaces the stall 
queue, as the latter threads utilize a worker pool and are executed 
sequentially in a fair-queuing fashion.

To remain hot, the thread checks for additional work before returning to the 
idle pool.
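
A rough sketch of this intended hot path (helper names are hypothetical, not
ntirpc code):

/* Sketch of the intended hot path: the signalled worker handles the
 * first event inline, queues any additional events plus one follow-on
 * work request, and checks for more work before going idle. */
#include <stdbool.h>
#include <sys/epoll.h>

extern void signal_queue_push(struct epoll_event *ev);     /* assumed */
extern bool signal_queue_pop(struct epoll_event *out);     /* assumed, false = empty */
extern void thrdpool_submit_signal_worker(void);           /* assumed */
extern void process_event(struct epoll_event *ev);         /* accept/read, one syscall */

void worker_on_signal(struct epoll_event *ev, int nevents)
{
    struct epoll_event next;

    /* Queue the additional signals and one work request for them ... */
    for (int i = 1; i < nevents; i++)
        signal_queue_push(&ev[i]);
    if (nevents > 1)
        thrdpool_submit_signal_worker();

    /* ... then continue hot, processing the first signal. */
    process_event(&ev[0]);

    /* Stay hot: drain any queued signals before returning to the pool. */
    while (signal_queue_pop(&next))
        process_event(&next);
}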

(Task switch 2.)  Instead of waiting for a read system call to complete, use a 
callback to schedule another worker thread, parse the NFS request, and call the 
appropriate FSAL.

If more [TCP, RDMA] data is needed for the request, the thread will save the 
state for the subsequent signal.
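
One way to picture the saved state: a non-blocking receive that keeps its
offset and simply returns on EAGAIN, so the next signal resumes the same
request (the structure below is hypothetical, not the ntirpc record reader).

/* Hypothetical sketch of saving parse state between signals: read what
 * is available without blocking; on EAGAIN, keep the offset and return
 * so the next signal resumes the same request. */
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <unistd.h>

struct recv_state {
    char   *buf;        /* request buffer */
    size_t  want;       /* total bytes needed (e.g. from the record mark) */
    size_t  have;       /* bytes received so far */
};

/* Returns true when the full request is present; false means "wait for
 * the next signal" (state preserved) or a fatal error (have reset). */
bool try_recv(int fd, struct recv_state *st)
{
    while (st->have < st->want) {
        ssize_t n = read(fd, st->buf + st->have, st->want - st->have);

        if (n > 0) {
            st->have += (size_t)n;
        } else if (n == -1 && errno == EINTR) {
            continue;               /* interrupted: retry immediately */
        } else if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return false;           /* state saved for the next signal */
        } else {
            st->have = 0;           /* EOF or hard error: caller tears down */
            return false;
        }
    }
    return true;                    /* complete: parse and call the FSAL */
}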

(Task switch 3.)  While executing the NFS request, the thread can stall waiting 
for FSAL data.  The FSAL will return its result, and a second system call will 
send the output.  In the case that the FSAL result does not require a stall, no 
task switch is needed.

To remain hot, the thread checks for additional output data before returning to 
the idle pool. Other threads will queue their output data. (As of Ganesha 
V2.5/ntirpc v1.5, this is implemented for TCP.)
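
Here's a hedged sketch of the callback shape this implies (names are
hypothetical, not an existing Ganesha FSAL API): the FSAL either completes
inline, costing no task switch, or invokes the callback later from its own
context, which then sends the output.

/* Hypothetical callback shape for an async FSAL read (not an existing
 * Ganesha API): inline completion costs no task switch; otherwise the
 * FSAL calls the callback later from its own thread. */
#include <stddef.h>

struct req_ctx;                                   /* request context (opaque) */

typedef void (*fsal_done_cb)(struct req_ctx *ctx, int status,
                             const void *data, size_t len);

/* Returns 0 if completed inline (cb already called), or EINPROGRESS if
 * the FSAL will call cb later from its own context. */
extern int fsal_read_async(struct req_ctx *ctx, size_t offset, size_t len,
                           fsal_done_cb cb);

extern void send_reply(struct req_ctx *ctx, int status,
                       const void *data, size_t len);

static void read_done(struct req_ctx *ctx, int status,
                      const void *data, size_t len)
{
    /* Second system call: queue/send the output; a hot thread drains the
     * output queue before returning to the idle pool. */
    send_reply(ctx, status, data, len);
}

void handle_read(struct req_ctx *ctx, size_t offset, size_t len)
{
    (void)fsal_read_async(ctx, offset, len, read_done);
}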

Input signal changes

Currently, the (epoll) signal is blocked per fd after each fd signal.  The 
input signal thread does not reinstate the fd signal until after input 
processing is complete.  This causes a data backlog in the underlying OS until 
data is dropped for lack of signal processing.  There is evidence of sawtooth 
patterns in TCP, as the OS will acknowledge (ack) data until no more can be 
held, causing a TCP stall and slow start.

Ideally, the signal should never be blocked.  Until the entire task scheme is 
upgraded according to this plan, that is not possible.  So the signal should be 
reinstated as soon as practicable, allowing new signals to be queued quickly.
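
For the epoll/TCP case, that amounts to re-arming an EPOLLONESHOT fd as soon
as the available input has been pulled, rather than after the request has been
fully processed; a sketch with hypothetical helper names:

/* Sketch: with EPOLLONESHOT, the fd stops signalling after each event.
 * Re-arm it as soon as the available input has been consumed, not after
 * the whole request has been processed, so new signals queue quickly. */
#include <sys/epoll.h>

extern void consume_available_input(void *xprt);  /* hypothetical: drains the fd */
extern int  xprt_fd(void *xprt);                  /* hypothetical accessor */

void on_input_signal(int epfd, void *xprt)
{
    struct epoll_event ev = {
        .events = EPOLLIN | EPOLLONESHOT,
        .data.ptr = xprt,
    };

    consume_available_input(xprt);                /* pull what the OS has */

    /* Reinstate the signal immediately; request processing continues
     * afterwards without blocking further input notifications. */
    (void)epoll_ctl(epfd, EPOLL_CTL_MOD, xprt_fd(xprt), &ev);
}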

The signal queue(s) implemented for RDMA should be used for all signals.  
Preliminary testing by CEA demonstrated that up to 3,000 client connections 
could be handled during cold startup.  However, this cannot be implemented 
until better asynchrony and parallelism are available.

Transport parallelism

Currently, on SVC_RECV() a new transport (SVCXPRT) is spawned for each incoming 
TCP and RDMA connection, but not for UDP connections.  This requires extensive 
locking around UDP receive and send, as each incoming request uses the same 
buffers for input and output, and stores common data fields used by both input 
and output.  There exists a UDP multi-threading window between SVC_RECV() and 
SVC_SEND(); that is, the long-standing code is !MT-safe.

Instead, spawn a new UDP transport for each incoming request.  Rather than 
allocating a separate buffer for each UDP transport, append an IOQ buffer, 
replacing the rpc_buffer() pointer.  This will keep the number of memory 
allocation calls and contention exactly the same as previously, and permit 
usage of the significantly faster duplex IOQ formatting calls.
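
A rough sketch of the per-request clone idea (struct layout and names are
hypothetical, not the actual ntirpc SVCXPRT or IOQ layout): the clone shares
the parent socket but carries its own appended buffer in the same allocation,
so receive and send no longer contend on shared fields.

/* Hypothetical per-request UDP transport clone: one allocation holds
 * both the transport state and the appended buffer, so SVC_RECV() and
 * SVC_SEND() no longer share fields.  Not the actual ntirpc layout. */
#include <netinet/in.h>
#include <stdlib.h>

#define UDP_BUFSZ 65536

struct udp_xprt {
    int                 fd;          /* shared socket */
    struct sockaddr_in6 peer;        /* per-request source address */
    size_t              len;         /* bytes received / to send */
    char                buf[];       /* appended IOQ-style buffer */
};

struct udp_xprt *udp_xprt_clone(int fd)
{
    /* One malloc() per request, the same count as one separate buffer
     * per request before, but state and data now share an allocation. */
    struct udp_xprt *x = malloc(sizeof(*x) + UDP_BUFSZ);

    if (x != NULL) {
        x->fd = fd;
        x->len = 0;
    }
    return x;
}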

Receive asynchrony

Currently, SVC_RECV() and SVC_GETARGS() require locking to prevent 
multi-threading, as each shares common data fields.  As of Ganesha V2.5/ntirpc 
v1.5, the locks were simplified, but one lock remains; it is currently cleared 
by calling SVC_STAT().

Instead, merge them into a replacement SVC_RECVARGS(), leaving SVC_RECV() for 
transport parallelism.  The existing callback parameter can be repurposed to 
handle asynchronous I/O.  It will not be called until all data for a request 
is present.  That is, there will always be at least one task switch before the 
callback is handled.
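
A hedged sketch of what the merged entry point might look like (the signature
is hypothetical, not a committed ntirpc interface):

/* Hypothetical shape of a merged SVC_RECVARGS(): receive and argument
 * decoding combine into one call, and the callback fires only once all
 * request data is present, i.e. always after at least one task switch. */
#include <stdbool.h>

struct svc_xprt;                       /* transport handle (opaque here) */
struct svc_req;                        /* decoded request (opaque here) */

typedef void (*svc_recvargs_cb)(struct svc_xprt *xprt,
                                struct svc_req *req,
                                void *user_data);

/* Kicks off (or continues) the receive; decodes arguments when the
 * request is complete, then schedules cb on a worker thread. */
extern bool svc_recvargs(struct svc_xprt *xprt,
                         svc_recvargs_cb cb, void *user_data);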
