Andreas Dilger wrote:
> On 2010-08-17, at 14:15, burlen wrote:
>
>> I have some question about Lustre RPC and the sequence of events that
>> occur during large concurrent write() involving many processes and large
>> data size per process. I understand there is a mechanism of flow
>> control by credits, but I'm a little unclear on how it works in general
>> after reading the "networking & io protocol" white paper.
>>
>
> There are different levels of flow control. There is one at the LNET level,
> that controls low-level messages from overwhelming the server with messages,
> and avoiding stalling small/reply messages at the back of a deep queue of
> requests.
>
>
>> Is it true that a write() RPC transfer's data in chunks of at least 1MB
>> and at most (max_pages_per_rpc*page_size) Bytes, where page_size=2^16 ?
>> I can use the bounds to estimate the number of RPCs issued per MB of
>> data to write?
>>
>
> Currently, 1MB is the largest bulk IO size, and is the typical size used by
> clients for all IO.
>
>
>> About how many concurrent incoming write() RPC per OSS service thread
>> can a single server handle before it stops responding to incoming RPCs ?
>>
>
> The server can handle tens of thousands of write _requests_, but note that
> since Lustre has always been designed as an RDMA-capable protocol the request
> is relatively small (a few hundreds of bytes) and does not contain any of the
> DATA.
>
> When one of the server threads is ready to process a read/write request it
> will get or put the data from/to the buffers that the client already
> prepared. The number of currently active IO requests is exactly the number
> of active service threads (up to 512 by default).
>
>
>> What happens to an RPC when the server is too busy to handle it, is it
>> even issued by the client ? Does the client have to poll and/or resend
>> the RPC ? Does the process of polling for flow control credits add
>> significant network/server congestion ?
>>
>
> The clients limit the number of concurrent RPC requests, by default to 8 per
> OST. The LNET level message credits will also limit the number of in-flight
> messages in case there is e.g. an LNET router between the client and server.
>
> The client will almost never time out a request, as it is informed how long
> requests are currently taking to process and will wait patiently for its
> earlier requests to finish processing. If the client is going to time out a
> request (based on an earlier request timeout that is about to be exceeded)
> the server will inform it to continue waiting and give it a new processing
> time estimate (unless of course the server is non-functional or so
> overwhelmed that it can't even do that).
>
>
>> Is it likely that a large number of RPC's/flow control credit requests
>> will induce enough network congestion so that client's RPC's timeout ?
>> How does the client handle such a timeout ?
>>
>
> Since the flow control credits are bounded, and will be returned to the peer
> as earlier requests complete there is not additional traffic due to this.
> However, considering that HPC clusters are distributed denial-of-service
> engines it is always possible to overwhelm the server under some conditions.
> In case of a client RPC timeout (hundreds of seconds under load) the client
> will resend the request and/or try to contact the backup server until one
> responds.
>

Thank you for your help.
Is my understanding correct?

A single RPC request initiates an RDMA transfer of at most
"max_pages_per_rpc" pages, where the page unit is the Lustre page size of
65536 bytes. Each RDMA transfer is executed in 1MB chunks. On a given
client, if there are more than "max_pages_per_rpc" pages of data available
to transfer, multiple RPCs are issued and multiple RDMAs are initiated.

Would it be correct to say: the purpose of the "max_pages_per_rpc"
parameter is to enable the servers to even out the individual progress of
concurrent clients with a lot of data to move, and to more fairly apportion
the available bandwidth among concurrently writing clients?
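
For what it's worth, the estimate being asked about can be sketched in a few
lines of Python. This is only back-of-the-envelope arithmetic, not anything
taken from Lustre itself: the function name estimate_rpcs is made up for
illustration, the 64KiB page size is the value assumed in the question, the
value 16 for max_pages_per_rpc is an assumption chosen so that one RPC comes
to 1MB, and the limit of 8 RPCs in flight per OST is the default quoted
above. The "rounds" figure pretends RPCs go out in lockstep batches, which
is a simplification; real clients pipeline them.

```python
MiB = 1 << 20

def estimate_rpcs(nbytes, page_size=65536, max_pages_per_rpc=16,
                  max_rpcs_in_flight=8, max_bulk=MiB):
    """Rough RPC count for one client writing nbytes to one OST.

    All defaults are assumptions for illustration (see the text above),
    not values read from a live Lustre configuration.
    """
    # Bulk bytes moved per RPC, capped at the 1MB maximum bulk IO size.
    rpc_bytes = min(max_pages_per_rpc * page_size, max_bulk)
    # Ceiling division using integers only.
    n_rpcs = -(-nbytes // rpc_bytes)
    # Idealised number of batches if at most max_rpcs_in_flight run at once.
    rounds = -(-n_rpcs // max_rpcs_in_flight)
    return rpc_bytes, n_rpcs, rounds

if __name__ == "__main__":
    # 100MB written by one process to one OST.
    print(estimate_rpcs(100 * MiB))
```

Under these assumptions a 100MB write works out to 100 RPCs of 1MB each, or
roughly 13 batches of up to 8 concurrent RPCs per OST.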
