Re: [Lustre-discuss] write RPC & congestion

burlen Sun, 22 Aug 2010 10:59:15 -0700

Andreas Dilger wrote:
> On 2010-08-17, at 14:15, burlen wrote:
>   
>> I have some questions about Lustre RPC and the sequence of events that 
>> occur during large concurrent write() involving many processes and large 
>> data size per process.  I understand there is a mechanism of flow 
>> control by credits, but I'm a little unclear on how it works in general 
>> after reading the "networking & io protocol" white paper.
>>     
>
> There are different levels of flow control.  There is one at the LNET level 
> that prevents low-level messages from overwhelming the server, and avoids 
> stalling small/reply messages at the back of a deep queue of requests.
>
>   
>> Is it true that a write() RPC transfers data in chunks of at least 1MB 
>> and at most (max_pages_per_rpc*page_size) Bytes, where page_size=2^16 ? 
>> I can use the bounds to estimate the number of RPCs issued per MB of 
>> data to write?
>>     
>
> Currently, 1MB is the largest bulk IO size, and is the typical size used by 
> clients for all IO.
>
>   
>> About how many concurrent incoming write() RPC per OSS service thread 
>> can a single server handle before it stops responding to incoming RPCs ?
>>     
>
> The server can handle tens of thousands of write _requests_, but note that 
> since Lustre has always been designed as an RDMA-capable protocol the request 
> is relatively small (a few hundreds of bytes) and does not contain any of the 
> DATA.
>
> When one of the server threads is ready to process a read/write request it 
> will get or put the data from/to the buffers that the client already 
> prepared.  The number of currently active IO requests is exactly the number 
> of active service threads (up to 512 by default).
>
>   
>> What happens to an RPC when the server is too busy to handle it, is it 
>> even issued by the client ? Does the client have to poll and/or resend 
>> the RPC ? Does the process of polling for flow control credits add 
>> significant network/server congestion ?
>>     
>
> The clients limit the number of concurrent RPC requests, by default to 8 per 
> OST.  The LNET level message credits will also limit the number of in-flight 
> messages in case there is e.g. an LNET router between the client and server.
>
> The client will almost never time out a request, as it is informed how long 
> requests are currently taking to process and will wait patiently for its 
> earlier requests to finish processing.  If the client is going to time out a 
> request (based on an earlier request timeout that is about to be exceeded) 
> the server will inform it to continue waiting and give it a new processing 
> time estimate (unless of course the server is non-functional or so 
> overwhelmed that it can't even do that).
>
>   
>> Is it likely that a large number of RPCs/flow control credit requests 
>> will induce enough network congestion that clients' RPCs time out ? 
>> How does the client handle such a timeout ?
>>     
>
> Since the flow control credits are bounded, and will be returned to the peer 
> as earlier requests complete, there is no additional traffic due to this.  
> However, considering that HPC clusters are distributed denial-of-service 
> engines it is always possible to overwhelm the server under some conditions.  
> In case of a client RPC timeout (hundreds of seconds under load) the client 
> will resend the request and/or try to contact the backup server until one 
> responds.
>   
Thank you for your help.


Is my understanding correct?

A single RPC request will initiate an RDMA transfer of at most 
"max_pages_per_rpc" pages, where the page unit is the Lustre page size, 
65536 bytes. Each RDMA transfer is executed in 1MB chunks.  On a given 
client, if there are more than "max_pages_per_rpc" pages of data 
available to transfer, multiple RPCs are issued and multiple RDMAs are 
initiated.
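
To make the arithmetic behind this concrete, here is a rough sketch of how I
would estimate the RPC count for one client writing to one OST. It is only
my reading of the thread, not Lustre code: the 1MB bulk limit and the 8
concurrent RPCs per OST come from the reply above, the 65536-byte page size
is my own assumption, 16 pages per RPC is an illustrative value, and the
function and parameter names are made up for the example.

```python
MIB = 1 << 20  # 1MB bulk IO limit mentioned in the reply above


def rpcs_for_write(nbytes, max_pages_per_rpc, page_size, max_bulk_bytes=MIB):
    """Estimate how many bulk write RPCs are needed for nbytes of data.

    Each RPC is assumed to carry at most max_pages_per_rpc * page_size
    bytes, capped at max_bulk_bytes (hypothetical parameter name).
    """
    if nbytes <= 0:
        return 0
    rpc_bytes = min(max_pages_per_rpc * page_size, max_bulk_bytes)
    # Integer ceiling division: a partial final chunk still costs one RPC.
    return (nbytes + rpc_bytes - 1) // rpc_bytes


def min_batches(n_rpcs, max_in_flight=8):
    """Lower bound on the number of batches if at most max_in_flight RPCs
    are outstanding per OST. Real clients pipeline requests, so this is
    only a simplification, not a model of the actual scheduling."""
    return (n_rpcs + max_in_flight - 1) // max_in_flight


if __name__ == "__main__":
    n = rpcs_for_write(100 * MIB, max_pages_per_rpc=16, page_size=65536)
    print(n, min_batches(n))
```

With these assumed values, 16 pages of 65536 bytes is exactly 1MB per RPC,
so the estimate is one RPC per MB of data written.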

Would it be correct to say: The purpose of the "max_pages_per_rpc" 
parameter is to enable the servers to even out the individual progress 
of concurrent clients with a lot of data to move and more fairly 
apportion the available bandwidth amongst concurrently writing clients?




_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
