Hello!

On Dec 22, 2010, at 12:43 AM, Jeremy Filizetti wrote:

> In the attachment I created that Andreas posted at
> https://bugzilla.lustre.org/attachment.cgi?id=31423, if you look at graphs 1
> and 2, they are both using a larger-than-default max_rpcs_in_flight.  I
> believe the data without the patch from bug 16900 had max_rpcs_in_flight=42,
> and the data with the patch from 16900 used max_rpcs_in_flight=32.  So the
> short answer is that we are already increasing max_rpcs_in_flight for all of
> that data (which is needed for good performance at higher latencies).

Ah! This should have been noted somewhere.
Well, it's still unfair then! ;)
You see, each OSC can cache up to 32 MB of dirty data by default (the
max_dirty_mb OSC setting in /proc).
So when you have 4 MB RPCs, you actually use only 8 RPCs to transfer your
entire allotment of dirty pages, whereas you use 32 for 1 MB RPCs (so setting
max_rpcs_in_flight any higher has no effect unless you also bump max_dirty_mb).
Of course this only affects the write RPCs, not reads.
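
To make the arithmetic concrete, here is a back-of-the-envelope sketch in
Python (assuming the default max_dirty_mb=32; the numbers are illustrative,
not measurements):

    # How many write RPCs can an OSC actually keep in flight before it runs
    # out of dirty pages to send?  (Assumed default: max_dirty_mb=32.)
    def write_rpcs_limited_by_dirty_cache(max_dirty_mb, rpc_size_mb,
                                          max_rpcs_in_flight):
        rpcs_to_cover_cache = max_dirty_mb // rpc_size_mb
        return min(max_rpcs_in_flight, rpcs_to_cover_cache)

    print(write_rpcs_limited_by_dirty_cache(32, 1, 32))   # -> 32
    print(write_rpcs_limited_by_dirty_cache(32, 4, 32))   # -> 8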

> My understanding is that the real benefit of the larger RPC patch is that we
> do not have to pay 12 round-trip times to read 4 MB (four 1 MB bulk RPCs);
> instead I think we have 3.  I've never traced through to see that this is
> actually what is happening, but from what I read about the patch it sends 4
> memory descriptors with a single bulk request.

Well, I don't think this should matter much anyhow. Since we send the RPCs
asynchronously and in parallel, the latency of the bulk descriptor GET does not
add up.
Because of that, the results you got should have been much closer together. I
wonder what other factors played a role here?
I see you only had a single client, so it's not as if you were able to
overwhelm the number of OSS threads running. Even in the case of 6 OSTs per
OSS, assuming all 42 RPCs were in flight, that's still only 252 RPCs. Did you
happen to check that that's how many threads you had running?
How does your RTT delay get introduced for the test? Could it be that when
there are more messages on the wire at the same time, they are delayed more
(aside from the obvious bandwidth-induced delay), e.g. by bottlenecking on a
single message at a time with a mandatory delay, or something like that? See
the toy model below.
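
To illustrate what I mean, here is a toy model in Python; the 50 ms delay and
the 252-RPC worst case are assumptions for the sake of the example, not
measurements:

    # Two ways an injected delay could behave for a window of in-flight RPCs.
    def pipelined_time(n_rpcs, in_flight, rtt):
        # Concurrent RPCs overlap, so each window shares one round trip.
        rounds = -(-n_rpcs // in_flight)   # ceiling division
        return rounds * rtt

    def serialized_time(n_rpcs, rtt):
        # Each message pays the mandatory delay one at a time.
        return n_rpcs * rtt

    rtt = 0.050  # seconds, assumed injected delay
    print(pipelined_time(252, 42, rtt))   # ~0.3 s
    print(serialized_time(252, rtt))      # ~12.6 s

If the delay were applied per message rather than per window, configurations
that keep more messages on the wire would be penalized much more heavily.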

> What isn't quite clear to me is why Lustre takes 3 RTTs for a read and 2 for
> a write.  I think I understand the write having to communicate once with the
> server, because preallocating buffers for all clients would possibly be a
> waste of resources.  But for reading, it seems logical (from the RDMA
> standpoint) that the memory buffer could be pre-registered and sent to the
> server, and the server would respond with the contents for that buffer, which
> would be 1 RTT.

Probably the difference comes down to GET vs. PUT semantics in LNET. There are
going to be at least 2 RTTs in any case. One RTT is the "header" RPC that tells
the OST "hey, I am doing an operation here that involves bulk I/O; it has this
many pages and the descriptor is such and such", and then the server spends
another RTT to actually fetch/push the data (and that part might actually be
more than one RTT for one of the GET/PUT cases, I guess?).
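
A rough way to think about the cost, as a sketch in Python (the per-RPC RTT
counts here are parameters to play with, not something I have traced and
verified):

    # One round trip for the "header" RPC that describes the bulk, plus one
    # or more round trips for the server-driven bulk GET/PUT.
    def bulk_rpc_time(rtt, header_rtts=1, bulk_rtts=1):
        return (header_rtts + bulk_rtts) * rtt

    rtt = 0.050  # seconds, assumed
    print(bulk_rpc_time(rtt, bulk_rtts=1))  # 2 RTTs -> 0.10 s
    print(bulk_rpc_time(rtt, bulk_rtts=2))  # 3 RTTs -> 0.15 s

That would account for the 2-vs-3 RTT difference you describe if one direction
of the bulk transfer needs an extra round trip.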

> I don't have everything set up right now in our test environment, but with a
> little effort I could set up a similar test if you're wondering about
> something specific.

It would be interesting to confirm the number of RPCs actually being processed
on the server at any one time, I think.
Did you try direct I/O too? Some older versions of Lustre used to send all
outstanding direct-I/O RPCs in parallel, so if you did your I/O as just a
single direct I/O write, the latency of that write should be around a couple of
RTTs. I think we still do this even in 1.8.5, so it would make an interesting
comparison.
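
For reference, a single direct I/O write along these lines is what I have in
mind (a sketch for a Linux client; the mount point, file name, and 4 MB size
are made up):

    # Single O_DIRECT write; mmap gives a page-aligned buffer, which
    # O_DIRECT requires.  Path and size below are hypothetical.
    import mmap
    import os

    size = 4 * 1024 * 1024                  # 4 MB, assumed
    buf = mmap.mmap(-1, size)               # anonymous, page-aligned, zeroed

    fd = os.open("/mnt/lustre/directio_test",
                 os.O_WRONLY | os.O_CREAT | os.O_DIRECT, 0o644)
    try:
        os.write(fd, buf)                   # the single direct I/O write
    finally:
        os.close(fd)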

Bye,
    Oleg