On Aug 8, 2007, at 8:09 PM, Scott Atchley wrote:
Hi Sam,
Welcome back!
Hi Scott,
Thanks!
I copied EINVAL from ib.c. I can change it to return BMI_EMSGSIZE
since it's more specific.
Ah ok. EMSGSIZE might help us know what the problem is if/when it
happens again.
The trade-off to 32-64 KB unexpected size is memory footprint. To
avoid mallocing a buffer for each incoming unexpected, I pre-alloc
100 rx structs on the server to catch initial connect messages. For
each peer that connects, I alloc another 20 rx structs. For each rx
struct, I alloc two buffers of unexpected size (so I can repost the
rx after handing the first buffer off in BMI_mx_testunexpected()).
This behavior can be changed but it really helps reduce latency.
Also, since 32 KB is the starting size for rendezvous messages in
MX, I would prefer to use 8 or 16 KB so that the unexpected message
is actually sent eagerly.
Can we make it 16K?
As for number of segments, MX will accept up to 256. For messages
less than 32 KB, there is not a penalty since eager messages are
buffered before sending on the wire. For messages over 32 KB using
more than one segment, MX will copy them into a contiguous buffer
before sending, which will greatly reduce throughput.
Sorry - bad terminology there. By 'segments', I meant base types of
a PVFS datatype (called a PVFS Request). They all get encoded into
the same buffer.
-sam
Scott
On Aug 8, 2007, at 7:47 PM, Sam Lang wrote:
Hi All,
Sorry for not being around earlier to participate in this
discussion. I agree that the bits of code in
io_find_target_datafiles are nasty and I'll be sure to clean that
up, but the cause of this bug isn't with small IO. The check
total_bytes <= max_unexp_payload will still do the right thing
(not enable small IO) even if max_unexp_payload is negative. In
fact, the problem occurs because the noncontig request makes the
normal IO request larger than 4K, and when the sys-io state
machine tries to post the unexpected send, the BMI mx layer
returns EINVAL because the
request is larger than its specified unexpected size (see mx.c:1391).
We do the same thing in other BMI methods (gm and ib), but the
unexpected limits there are bigger (8K for ib, 16K for gm, 16K for
tcp), and so we've never actually hit them with unexpected
requests. With a large indexed request, I think we would see the
same errors with ib and gm, unless we hit the limit of max request
segments first. Each individual segment of an indexed request
takes 80 bytes, so we would need an indexed request with about 100
segments before hitting the max for ib, and somewhere around 200
for tcp and gm. The limit on the number of segments for a request
is hardcoded to 100 right now.
At this point, it seems like the best fix is the one Scott chose:
just increase the max unexpected size for MX. The alternative is
to split up unexpected requests if they're above a certain size
into multiple unexpecteds, and join them on the server. Messy and
a lot of work. If the unexpected message size isn't card
specific, could we make it something like 32K or 64K? Are there
drawbacks to making it that big?
Also, should we increase the limit of request segments allowed?
It might be inefficient for a user to create an MPI indexed
datatype with that many elements, but there are users that will
probably do it anyway. Alternatively, we could consider more
efficiently encoding each request segment in PVFS.
As an aside, the other methods return -EMSGSIZE, while mx returns
EINVAL, which may have made this harder to debug.
-sam
On Aug 7, 2007, at 3:21 PM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 07 Aug 2007 15:21 -0400:
I assumed that small_io_size is fixed in PVFS2 and was greater
than 4 KB, which is why I volunteered to change bmi_mx. I chose
4 KB for bmi_mx simply because that was the value I used in
Lustre (kernel page size). I am not wedded to 4 KB.
Okay, good reasoning. We'll let Sam tell us what he thinks. He did
the small io work. I can't think of a reason why any device could
not support a minimum of 8k, like you say, if that would make more
sense for the small io implementation.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers