WHAT: a) Clarify the actual max MPI payload size for eager messages (i.e., the exact meaning of btl_XXX_eager_limit), and b) allow network administrators to shape network traffic by publishing actual BTL max wire fragment sizes (i.e., MPI max payload size + max PML header size + max BTL header size).
WHY: Currently, BTL eager_limit values actually have the PML header subtracted from them, meaning that the eager_limit is not actually the largest MPI message payload size. Terry and Jeff, at least, find this misleading. :-) Additionally, BTLs may add their own (variable-sized) headers beyond the eager_limit size, so it is not possible for a network administrator to shape network traffic, because they don't (can't) know what a BTL's max wire fragment size is.

WHERE: ompi/pml/{ob1,csum,dr}, and likely all BTLs

TIMEOUT: COB, Friday, 31 July 2009

DESCRIPTION: In trying to fix the checks for eager_limit in the OB1 PML (per discussion on the OMPI teleconf this past Tuesday), I've come across a couple of gaps. This RFC is to get others' (mainly Brian Barrett's and George Bosilca's) opinions on exactly what should be done for issue #1, and the OK to implement the fix for issue #2.

1. The btl_XXX_eager_limit values are the upper limit on the payload of each eager fragment, but that payload must include the PML header. Hence, the max MPI data payload size is (btl_XXX_eager_limit - PML header size) -- and the PML header size even depends on which flavor of PML send you are using. Terry and Jeff find this misleading. Specifically, if a user sets the eager_limit to 1024 bytes and expects their 256 MPI_INTs to fit in an eager message, they're wrong.

Additionally, network administrators who try to adjust the eager_limit to fit the max MTU size of their networks are unpleasantly surprised, because the BTL may actually send (btl_XXX_eager_limit + btl_XXX_header_size) bytes at a time. Even worse, the value of btl_XXX_header_size is not published anywhere, so a network administrator cannot know whether they're actually going over the MTU size or not.

--> Note that we only looked at eager_limit -- similar issues likely also exist with btl_XXX_max_send_size, and possibly btl_XXX_rdma_pipeline_send_length...? btl_XXX_rdma_pipeline_frag_size (i.e., the RDMA size) should be OK -- I *think* it's an absolute payload size already.

If you don't remember what these names mean, look at the pretty picture here: http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

There are two solutions I can think of. Which should we do?

a. Pass the (max?) PML header size down into the BTL during initialization, such that btl_XXX_eager_limit can represent the max MPI data payload size (i.e., the BTL can size its buffers to accommodate its desired max eager payload size, its own header size, and the PML header size). Thus, the eager_limit can truly be the MPI data payload size -- and easy to explain to users.

b. Stay with the current btl_XXX_eager_limit implementation (which OMPI has had for a long, long time) and add code to check for btl_eager_limit being less than the PML header size (per this past Tuesday's discussion). This is the minimal-distance change.

2. OMPI currently does not publish enough information for a user to set eager_limit so as to do BTL traffic shaping. That is, one really needs to know the (max) BTL header length and the (max) PML header length to be able to calculate the eager_limit that forces a specific (max) BTL wire fragment size.

Our proposed solution is to have ompi_info print out the (max) PML and BTL header sizes. Regardless of whether 1a) or 1b) is chosen, with these two pieces of information a determined network administrator could calculate the max wire fragment size used by OMPI, and therefore be able to do at least some traffic shaping. (A small worked example of this arithmetic is appended after the signature.)

-- Jeff Squyres jsquy...@cisco.com
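Appendix: to make the arithmetic above concrete, here is a small standalone sketch. This is illustrative only -- it is not OMPI source, and the header sizes are made-up example values, since the real values depend on the PML send path and the BTL in use (which is exactly why issue #2 asks ompi_info to print them). It shows the max MPI payload under today's semantics (and option 1b), under option 1a, and the corresponding max wire fragment size a network administrator would compare against the MTU.

/* Illustrative sketch only -- not OMPI code; all values are examples. */
#include <stdio.h>

int main(void)
{
    size_t eager_limit     = 1024;  /* btl_XXX_eager_limit as set by the user */
    size_t pml_header_size = 64;    /* (max) PML header size -- example value */
    size_t btl_header_size = 48;    /* (max) BTL header size -- example value */

    /* Current semantics (and option 1b): the PML header is carved out of
     * eager_limit, so the largest MPI data payload is smaller than the
     * user-set value; 256 MPI_INTs (1024 bytes) do NOT fit. */
    size_t max_mpi_payload_today = eager_limit - pml_header_size;

    /* Option 1a: eager_limit *is* the max MPI data payload; the BTL sizes
     * its buffers for payload + PML header (+ its own header). */
    size_t max_mpi_payload_1a = eager_limit;

    /* Issue 2: what actually hits the wire per eager fragment -- the number
     * a network administrator needs for traffic shaping. */
    size_t max_wire_frag_today = eager_limit + btl_header_size;
    size_t max_wire_frag_1a    = eager_limit + pml_header_size + btl_header_size;

    printf("max MPI payload (today / 1b): %zu bytes\n", max_mpi_payload_today);
    printf("max MPI payload (option 1a) : %zu bytes\n", max_mpi_payload_1a);
    printf("max wire fragment (today)   : %zu bytes\n", max_wire_frag_today);
    printf("max wire fragment (1a)      : %zu bytes\n", max_wire_frag_1a);
    return 0;
}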