WHAT: a) Clarify the actual max MPI payload size for eager messages
      (i.e., the exact meaning of btl_XXX_eager_limit), and b) allow
      network administrators to shape network traffic by publishing
      actual BTL max wire fragment sizes (i.e., MPI max payload size +
      max PML header size + max BTL header size).

WHY: Currently BTL eager_limit values actually have the PML header
     subtracted from them, meaning that the eager_limit is not
     actually the largest MPI message payload size.  Terry and Jeff,
     at least, find this misleading. :-)  Additionally, BTLs may add
     their own (variable-sized) headers beyond the eager_limit size,
     so it's not possible for a network administrator to shape network
     traffic because they don't (and can't) know what a BTL's max wire
     fragment size is.

WHERE: ompi/pml/{ob1,csum,dr}, and likely all BTLs

TIMEOUT: COB, Friday, 31 July 2009

DESCRIPTION:

In trying to fix the checks for eager_limit in the OB1 PML (per
discussion on the OMPI teleconf this past Tuesday), I've come across
a couple of gaps.  This RFC is to get others' (mainly Brian Barrett's
and George Bosilca's) opinions on exactly what should be done for
issue #1 and the OK to implement issue #2.

1. The btl_XXX_eager_limit values are the upper bound on the payload
   of each eager fragment, but that payload must include the PML
   header.  Hence, the max MPI data payload size is
   (btl_XXX_eager_limit - PML header size); and the exact header size
   depends on which flavor of PML send you are using.
   Terry and Jeff find this misleading.  Specifically, if a user sets
   the eager_limit to 1024 bytes and expects their 256 MPI_INTs to
   fit in an eager message, they're wrong (a small sketch of this
   arithmetic follows after the two options below).  Additionally,
   network administrators who try to adjust the eager_limit to fit
   the max MTU size of their networks are unpleasantly surprised
   because the BTL may actually send (btl_XXX_eager_limit +
   btl_XXX_header_size) bytes at a time.  Even worse, the value of
   btl_XXX_header_size is not published anywhere, so a network
   administrator cannot know whether they're actually going over the
   MTU size or not.

   --> Note that we only looked at eager_limit -- similar issues
       likely also exist with btl_XXX_max_send_size, and possibly
       btl_XXX_rdma_pipeline_send_length...?
       btl_XXX_rdma_pipeline_frag_size (i.e., the RDMA size) should be
       ok -- I *think* it's an absolute payload size already.  If you
       don't remember what these names mean, look at the pretty
       picture here:

 http://www.open-mpi.org/faq/?category=openfabrics#large-message-tuning-1.3

   There are two solutions I can think of.  Which should we do?

   a. Pass the (max?) PML header size down into the BTL during
      initialization such that btl_XXX_eager_limit can represent the
      max MPI data payload size (i.e., the BTL can size its buffers
      to accommodate its desired max eager payload size, its own
      header size, and the PML header size).  Thus, the eager_limit
      can truly be the MPI data payload size -- and easy to explain
      to users.

   b. Stay with the current btl_XXX_eager_limit implementation (which
      OMPI has had for a long, long time) and add code to check for a
      btl_eager_limit that is smaller than the PML header size (per
      this past Tuesday's discussion).  This is the smallest possible
      change.
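
   To make the arithmetic concrete, here is a toy sketch of today's
   semantics plus the 1b-style sanity check.  The header size below
   is a made-up placeholder (the real size varies by PML send
   flavor), and none of these names are actual OMPI code:

      #include <stdio.h>
      #include <stdlib.h>

      /* Placeholder values -- NOT actual OMPI constants */
      #define EAGER_LIMIT  1024   /* e.g., btl_XXX_eager_limit */
      #define PML_HDR_SIZE   16   /* varies by PML send flavor */

      int main(void) {
          /* 1b-style sanity check: an eager_limit smaller than the
             PML header leaves no room for any MPI payload at all */
          if (EAGER_LIMIT < PML_HDR_SIZE) {
              fprintf(stderr, "eager_limit too small\n");
              return EXIT_FAILURE;
          }

          /* Today's semantics: the PML header comes out of the
             eager_limit, so the max MPI payload is what's left */
          size_t max_payload = EAGER_LIMIT - PML_HDR_SIZE;
          size_t msg = 256 * sizeof(int);  /* 1024 bytes w/4-byte int */

          printf("max eager MPI payload: %zu bytes\n", max_payload);
          printf("256 MPI_INTs (%zu bytes) sent eagerly? %s\n",
                 msg, msg <= max_payload ? "yes" : "no");
          return EXIT_SUCCESS;
      }

   Under option 1a, max_payload would simply equal EAGER_LIMIT, which
   is the behavior users seem to expect.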

2. OMPI currently does not publish enough information for a user to
   set the eager_limit to do BTL traffic shaping.  That is, one
   really needs to know the (max) BTL header length and the (max)
   PML header length to be able to calculate the eager_limit that
   forces a specific (max) BTL wire fragment size.  Our proposed
   solution is to have ompi_info print out the (max) PML and BTL
   header sizes.  Regardless of whether 1a) or 1b) is chosen, with
   these two pieces of information, a determined network
   administrator could calculate the max wire fragment size used by
   OMPI, and therefore be able to do at least some traffic shaping.
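
   For reference, the calculation this would enable looks like the
   following sketch.  The header sizes are made-up placeholders;
   under this proposal, the real (max) values would come from
   ompi_info:

      #include <stdio.h>

      /* Placeholder values -- NOT actual OMPI constants */
      #define EAGER_LIMIT        1024
      #define MAX_PML_HDR_SIZE     16  /* would come from ompi_info */
      #define MAX_BTL_HDR_SIZE     20  /* would come from ompi_info */
      #define NETWORK_MTU        2048

      int main(void) {
          /* Per the WHAT above: max wire fragment = max MPI payload
             + max PML header + max BTL header.  With today's
             semantics this reduces to eager_limit + BTL header. */
          size_t max_payload = EAGER_LIMIT - MAX_PML_HDR_SIZE;
          size_t wire_frag   = max_payload + MAX_PML_HDR_SIZE
                               + MAX_BTL_HDR_SIZE;

          printf("max wire fragment: %zu bytes -- %s a %d-byte MTU\n",
                 wire_frag,
                 wire_frag <= NETWORK_MTU ? "fits in" : "exceeds",
                 NETWORK_MTU);
          return 0;
      }

   An administrator could then lower the eager_limit until wire_frag
   fits within their MTU.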

--
Jeff Squyres
jsquy...@cisco.com
