OK, r21142 should fix the problem for the app. I tested it with a number of scenarios (e.g. all intra-comm cases, inter-comm cases, intercomm_merge etc.), but I would suggest letting at least one night of MTT runs go over it before we file a CMR for 1.3.

Thanks
Edgar


On Fri, May 1, 2009 at 6:38 AM, Ralph Castain <r...@open-mpi.org> wrote:
      I'm not entirely sure if David is going to be in today, so I
      will answer for him (and let him correct me later!).

      This code is indeed representative of what the app is doing.
      Basically, the user repeatedly splits the communicator so he
      can run mini test cases before going on to the larger
      computation. So it is always the base communicator being
      repeatedly split and freed.

      I would suspect, therefore, that the quick fix would serve us
      just fine while the worst case is later resolved.

      Thanks
      Ralph


On Fri, May 1, 2009 at 6:08 AM, Edgar Gabriel <gabr...@cs.uh.edu> wrote:
      David,

      Is this code representative of what your app is doing?
      E.g. you have a base communicator (e.g. MPI_COMM_WORLD)
      which is being 'split', freed again, split, freed again,
      etc.? I.e. the important aspect is that the same 'base'
      communicator is being used for deriving new communicators
      again and again?

      The reason I ask is two-fold: one, you would in that
      case be one of the ideal beneficiaries of the block cid
      algorithm :-) (even if it fails you right now); two, a
      fix for this scenario, which basically tries to reuse
      the last block used (and which would fix your case if
      the condition is true), is roughly five lines of code.
      This would give us the possibility to have a fix
      quickly in the trunk and v1.3 (keep in mind that the
      block-cid code has been in the trunk for two years and
      this is the first problem that we have had) and give us
      more time to develop a thorough solution for the worst
      case: a chain of communicators being created, e.g.
      communicator 1 is the basis for deriving a new comm 2,
      comm 2 is used to derive comm 3, etc.

      Thanks
      Edgar
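
To make the failure mode and the proposed quick fix concrete, here is a minimal single-process sketch in C. It is a toy model under stated assumptions, not the change committed in r21142 and not Open MPI's actual allocator. The names toy_base_comm, toy_alloc_cid, toy_free_cid and toy_split_free_loop are hypothetical, the ceiling of 65535 is taken from the cid value reported in this thread, and the sketch tracks single ids rather than blocks and ignores the cross-process agreement the real algorithm needs because comm_free is not collective.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model for illustration only; not Open MPI code. */
#define TOY_MAX_CID 65535u  /* assumed ceiling, from the cid reported in the thread */

typedef struct {
    uint32_t next_cid;    /* next id that has never been handed out */
    bool     reuse_last;  /* true: roll back when the newest id is freed */
} toy_base_comm;

/* Hand out an id for a communicator derived from `base`.
 * Returns UINT32_MAX once the id space is exhausted. */
uint32_t toy_alloc_cid(toy_base_comm *base)
{
    if (base->next_cid > TOY_MAX_CID) {
        return UINT32_MAX;
    }
    return base->next_cid++;
}

/* Release an id. Without reuse_last the id is simply lost (ids are
 * never recycled). With reuse_last, freeing the most recently
 * allocated id makes it available again. */
void toy_free_cid(toy_base_comm *base, uint32_t cid)
{
    assert(cid < base->next_cid);
    if (base->reuse_last && cid + 1 == base->next_cid) {
        base->next_cid = cid;
    }
}

/* Mimic the reproducer: split and free from the same base communicator
 * `iterations` times. Returns how many cycles succeeded. */
uint32_t toy_split_free_loop(bool reuse_last, uint32_t iterations)
{
    /* id 0 is assumed to belong to the base communicator itself */
    toy_base_comm base = { .next_cid = 1, .reuse_last = reuse_last };
    uint32_t done = 0;

    for (uint32_t n = 0; n < iterations; n++) {
        uint32_t cid = toy_alloc_cid(&base);
        if (cid == UINT32_MAX) {
            break;  /* id space exhausted */
        }
        toy_free_cid(&base, cid);
        done++;
    }
    return done;
}
```

Under these assumptions, toy_split_free_loop(false, 100000) stops after 65535 cycles, while toy_split_free_loop(true, 100000) completes all 100000, which mirrors the split/free pattern of the reproducer.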

      David Gunter wrote:
            Here is the test code reproducer:

                  program test2
                  implicit none
                  include 'mpif.h'
                  integer ierr, myid, numprocs,i1,i2,n,local_comm,
                 $     icolor,ikey,rank,root

            c
            c...  MPI set-up
                  ierr = 0
                  call MPI_INIT(IERR)
                  ierr = 1
                  CALL MPI_COMM_SIZE(MPI_COMM_WORLD, numprocs, ierr)
                  print *, ierr

                  ierr = -1

                  CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)

                  ierr = -5
                  i1 = ierr
                  if (myid.eq.0) i1 = 1
                  call mpi_allreduce(i1, i2, 1,MPI_integer,MPI_MIN,
                 $     MPI_COMM_WORLD,ierr)

                  ikey = myid
                  if (mod(myid,2).eq.0) then
                     icolor = 0
                  else
                     icolor = MPI_UNDEFINED
                  end if

                  root = 0
                  do n = 1, 100000

                     call MPI_COMM_SPLIT(MPI_COMM_WORLD, icolor,
                 $        ikey, local_comm, ierr)

                     if (mod(myid,2).eq.0) then
                        CALL MPI_COMM_RANK(local_comm, rank, ierr)
                        i2 = i1
                        call mpi_reduce(i1, i2, 1,MPI_integer,MPI_MIN,
                 $           root, local_comm,ierr)

                        if (myid.eq.0.and.mod(n,10).eq.0)
                 $           print *, n, i1, i2,icolor,ikey

                        call mpi_comm_free(local_comm, ierr)
                     end if

                  end do
            c      if (icolor.eq.0) call mpi_comm_free(local_comm, ierr)

                  call MPI_barrier(MPi_COMM_WORLD,ierr)

                  call MPI_FINALIZE(IERR)

                  print *, myid, ierr

                  end



            -david
            --
            David Gunter
            HPC-3: Parallel Tools Team
            Los Alamos National Laboratory



            On Apr 30, 2009, at 12:43 PM, David Gunter wrote:

                  Just to throw out more info on this, the test code runs fine
                  on previous versions of OMPI. It only hangs on the 1.3 line
                  when the cid reaches 65536.

                  -david
                  --
                  David Gunter
                  HPC-3: Parallel Tools Team
                  Los Alamos National Laboratory



                  On Apr 30, 2009, at 12:28 PM, Edgar Gabriel wrote:

                        cid's are in fact not recycled in the block algorithm.
                        The problem is that comm_free is not collective, so you
                        cannot make any assumptions about whether other procs
                        have also released that communicator.

                        But nevertheless, a cid in the communicator structure
                        is a uint32_t, so it should not hit the 16-bit limit
                        there yet. This is not new, so if there is a
                        discrepancy between what the comm structure assumes a
                        cid is and what the pml assumes, then it has been in
                        the code since the very first days of Open MPI...

                        Thanks
                        Edgar
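
The kind of discrepancy described above can be sketched in C as follows. This assumes, purely for illustration, that the PML carries the context id in a 16-bit header field while the communicator stores a 32-bit cid. The names toy_match_hdr, toy_pack_hdr and toy_cid_fits are hypothetical, not Open MPI identifiers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical wire header for illustration; the 16-bit width of the
 * context field is an assumption, not a statement about Open MPI. */
typedef struct {
    uint16_t ctx;  /* context id as carried in the message header */
} toy_match_hdr;

/* Pack a 32-bit communicator id into the header. The cast keeps only
 * the low 16 bits, so ids above 65535 wrap around silently. */
toy_match_hdr toy_pack_hdr(uint32_t cid)
{
    toy_match_hdr hdr;
    hdr.ctx = (uint16_t)cid;
    return hdr;
}

/* A guard that an allocator could apply before accepting a new cid:
 * returns 1 if the id survives the trip through the header unchanged. */
int toy_cid_fits(uint32_t cid)
{
    return cid <= UINT16_MAX;
}
```

With such a layout, cid 65536 would be packed as context 0 and become indistinguishable from the communicator that really owns cid 0, which could explain a hang once the cid reaches 65536.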

                        Brian W. Barrett wrote:
                              On Thu, 30 Apr 2009, Ralph Castain wrote:
                                    We seem to have hit a problem here - it looks like
                                    we are seeing a built-in limit on the number of
                                    communicators one can create in a program. The
                                    program basically does a loop, calling
                                    MPI_Comm_split each time through the loop to
                                    create a sub-communicator, does a reduce operation
                                    on the members of the sub-communicator, and then
                                    calls MPI_Comm_free to release it (this is a
                                    minimized reproducer for the real code). After 64k
                                    times through the loop, the program fails.

                                    This looks remarkably like a 16-bit index that
                                    hits a max value and then blocks.

                                    I have looked at the communicator code, but I
                                    don't immediately see such a field. Is anyone
                                    aware of some other place where we would have a
                                    limit that would cause this problem?

                              There's a maximum of 32768 communicator ids when using
                              OB1 (each PML can set the max contextid, although the
                              communicator code is the part that actually assigns a
                              cid). Assuming that comm_free is actually properly
                              called, there should be plenty of cids available for
                              that pattern. However, I'm not sure I understand the
                              block algorithm someone added to cid allocation - I'd
                              have to guess that there's something funny with that
                              routine and cids aren't being recycled properly.

                              Brian
                              _______________________________________________
                              devel mailing list
                              de...@open-mpi.org
                              http://www.open-mpi.org/mailman/listinfo.cgi/devel


                        --
                        Edgar Gabriel
                        Assistant Professor
                        Parallel Software Technologies Lab      http://pstl.cs.uh.edu
                        Department of Computer Science          University of Houston
                        Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
                        Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335













--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab      http://pstl.cs.uh.edu
Department of Computer Science          University of Houston
Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335
