On Tue, Sep 11, 2007 at 10:00:07AM -0500, Edgar Gabriel wrote:
> Gleb,
> 
> in the scenario which you describe in the comment to the patch, what 
> should happen is, that the communicator with the cid which started 
> already the allreduce will basically 'hang' until the other processes 
> 'allow' the lower cids to continue. It should basically be blocked in 
> the allreduce.
Why? Two threads are allowed to run allreduce simultaneously on different
communicators, aren't they?
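For instance (a minimal sketch, not the actual Sun test; error checking
omitted), something like this has to be legal when the library provides
MPI_THREAD_MULTIPLE:

#include <mpi.h>
#include <pthread.h>
#include <stddef.h>

/* each thread does an allreduce on its own communicator */
static void *worker(void *arg)
{
    MPI_Comm comm = *(MPI_Comm *)arg;
    int in = 1, out = 0;
    MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, comm);
    return NULL;
}

int main(int argc, char **argv)
{
    int i, provided;
    MPI_Comm comms[2];
    pthread_t threads[2];

    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    /* two distinct communicators, one per thread */
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[0]);
    MPI_Comm_dup(MPI_COMM_WORLD, &comms[1]);

    for (i = 0; i < 2; i++)
        pthread_create(&threads[i], NULL, worker, &comms[i]);
    for (i = 0; i < 2; i++)
        pthread_join(threads[i], NULL);

    MPI_Comm_free(&comms[0]);
    MPI_Comm_free(&comms[1]);
    MPI_Finalize();
    return 0;
}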

> 
> However, here is something where we might have problems with the Sun 
> thread tests (and we discussed this with Terry already): the cid 
> allocation algorithm as implemented in Open MPI assumes ( -- this was/is 
> my/our understanding of the standard --) that a communicator creation is 
> a collective operation. This means you cannot have a comm_create and 
> another allreduce on the same communicator running in different threads, 
> because these allreduces will mix up and produce nonsense results. We 
> fixed the case where all collective operations are comm_creates, but if 
> some of the threads are in a comm_create and some are in an allreduce on 
> the same communicator, it won't work; the pattern is sketched below.
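> In code, the pattern that won't work looks roughly like this (a sketch, 
> not taken from the test suite; thread setup omitted):
> 
> #include <mpi.h>
> 
> MPI_Comm oldcomm;          /* shared by both threads */
> MPI_Group group;
> 
> /* thread 1 runs a communicator creation on oldcomm ... */
> void *thread1(void *arg)
> {
>     MPI_Comm newcomm;
>     (void)arg;
>     MPI_Comm_create(oldcomm, group, &newcomm);
>     return 0;
> }
> 
> /* ... while thread 2 runs a user allreduce on the SAME oldcomm;
>  * the allreduce inside the cid allocation and the user's allreduce
>  * can then be matched against each other */
> void *thread2(void *arg)
> {
>     int in = 1, out;
>     (void)arg;
>     MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, oldcomm);
>     return 0;
> }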
Correct, but this is not what happens with the mt_coll test. mt_coll calls
comm_dup on the same communicator in different threads concurrently, but
we handle this case inside ompi_comm_nextcid(); a toy model of the idea is
sketched below.
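Roughly, the idea is the one in this toy model (pthreads only; a simplified
sketch, not the real comm_cid.c code -- register_cid/unregister_cid here
just mimic what ompi_comm_register_cid()/ompi_comm_unregister_cid() are for):

#include <pthread.h>
#include <stdint.h>

#define MAX_INFLIGHT 16

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static uint32_t inflight[MAX_INFLIGHT];   /* cids with a creation in flight */
static int n_inflight = 0;

static int is_registered(uint32_t cid)
{
    int i;
    for (i = 0; i < n_inflight; i++)
        if (inflight[i] == cid)
            return 1;
    return 0;
}

/* cf. ompi_comm_register_cid(): a second creation from the same
 * parent communicator blocks until the first one finishes */
void register_cid(uint32_t cid)
{
    pthread_mutex_lock(&lock);
    while (is_registered(cid))
        pthread_cond_wait(&cond, &lock);
    inflight[n_inflight++] = cid;
    pthread_mutex_unlock(&lock);
}

/* cf. ompi_comm_unregister_cid(): drop OUR cid (not simply the
 * first list entry) and wake up any waiting creation */
void unregister_cid(uint32_t cid)
{
    int i;
    pthread_mutex_lock(&lock);
    for (i = 0; i < n_inflight; i++) {
        if (inflight[i] == cid) {
            inflight[i] = inflight[--n_inflight];
            break;
        }
    }
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
}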

> 
> 
> Gleb Natapov wrote:
> > On Tue, Sep 11, 2007 at 10:14:30AM -0400, George Bosilca wrote:
> >> Gleb,
> >>
> >> This patch is not correct. The code preventing the registration of 
> >> the same communicator twice is later in the code (same file, in the 
> >> function ompi_comm_register_cid, line 326). Once the function 
> >> ompi_comm_register_cid
> > I saw this code and the comment. The problem is not with the same
> > communicator but with different communicators.
> > 
> >> is called, we know that each communicator only handles one 
> >> "communicator creation" function at the same time. Therefore, we want 
> >> to give priority to the smallest com_id, which is what happens in the 
> >> code you removed.
> > The code I removed was doing it wrongly, i.e. the algorithm is
> > sometimes executed for different communicators simultaneously by
> > different threads. Think about the case where the function is running
> > for cid 1 and then another thread runs it for cid 0: cid 0 will
> > proceed even though the function is already running on another CPU.
> > And this is not something theoretical; it is happening with Sun's
> > thread test suite's mt_coll test case. The interleaving is sketched
> > below.
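> > To make this explicit (the check below is the one the patch removes,
> > verbatim; the timeline is a sketch):
> >
> > /* removed check in ompi_comm_nextcid(): */
> > OPAL_THREAD_LOCK(&ompi_cid_lock);
> > if (comm->c_contextid != ompi_comm_lowest_cid() ) {
> >     /* if not lowest cid, we do not continue, but sleep and try again */
> >     OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >     continue;
> > }
> > OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >
> > /*
> >  * thread A (cid 1)                    thread B (cid 0)
> >  * ----------------                    ----------------
> >  * register cid 1 (list: {1})
> >  * lowest_cid() == 1 -> proceed
> >  * enter the allocation loop ...
> >  *                                     register cid 0 (list: {0, 1})
> >  *                                     lowest_cid() == 0 -> proceed
> >  *
> >  * Both threads now run the algorithm at once, and whichever finishes
> >  * first unregisters -- but the old ompi_comm_unregister_cid() always
> >  * removed the FIRST list entry, i.e. the lowest cid, which may belong
> >  * to the other thread.
> >  */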
> > 
> >> Without the condition in ompi_comm_register_cid (each communicator 
> >> only gets registered once) your comment makes sense. However, with 
> >> the condition, your patch allows a deadlock situation where 2 
> >> processes try to create communicators in multiple threads and never 
> >> succeed, simply because they do not order the creation based on the 
> >> com_id.
> > If the algorithm is really prone to deadlock in case it is concurrently
> > executed for several different communicators (I haven't checked this),
> > then we may want to fix the original code to really prevent two threads
> > from entering the function, but then I don't see the reason for all those
> > complications with ompi_comm_register_cid()/ompi_comm_unregister_cid().
> > The algorithm described here:
> > http://209.85.129.104/search?q=cache:5PV5MMRkBWkJ:ftp://info.mcs.anl.gov/pub/tech_reports/reports/P1382.pdf+MPI+communicator+dup+algorithm&hl=en&ct=clnk&cd=2
> > in section 5.3 works without it, and we can do something similar; a
> > sketch of that flavor of algorithm follows.
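> > As I read it, that algorithm is mask-based. A simplified sketch of the
> > flavor (the code and helper names below are mine, not the paper's, and
> > the details surely differ):
> >
> > #include <mpi.h>
> > #include <pthread.h>
> > #include <string.h>
> >
> > #define CID_WORDS 32                /* 32*32 = 1024 context ids */
> >
> > static unsigned free_cids[CID_WORDS];  /* bit set => cid is free */
> > static int mask_in_use = 0;
> > static pthread_mutex_t mask_lock = PTHREAD_MUTEX_INITIALIZER;
> >
> > void cid_pool_init(void)            /* call once, before any allocation */
> > {
> >     memset(free_cids, 0xff, sizeof(free_cids));
> > }
> >
> > /* collective over comm; every thread of a process may call it, but only
> >  * one of them gets the real mask -- the others contribute zeros, the
> >  * allreduce result is then zero and everybody simply retries */
> > int alloc_cid(MPI_Comm comm)
> > {
> >     unsigned local[CID_WORDS], agreed[CID_WORDS];
> >     int w, b, cid, have_mask;
> >
> >     for (;;) {
> >         pthread_mutex_lock(&mask_lock);
> >         have_mask = !mask_in_use;
> >         if (have_mask) {
> >             mask_in_use = 1;
> >             memcpy(local, free_cids, sizeof(local));
> >         } else {
> >             memset(local, 0, sizeof(local));
> >         }
> >         pthread_mutex_unlock(&mask_lock);
> >
> >         /* bitwise AND: a bit survives only if that cid is free everywhere */
> >         MPI_Allreduce(local, agreed, CID_WORDS, MPI_UNSIGNED, MPI_BAND, comm);
> >
> >         cid = -1;                   /* pick the lowest agreed-on cid */
> >         for (w = 0; w < CID_WORDS && cid < 0; w++)
> >             for (b = 0; b < 32; b++)
> >                 if (agreed[w] & (1u << b)) { cid = 32 * w + b; break; }
> >
> >         pthread_mutex_lock(&mask_lock);
> >         if (cid >= 0)               /* nonzero result implies every rank
> >                                      * held its mask; all claim the cid */
> >             free_cids[cid / 32] &= ~(1u << (cid % 32));
> >         if (have_mask)
> >             mask_in_use = 0;
> >         pthread_mutex_unlock(&mask_lock);
> >
> >         if (cid >= 0)
> >             return cid;
> >         /* no agreement: retry (no registration list needed) */
> >     }
> > }
> >
> > (Fairness/livelock handling omitted; this is just the shape of it.)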
> > 
> >>   george.
> >>
> >>
> >>
> >> On Sep 11, 2007, at 9:23 AM, g...@osl.iu.edu wrote:
> >>
> >>> Author: gleb
> >>> Date: 2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
> >>> New Revision: 16088
> >>> URL: https://svn.open-mpi.org/trac/ompi/changeset/16088
> >>>
> >>> Log:
> >>> The code tries to prevent itself from running for more than one
> >>> communicator simultaneously, but is doing it incorrectly. If the
> >>> function is already running for one communicator and it is called
> >>> from another thread for another communicator with a lower cid, the
> >>> check comm->c_contextid != ompi_comm_lowest_cid() will fail and the
> >>> function will be executed for two different communicators by two
> >>> threads simultaneously. There is nothing in the algorithm that
> >>> prevents it from running simultaneously for different communicators,
> >>> as far as I can see, but ompi_comm_unregister_cid() assumes that it
> >>> is always called for the communicator with the lowest cid, and this
> >>> is not always the case. This patch removes the bogus lowest-cid
> >>> check and fixes ompi_comm_unregister_cid() to properly remove the
> >>> cid from the list.
> >>>
> >>> Text files modified:
> >>>    trunk/ompi/communicator/comm_cid.c |    24 ++++++++++++------------
> >>>    1 files changed, 12 insertions(+), 12 deletions(-)
> >>>
> >>> Modified: trunk/ompi/communicator/comm_cid.c
> >>> ==============================================================================
> >>> --- trunk/ompi/communicator/comm_cid.c    (original)
> >>> +++ trunk/ompi/communicator/comm_cid.c    2007-09-11 09:23:46 EDT (Tue, 11 Sep 2007)
> >>> @@ -11,6 +11,7 @@
> >>>   *                         All rights reserved.
> >>>   * Copyright (c) 2006-2007 University of Houston. All rights reserved.
> >>>   * Copyright (c) 2007      Cisco, Inc.  All rights reserved.
> >>> + * Copyright (c) 2007      Voltaire All rights reserved.
> >>>   * $COPYRIGHT$
> >>>   *
> >>>   * Additional copyrights may follow
> >>> @@ -170,15 +171,6 @@
> >>>               * This is the real algorithm described in the doc
> >>>               */
> >>>
> >>> -            OPAL_THREAD_LOCK(&ompi_cid_lock);
> >>> -            if (comm->c_contextid != ompi_comm_lowest_cid() ) {
> >>> -                /* if not lowest cid, we do not continue, but sleep and try again */
> >>> -                OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >>> -                continue;
> >>> -            }
> >>> -            OPAL_THREAD_UNLOCK(&ompi_cid_lock);
> >>> -
> >>> -
> >>>              for (i=start; i < mca_pml.pml_max_contextid ; i++) {
> >>>                  flag=ompi_pointer_array_test_and_set_item(&ompi_mpi_communicators,
> >>>                                                            i, comm);
> >>> @@ -365,10 +357,18 @@
> >>>
> >>>  static int ompi_comm_unregister_cid (uint32_t cid)
> >>>  {
> >>> -    ompi_comm_reg_t *regcom=NULL;
> >>> -    opal_list_item_t *item=opal_list_remove_first(&ompi_registered_comms);
> >>> +    ompi_comm_reg_t *regcom;
> >>> +    opal_list_item_t *item;
> >>>
> >>> -    regcom = (ompi_comm_reg_t *) item;
> >>> +    for (item = opal_list_get_first(&ompi_registered_comms);
> >>> +         item != opal_list_get_end(&ompi_registered_comms);
> >>> +         item = opal_list_get_next(item)) {
> >>> +        regcom = (ompi_comm_reg_t *)item;
> >>> +        if(regcom->cid == cid) {
> >>> +            opal_list_remove_item(&ompi_registered_comms, item);
> >>> +            break;
> >>> +        }
> >>> +    }
> >>>      OBJ_RELEASE(regcom);
> >>>      return OMPI_SUCCESS;
> >>>  }
> > 
> > --
> >                     Gleb.
> 
> -- 
> Edgar Gabriel
> Assistant Professor
> Parallel Software Technologies Lab      http://pstl.cs.uh.edu
> Department of Computer Science          University of Houston
> Philip G. Hoffman Hall, Room 524        Houston, TX-77204, USA
> Tel: +1 (713) 743-3857                  Fax: +1 (713) 743-3335

--
                        Gleb.
