iboffload and bfo are opal ignored by default. Neither exists in the release branch.
-Nathan

On Wed, Sep 16, 2015 at 12:02:29PM -0400, George Bosilca wrote:
> While looking into a possible fix for this problem we should also clean up
> the leftovers from the OMPI_FREE_LIST in the trunk.
>
> $ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
> ./opal/mca/btl/usnic/btl_usnic_compat.h:161:
>     OMPI_FREE_LIST_GET_MT(list, (item))
> ./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89:
>     OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item); \
> ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149:
>     OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206:
>     OMPI_FREE_LIST_GET_MT(task_list, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107:
>     OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146:
>     OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208:
>     OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156:
>     OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130:
>     OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115:
>     OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);
>
> I wonder how these are even compiling ...
>
> George.
>
> On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <bosi...@icl.utk.edu>
> wrote:
>
> Alexey,
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more
> elements are available. Based on your description of the problem I would
> assume you encounter this problem once
> MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this particular case the
> problem is the fact that we call OMPI_FREE_LIST_GET_MT and that the
> upper level is unable to correctly deal with the case where the returned
> item is NULL. Here the real fix is to use the blocking version of the
> free_list accessor (similar to the case for send), OMPI_FREE_LIST_WAIT_MT.
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
> George.
>
> On Wed, Sep 16, 2015 at 9:03 AM, Aleksej Ryzhih
> <avryzh...@compcenter.org> wrote:
>
> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application
> (MPI_THREAD_MULTIPLE support level) where several threads submit a
> lot of MPI_Irecv() requests simultaneously and encountered an
> intermittent bug, OMPI_ERR_TEMP_OUT_OF_RESOURCE after
> MCA_PML_OB1_RECV_REQUEST_ALLOC(), because OMPI_FREE_LIST_GET_MT()
> returned NULL. Investigating this bug we found that sometimes the
> thread calling ompi_free_list_grow() doesn't have any free items in the
> LIFO list at exit because other threads retrieved all new items at
> opal_atomic_lifo_pop().
>
> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                     \
>     {                                                                       \
>         item = (ompi_free_list_item_t*)                                     \
>             opal_atomic_lifo_pop(&((fl)->super));                           \
>         if( OPAL_UNLIKELY(NULL == item) ) {                                 \
>             if(opal_using_threads()) {                                      \
>                 int rc;                                                     \
>                 opal_mutex_lock(&((fl)->fl_lock));                          \
>                 do                                                          \
>                 {                                                           \
>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc); \
>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS)) break;           \
>                                                                             \
>                     item = (ompi_free_list_item_t*)                         \
>                         opal_atomic_lifo_pop(&((fl)->super));               \
>                                                                             \
>                 } while (!item);                                            \
>                 opal_mutex_unlock(&((fl)->fl_lock));                        \
>             } else {                                                        \
>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);          \
>                 item = (ompi_free_list_item_t*)                             \
>                     opal_atomic_lifo_pop(&((fl)->super));                   \
>             } /* opal_using_threads() */                                    \
>         } /* NULL == item */                                                \
>     }
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
>
> Regards,
> Alexey
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18039.php
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18046.php
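
For readers following the thread, here is a minimal single-threaded sketch of the behaviour Alexey describes: a grow-once-then-pop accessor can come back with NULL even though the grow succeeded, while a retry loop keeps going until it gets an item or the grow itself fails. None of this is Open MPI code. The types and functions (item_t, free_list_t, fl_pop, fl_grow, fl_get_once, fl_get_retry) are hypothetical stand-ins, and the "other threads" are only simulated by the steals_left counter, which drains the list right after a grow. It is not the blocking OMPI_FREE_LIST_WAIT_MT approach George suggests, and it does no locking at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for illustration; not Open MPI types. */
typedef struct item { struct item *next; } item_t;

typedef struct {
    item_t *head;        /* LIFO of free items */
    size_t  num_alloc;   /* items allocated so far */
    size_t  max_alloc;   /* hard cap; growing past it fails */
    size_t  per_grow;    /* items added by each grow */
    int     steals_left; /* simulated rivals: grows after which the list is drained */
} free_list_t;

item_t *fl_pop(free_list_t *fl)
{
    item_t *it = fl->head;
    if (it != NULL) {
        fl->head = it->next;
    }
    return it;
}

void fl_push(free_list_t *fl, item_t *it)
{
    it->next = fl->head;
    fl->head = it;
}

/* Returns 0 on success, -1 if the cap would be exceeded or malloc fails. */
int fl_grow(free_list_t *fl)
{
    assert(fl->per_grow > 0);
    if (fl->num_alloc + fl->per_grow > fl->max_alloc) {
        return -1;
    }
    for (size_t i = 0; i < fl->per_grow; i++) {
        item_t *it = malloc(sizeof *it);
        if (it == NULL) {
            return -1;
        }
        fl_push(fl, it);
        fl->num_alloc++;
    }
    /* Simulate other threads popping every new item before the grower
     * gets a chance. The popped items are "in use elsewhere"; this
     * sketch simply never gets them back. */
    if (fl->steals_left > 0) {
        fl->steals_left--;
        while (fl_pop(fl) != NULL) {
            /* drained by simulated rivals */
        }
    }
    return 0;
}

/* Grow once, pop once: may return NULL although the grow succeeded. */
item_t *fl_get_once(free_list_t *fl)
{
    item_t *it = fl_pop(fl);
    if (it == NULL) {
        (void) fl_grow(fl);
        it = fl_pop(fl);
    }
    return it;
}

/* Retry loop: stop only when an item is obtained or the grow fails. */
item_t *fl_get_retry(free_list_t *fl)
{
    item_t *it = fl_pop(fl);
    while (it == NULL) {
        if (fl_grow(fl) != 0) {
            break;
        }
        it = fl_pop(fl);
    }
    return it;
}
```

With a list initialised as { NULL, 0, 8, 2, 1 } (cap 8, 2 items per grow, one simulated steal), fl_get_once() returns NULL while fl_get_retry() returns an item after a second grow. If the cap is reached first, the retry version also returns NULL, so the caller still has to handle that case.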