While looking into a possible fix for this problem, we should also clean up the OMPI_FREE_LIST leftovers in the trunk.
$ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
./opal/mca/btl/usnic/btl_usnic_compat.h:161:    OMPI_FREE_LIST_GET_MT(list, (item))
./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89:        OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item); \
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149:    OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206:    OMPI_FREE_LIST_GET_MT(task_list, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208:    OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130:    OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115:    OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);

I wonder how these are even compiling ...

George.

On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <bosi...@icl.utk.edu> wrote:

> Alexey,
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free list can easily accommodate the fact that no more elements
> are available. Based on your description of the problem, I assume you
> encounter it once MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In that
> particular case the problem is the fact that we call OMPI_FREE_LIST_GET_MT
> and the upper level is unable to deal correctly with a returned item that
> is NULL. The real fix there is to use the blocking version of the
> free-list accessor (as is done on the send side), OMPI_FREE_LIST_WAIT_MT.
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
> George.
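As a rough illustration of George's point (a sketch only, using the macro and free-list names that appear in this thread, and assuming OMPI_FREE_LIST_WAIT_MT takes the same (list, item) arguments as the GET variant):

    /* Sketch, not the actual pml/ob1 code. */
    ompi_free_list_item_t *item;

    /* Non-blocking accessor: may return item == NULL when the list is
     * temporarily exhausted, so the caller must handle that case. */
    OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item);
    if (OPAL_UNLIKELY(NULL == item)) {
        /* this is where the recv path currently bails out with
         * OMPI_ERR_TEMP_OUT_OF_RESOURCE */
    }

    /* Blocking accessor: keeps growing the list and/or waiting until an
     * element becomes available, so the caller never sees NULL. */
    OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);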
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <avryzh...@compcenter.org> wrote:
>
>> Hi all,
>>
>> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
>> support level) in which several threads submit a lot of MPI_Irecv()
>> requests simultaneously, and we encountered an intermittent
>> OMPI_ERR_TEMP_OUT_OF_RESOURCE failure after MCA_PML_OB1_RECV_REQUEST_ALLOC()
>> because OMPI_FREE_LIST_GET_MT() returned NULL.
>>
>> Investigating this bug, we found that the thread calling
>> ompi_free_list_grow() sometimes has no free items left in the LIFO list on
>> exit, because the other threads have already retrieved all of the newly
>> grown items via opal_atomic_lifo_pop().
>>
>> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>>
>>   #define OMPI_FREE_LIST_GET_MT(fl, item)                                       \
>>   {                                                                             \
>>       item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));     \
>>       if( OPAL_UNLIKELY(NULL == item) ) {                                       \
>>           if( opal_using_threads() ) {                                          \
>>               int rc;                                                           \
>>               opal_mutex_lock(&((fl)->fl_lock));                                \
>>               do {                                                              \
>>                   rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);       \
>>                   if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;                \
>>                   item = (ompi_free_list_item_t*)                               \
>>                       opal_atomic_lifo_pop(&((fl)->super));                     \
>>               } while (!item);                                                  \
>>               opal_mutex_unlock(&((fl)->fl_lock));                              \
>>           } else {                                                              \
>>               ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);                \
>>               item = (ompi_free_list_item_t*)                                   \
>>                   opal_atomic_lifo_pop(&((fl)->super));                         \
>>           } /* opal_using_threads() */                                          \
>>       } /* NULL == item */                                                      \
>>   }
>>
>> Another workaround is to increase the value of the pml_ob1_free_list_inc
>> parameter.
>>
>> Regards,
>> Alexey
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/09/18039.php
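For reference, the workaround Alexey mentions can be tried at run time through the usual MCA parameter mechanism; the increment value, process count, and application name below are only illustrative, not recommended settings:

    # grow the ob1 free lists in larger chunks per ompi_free_list_grow() call
    mpirun --mca pml_ob1_free_list_inc 256 -np 16 ./hybrid_app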