While looking into a possible fix for this problem, we should also clean up
the leftovers from the old OMPI_FREE_LIST in the trunk.
$ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
./opal/mca/btl/usnic/btl_usnic_compat.h:161: OMPI_FREE_LIST_GET_MT(list, (item))
./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89: OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item); \
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149: OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206: OMPI_FREE_LIST_GET_MT(task_list, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208: OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130: OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115: OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);
I wonder how these are even compiling ...
George.
On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <[email protected]>
wrote:
> Alexey,
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more elements
> are available. Based on your description of the problem, I would assume you
> encounter it when MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this
> particular case the problem is the fact that we call OMPI_FREE_LIST_GET_MT
> and the upper level is unable to correctly handle a returned item that is
> NULL. Here the real fix is to use the blocking version of the free_list
> accessor, OMPI_FREE_LIST_WAIT_MT (similar to what is done on the send path).
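>
> Concretely, inside MCA_PML_OB1_RECV_REQUEST_ALLOC the change would roughly
> look like this (a sketch only; the surrounding macro body is omitted and the
> item variable name is assumed):
>
>     -    OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item);
>     +    OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);
>
> The WAIT variant is intended to block until an item becomes available,
> instead of handing a NULL item back to the caller.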
>
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
> George.
>
>
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <[email protected]>
> wrote:
>
>> Hi all,
>>
>> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
>> support level) where several threads submit a lot of MPI_Irecv() requests
>> simultaneously, and we encountered an intermittent
>> OMPI_ERR_TEMP_OUT_OF_RESOURCE failure after MCA_PML_OB1_RECV_REQUEST_ALLOC()
>> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we
>> found that sometimes the thread calling ompi_free_list_grow() doesn't find
>> any free items left in the LIFO list on exit, because other threads have
>> already retrieved all of the new items via opal_atomic_lifo_pop().
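>>
>> Roughly, the current fast path behaves as follows (a simplified sketch, not
>> the verbatim macro), which is where the race shows up:
>>
>>     item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));
>>     if (NULL == item) {
>>         opal_mutex_lock(&((fl)->fl_lock));
>>         /* grow() pushes fl_num_per_alloc new items onto the LIFO */
>>         ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);
>>         opal_mutex_unlock(&((fl)->fl_lock));
>>         /* other threads may pop every freshly grown item right here */
>>         item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));
>>         /* so this second pop can still return NULL under contention */
>>     }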
>>
>> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>>
>> #define OMPI_FREE_LIST_GET_MT(fl, item)                                   \
>>     {                                                                     \
>>         item = (ompi_free_list_item_t*)                                   \
>>             opal_atomic_lifo_pop(&((fl)->super));                         \
>>         if( OPAL_UNLIKELY(NULL == item) ) {                               \
>>             if( opal_using_threads() ) {                                  \
>>                 int rc;                                                   \
>>                 opal_mutex_lock(&((fl)->fl_lock));                        \
>>                 do {                                                      \
>>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc); \
>>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;        \
>>                     item = (ompi_free_list_item_t*)                       \
>>                         opal_atomic_lifo_pop(&((fl)->super));             \
>>                 } while (!item);                                          \
>>                 opal_mutex_unlock(&((fl)->fl_lock));                      \
>>             } else {                                                      \
>>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);        \
>>                 item = (ompi_free_list_item_t*)                           \
>>                     opal_atomic_lifo_pop(&((fl)->super));                 \
>>             } /* opal_using_threads() */                                  \
>>         } /* NULL == item */                                              \
>>     }
>>
>>
>>
>>
>>
>> Another workaround is to increase the value of the pml_ob1_free_list_inc
>> parameter.
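>>
>> For example (the value 256 is just an illustrative choice, and ./hybrid_app
>> stands in for the application binary):
>>
>>     mpirun --mca pml_ob1_free_list_inc 256 ./hybrid_app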
>>
>>
>>
>> Regards,
>>
>> Alexey
>>
>>
>>