While looking into a possible fix for this problem, we should also clean up
the leftovers from the OMPI_FREE_LIST in the trunk:

$ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
./opal/mca/btl/usnic/btl_usnic_compat.h:161:    OMPI_FREE_LIST_GET_MT(list, (item))
./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89:    OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item);          \
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149:    OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206:    OMPI_FREE_LIST_GET_MT(task_list, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208:    OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130:    OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115:    OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);

I wonder how these are even compiling ...
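
For the record, the port itself should be mechanical. A minimal sketch of
what one of these call sites could become, assuming the opal_free_list_get()
accessor and the OPAL_ERR_TEMP_OUT_OF_RESOURCE / OPAL_SUCCESS constants from
the current opal interface (the helper function itself is hypothetical):

    #include "opal/class/opal_free_list.h"
    #include "opal/constants.h"

    /* Hypothetical helper: the old OMPI_FREE_LIST_GET_MT(fl, item) macro
     * becomes a plain call that returns the item, or NULL when the list
     * is temporarily empty and cannot grow. */
    static int get_frag(opal_free_list_t *frags_free,
                        opal_free_list_item_t **item_out)
    {
        opal_free_list_item_t *item = opal_free_list_get(frags_free);
        if (NULL == item) {
            return OPAL_ERR_TEMP_OUT_OF_RESOURCE;
        }
        *item_out = item;
        return OPAL_SUCCESS;
    }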

  George.



On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <bosi...@icl.utk.edu>
wrote:

> Alexey,
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more elements
> are available. Based on your description of the problem I would assume
> you encounter it when MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this
> particular case the problem is the fact that we call
> OMPI_FREE_LIST_GET_MT and that the upper level is unable to correctly
> deal with the case where the returned item is NULL. Here the real fix is
> to use the blocking version of the free_list accessor (similar to the
> send case), OMPI_FREE_LIST_WAIT_MT.
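>
> As a concrete sketch (the exact macro in pml_ob1_recvreq.h may differ a
> bit), the change would be along these lines:
>
>     #define MCA_PML_OB1_RECV_REQUEST_ALLOC(recvreq)                      \
>         do {                                                             \
>             ompi_free_list_item_t *item;                                 \
>             /* was OMPI_FREE_LIST_GET_MT(), which can return NULL */     \
>             OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);   \
>             recvreq = (mca_pml_ob1_recv_request_t *) item;               \
>         } while (0)
>
> The WAIT variant blocks until an item becomes available, so the caller
> never sees a NULL request.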
>
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
> George.
>
>
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <avryzh...@compcenter.org>
> wrote:
>
>> Hi all,
>>
>> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
>> support level) where several threads submit a lot of MPI_Irecv() requests
>> simultaneously, and encountered an intermittent
>> OMPI_ERR_TEMP_OUT_OF_RESOURCE failure after MCA_PML_OB1_RECV_REQUEST_ALLOC()
>> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug we
>> found that sometimes the thread calling ompi_free_list_grow() has no free
>> items left in the LIFO list on exit, because other threads have already
>> retrieved all of the new items via opal_atomic_lifo_pop().
>>
>> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>>
>> #define OMPI_FREE_LIST_GET_MT(fl, item)                                        \
>>     {                                                                          \
>>         item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));  \
>>         if( OPAL_UNLIKELY(NULL == item) ) {                                    \
>>             if(opal_using_threads()) {                                         \
>>                 int rc;                                                        \
>>                 opal_mutex_lock(&((fl)->fl_lock));                             \
>>                 do                                                             \
>>                 {                                                              \
>>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
>>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS)) break;              \
>>                                                                                \
>>                     item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super)); \
>>                 } while (!item);                                               \
>>                 opal_mutex_unlock(&((fl)->fl_lock));                           \
>>             } else {                                                           \
>>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
>>                 item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));     \
>>             } /* opal_using_threads() */                                       \
>>         } /* NULL == item */                                                   \
>>     }
>>
>> Another workaround is to increase the value of the pml_ob1_free_list_inc
>> parameter.
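>>
>> For example, on the command line (the value and application name here are
>> just illustrative):
>>
>>     mpirun --mca pml_ob1_free_list_inc 256 -np 8 ./hybrid_app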
>>
>>
>>
>> Regards,
>>
>> Alexey
>>
>>
>>
>> _______________________________________________
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post:
>> http://www.open-mpi.org/community/lists/devel/2015/09/18039.php
>>
>
>
