Alexey,

This is not necessarily the fix for all cases. Most of the internal uses of
the free_list can easily accommodate the fact that no more elements are
available. Based on your description of the problem, I would assume you
encounter it once MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this
particular case the problem is the fact that we call OMPI_FREE_LIST_GET_MT
and that the upper level is unable to correctly deal with a returned item
that is NULL. The real fix here is to use the blocking version of the
free_list accessor, OMPI_FREE_LIST_WAIT_MT (similar to what is done on the
send side).
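
Roughly, the change at the allocation site would look like the sketch below
(untested, written from memory -- double-check against pml_ob1_recvreq.h):

    #define MCA_PML_OB1_RECV_REQUEST_ALLOC(recvreq)                     \
    do {                                                                \
        ompi_free_list_item_t* item;                                    \
        /* OMPI_FREE_LIST_WAIT_MT blocks (growing the list and/or       \
         * waiting) until an item is available, so unlike               \
         * OMPI_FREE_LIST_GET_MT it never hands back NULL. */           \
        OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);      \
        recvreq = (mca_pml_ob1_recv_request_t*)item;                    \
    } while (0)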


It is also possible that I misunderstood your problem. If the solution
above doesn't work, can you describe exactly where the NULL return from
OMPI_FREE_LIST_GET_MT is creating an issue?

George.


On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <avryzh...@compcenter.org>
wrote:

> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
> support level) where several threads submit a lot of MPI_Irecv() requests
> simultaneously, and we encountered an intermittent
> OMPI_ERR_TEMP_OUT_OF_RESOURCE failure after MCA_PML_OB1_RECV_REQUEST_ALLOC()
> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we
> found that the thread calling ompi_free_list_grow() sometimes has no free
> items left in the LIFO list on exit, because other threads have already
> retrieved all of the new items via opal_atomic_lifo_pop().
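>
> For reference, the existing macro behaves roughly like this (a simplified
> sketch, not the verbatim source): after a failed pop it grows the list
> under the lock, but then pops only once more, so the growing thread can
> still find the LIFO empty:
>
>     item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super));
>     if( NULL == item ) {
>         opal_mutex_lock(&((fl)->fl_lock));
>         /* pushes fl_num_per_alloc fresh items onto the LIFO */
>         ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);
>         opal_mutex_unlock(&((fl)->fl_lock));
>         /* single retry: other threads may already have popped every new
>          * item, so this can still return NULL */
>         item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super));
>     }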
>
> So we suggest to change OMPI_FREE_LIST_GET_MT() as below:
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                       \
>     {                                                                         \
>         item = (ompi_free_list_item_t*)opal_atomic_lifo_pop(&((fl)->super));  \
>         if( OPAL_UNLIKELY(NULL == item) ) {                                   \
>             if( opal_using_threads() ) {                                      \
>                 int rc;                                                       \
>                 opal_mutex_lock(&((fl)->fl_lock));                            \
>                 do {                                                          \
>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);   \
>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;            \
>                     item = (ompi_free_list_item_t*)                           \
>                            opal_atomic_lifo_pop(&((fl)->super));              \
>                 } while (!item);                                              \
>                 opal_mutex_unlock(&((fl)->fl_lock));                          \
>             } else {                                                          \
>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);            \
>                 item = (ompi_free_list_item_t*)                               \
>                        opal_atomic_lifo_pop(&((fl)->super));                  \
>             } /* opal_using_threads() */                                      \
>         } /* NULL == item */                                                  \
>     }
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
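>
> E.g. on the mpirun command line, or via the matching environment variable
> (256 below is just an illustrative value, not a tuned recommendation):
>
>     mpirun --mca pml_ob1_free_list_inc 256 -np 16 ./hybrid_app
>     # or: export OMPI_MCA_pml_ob1_free_list_inc=256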
>
>
>
> Regards,
>
> Alexey
>
