Hi all,

We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
support level) in which several threads submit a lot of MPI_Irecv() requests
simultaneously, and encountered an intermittent error:
OMPI_ERR_TEMP_OUT_OF_RESOURCE after MCA_PML_OB1_RECV_REQUEST_ALLOC(),
because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we
found that sometimes the thread calling ompi_free_list_grow() does not have
any free items in the LIFO list on exit, because other threads have already
retrieved all the new items with opal_atomic_lifo_pop().

So we suggest changing OMPI_FREE_LIST_GET_MT() as follows:



#define OMPI_FREE_LIST_GET_MT(fl, item)                                        \
    {                                                                          \
        item = (ompi_free_list_item_t*)                                        \
            opal_atomic_lifo_pop(&((fl)->super));                              \
        if( OPAL_UNLIKELY(NULL == item) ) {                                    \
            if(opal_using_threads()) {                                         \
                int rc;                                                        \
                opal_mutex_lock(&((fl)->fl_lock));                             \
                do                                                             \
                {                                                              \
                    rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
                    if( OPAL_UNLIKELY(rc != OMPI_SUCCESS)) break;              \
                    item = (ompi_free_list_item_t*)                            \
                        opal_atomic_lifo_pop(&((fl)->super));                  \
                } while (!item);                                               \
                opal_mutex_unlock(&((fl)->fl_lock));                           \
            } else {                                                           \
                ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
                item = (ompi_free_list_item_t*)                                \
                    opal_atomic_lifo_pop(&((fl)->super));                      \
            } /* opal_using_threads() */                                       \
        } /* NULL == item */                                                   \
    }
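
To make the race easier to see outside of Open MPI, here is a small
single-threaded sketch of the same idea. It is only an illustration under
our own assumptions: every toy_* name is hypothetical and not part of Open
MPI, the list is a plain (non-atomic) stack, and the "other threads" are
simulated by draining the list at the end of toy_grow() a fixed number of
times. It compares the grow-once-then-pop pattern with the retry loop
proposed above.

```c
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins for illustration only; none of these are Open MPI APIs. */
typedef struct toy_item { struct toy_item *next; } toy_item_t;

typedef struct {
    toy_item_t *head;       /* LIFO of free items */
    size_t num_per_alloc;   /* items added by each grow */
    int steals_left;        /* times the simulated rival threads drain the list */
} toy_free_list_t;

static void toy_push(toy_free_list_t *fl, toy_item_t *it)
{
    it->next = fl->head;
    fl->head = it;
}

static toy_item_t *toy_pop(toy_free_list_t *fl)
{
    toy_item_t *it = fl->head;
    if (it != NULL) fl->head = it->next;
    return it;
}

/* Add num_per_alloc items, then (while steals_left > 0) simulate other
 * threads popping every new item before the caller gets a chance. */
static int toy_grow(toy_free_list_t *fl)
{
    for (size_t i = 0; i < fl->num_per_alloc; i++) {
        toy_item_t *it = malloc(sizeof *it);
        if (it == NULL) return -1;
        toy_push(fl, it);
    }
    if (fl->steals_left > 0) {
        toy_item_t *it;
        fl->steals_left--;
        while ((it = toy_pop(fl)) != NULL) free(it); /* rivals took them all */
    }
    return 0;
}

/* Grow once, pop once: can return NULL even though grow succeeded. */
static toy_item_t *toy_get_once(toy_free_list_t *fl)
{
    toy_item_t *it = toy_pop(fl);
    if (it == NULL && toy_grow(fl) == 0) it = toy_pop(fl);
    return it;
}

/* Retry loop: keep growing until a pop succeeds or grow itself fails. */
static toy_item_t *toy_get_retry(toy_free_list_t *fl)
{
    toy_item_t *it = toy_pop(fl);
    if (it == NULL) {
        do {
            if (toy_grow(fl) != 0) break;
            it = toy_pop(fl);
        } while (it == NULL);
    }
    return it;
}

/* Returns 1 if an item was obtained, 0 if the getter returned NULL. */
int toy_demo(int steals, int use_retry)
{
    toy_free_list_t fl = { NULL, 4, steals };
    toy_item_t *it = use_retry ? toy_get_retry(&fl) : toy_get_once(&fl);
    int got = (it != NULL);
    free(it);
    while ((it = toy_pop(&fl)) != NULL) free(it); /* release leftovers */
    return got;
}
```

With two simulated steals, toy_demo(2, 0) comes back empty-handed while
toy_demo(2, 1) gets an item after the third grow. In the real macro the
loop runs under fl_lock, so only one thread grows at a time, but poppers
do not take that lock and can still drain the list between iterations.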





Another workaround is to increase the value of the pml_ob1_free_list_inc
parameter, so that each call to ompi_free_list_grow() adds more items.



Regards,

Alexey
