On Wed, Sep 16, 2015 at 3:11 PM, Владимир Трущин <vdtrusc...@compcenter.org>
wrote:

> Sorry, that should read: “We saw the following problem in OMPI_FREE_LIST_GET_MT…”.
>

That's exactly what the WAIT macro is supposed to solve: it waits (growing the
freelist and calling opal_progress) until an item becomes available.
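
To illustrate the idea (a rough sketch only, not the literal macro body; the
fl_* field names are as I recall them):

    /* WAIT semantics: retry until an item is obtained. */
    item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));
    while( NULL == item ) {
        if( (fl)->fl_num_allocated < (fl)->fl_max_to_alloc ) {
            /* below the allocation cap: grow the freelist by a chunk */
            ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);
        } else {
            /* at the cap: drive progress so other threads can return
             * items to the list */
            opal_progress();
        }
        item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));
    }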

  George.



>
>
> *From:* Владимир Трущин [mailto:vdtrusc...@compcenter.org]
> *Sent:* Wednesday, September 16, 2015 10:09 PM
> *To:* 'Open MPI Developers'
> *Subject:* RE: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
>
>
> George,
>
>
>
> You are right. The sequence of calls in our test is MPI_Irecv ->
> mca_pml_ob1_irecv -> MCA_PML_OB1_RECV_REQUEST_ALLOC. We will try to use
> OMPI_FREE_LIST_WAIT_MT.
>
>
>
> We saw the following problem in OMPI_FREE_LIST_WAIT_MT. It returned NULL
> when thread A was suspended right after the call to ompi_free_list_grow.
> In the meantime, other threads took all items from the free list at the
> first call of opal_atomic_lifo_pop in the macro. So, when thread A resumed
> and called the second opal_atomic_lifo_pop in the macro, it returned NULL.
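>
> As an illustrative interleaving (the thread labels are ours):
>
>     thread A:   opal_atomic_lifo_pop()  -> NULL
>     thread A:   ompi_free_list_grow()   -> pushes N fresh items
>     thread A:   (preempted by the scheduler)
>     threads B…: opal_atomic_lifo_pop()  -> drain all N fresh items
>     thread A:   (resumes)
>     thread A:   opal_atomic_lifo_pop()  -> NULL again, macro gives up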
>
>
>
> Best regards,
>
> Vladimir.
>
>
>
> *From:* devel [mailto:devel-boun...@open-mpi.org
> <devel-boun...@open-mpi.org>] *On Behalf Of *George Bosilca
> *Sent:* Wednesday, September 16, 2015 7:00 PM
> *To:* Open MPI Developers
> *Subject:* Re: [OMPI devel] The issue with OMPI_FREE_LIST_GET_MT()
>
>
>
> Alexey,
>
>
>
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more elements
> are available. Based on your description of the problem, I would assume you
> encounter it when MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this
> particular case the problem is that we call OMPI_FREE_LIST_GET_MT and the
> upper level is unable to correctly deal with the case where the returned
> item is NULL. The real fix here is to use the blocking version of the
> free_list accessor (similar to the send case), OMPI_FREE_LIST_WAIT_MT.
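>
> As a sketch of that change (paraphrased, not the literal ob1 macro body;
> the free list and request type names are the existing ones):
>
>     /* in the recv-request allocation path: block instead of failing */
>     ompi_free_list_item_t* item;
>     OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);
>     recvreq = (mca_pml_ob1_recv_request_t*) item;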
>
>
>
>
>
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
>
>
>
> George.
>
>
>
>
>
> On Wed, Sep 16, 2015 at 9:03 AM, Алексей Рыжих <avryzh...@compcenter.org>
> wrote:
>
> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application (MPI_THREAD_MULTIPLE
> support level) where several threads submit a lot of MPI_Irecv() requests
> simultaneously, and encountered an intermittent bug:
> OMPI_ERR_TEMP_OUT_OF_RESOURCE after MCA_PML_OB1_RECV_REQUEST_ALLOC(),
> because OMPI_FREE_LIST_GET_MT() returned NULL. Investigating this bug, we
> found that sometimes the thread calling ompi_free_list_grow() doesn't have
> any free items in the LIFO list at exit, because other threads retrieved
> all the new items via opal_atomic_lifo_pop().
>
> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>
>
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                        \
>     {                                                                          \
>         item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));  \
>         if( OPAL_UNLIKELY(NULL == item) ) {                                    \
>             if( opal_using_threads() ) {                                       \
>                 int rc;                                                        \
>                 opal_mutex_lock(&((fl)->fl_lock));                             \
>                 do {                                                           \
>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS) ) break;             \
>                     item = (ompi_free_list_item_t*)                            \
>                         opal_atomic_lifo_pop(&((fl)->super));                  \
>                 } while( !item );                                              \
>                 opal_mutex_unlock(&((fl)->fl_lock));                           \
>             } else {                                                           \
>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
>                 item = (ompi_free_list_item_t*)                                \
>                     opal_atomic_lifo_pop(&((fl)->super));                      \
>             } /* opal_using_threads() */                                       \
>         } /* NULL == item */                                                   \
>     }
>
>
>
>
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
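>
> For example (the increment value 256 is arbitrary; tune it to your thread
> count):
>
>     mpirun --mca pml_ob1_free_list_inc 256 -np <nprocs> ./your_app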
>
>
>
> Regards,
>
> Alexey
>
>
>
>
