iboffload and bfo are opal ignored by default. Neither exists in the release branch.
-Nathan
On Wed, Sep 16, 2015 at 12:02:29PM -0400, George Bosilca wrote:
> While looking into a possible fix for this problem we should also clean up
> the leftover uses of OMPI_FREE_LIST in the trunk.
> $ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
> ./opal/mca/btl/usnic/btl_usnic_compat.h:161: OMPI_FREE_LIST_GET_MT(list, (item))
> ./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89: OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item); \
> ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149: OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206: OMPI_FREE_LIST_GET_MT(task_list, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208: OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156: OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130: OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
> ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115: OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);
> I wonder how these are even compiling ...
> George.
> On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <[email protected]>
> wrote:
>
> Alexey,
> This is not necessarily the fix for all cases. Most of the internal uses
> of the free_list can easily accommodate the fact that no more
> elements are available. Based on your description of the problem I would
> assume you encounter this problem once
> MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In this particular case the
> problem is that we call OMPI_FREE_LIST_GET_MT and that the
> upper level is unable to deal correctly with the case where the returned
> item is NULL. Here the real fix is to use the
> blocking version of the free_list accessor (similar to the case for
> send), OMPI_FREE_LIST_WAIT_MT.
> It is also possible that I misunderstood your problem. If the solution
> above doesn't work, can you describe exactly where the NULL return of
> OMPI_FREE_LIST_GET_MT is creating an issue?
> George.
> On Wed, Sep 16, 2015 at 9:03 AM, Aleksej Ryzhih
> <[email protected]> wrote:
>
> Hi all,
>
> We experimented with an MPI+OpenMP hybrid application
> (MPI_THREAD_MULTIPLE support level) in which several threads submit a
> lot of MPI_Irecv() requests simultaneously, and encountered an
> intermittent OMPI_ERR_TEMP_OUT_OF_RESOURCE error after
> MCA_PML_OB1_RECV_REQUEST_ALLOC() because OMPI_FREE_LIST_GET_MT()
> returned NULL. Investigating this bug, we found that sometimes the
> thread calling ompi_free_list_grow() doesn't have any free items in
> the LIFO list on exit because other threads have already retrieved all
> the new items via opal_atomic_lifo_pop().
>
> So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
>
> #define OMPI_FREE_LIST_GET_MT(fl, item)                                        \
>     {                                                                          \
>         item = (ompi_free_list_item_t*)                                        \
>             opal_atomic_lifo_pop(&((fl)->super));                              \
>         if( OPAL_UNLIKELY(NULL == item) ) {                                    \
>             if(opal_using_threads()) {                                         \
>                 int rc;                                                        \
>                 opal_mutex_lock(&((fl)->fl_lock));                             \
>                 do {                                                           \
>                     rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);    \
>                     if( OPAL_UNLIKELY(rc != OMPI_SUCCESS)) break;              \
>                     item = (ompi_free_list_item_t*)                            \
>                         opal_atomic_lifo_pop(&((fl)->super));                  \
>                 } while (!item);                                               \
>                 opal_mutex_unlock(&((fl)->fl_lock));                           \
>             } else {                                                           \
>                 ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);             \
>                 item = (ompi_free_list_item_t*)                                \
>                     opal_atomic_lifo_pop(&((fl)->super));                      \
>             } /* opal_using_threads() */                                       \
>         } /* NULL == item */                                                   \
>     }
>
> Another workaround is to increase the value of the pml_ob1_free_list_inc
> parameter.
>
>
>
> Regards,
>
> Alexey
>
>
>
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18039.php
> _______________________________________________
> devel mailing list
> [email protected]
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/09/18046.php
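
The difference George draws between a non-blocking accessor that may return NULL and a blocking one that waits for an item can be sketched with a toy list. This is not the Open MPI implementation: `toy_list_t`, `toy_get`, `toy_wait` and `toy_return` are hypothetical names, and the use of a pthread mutex and condition variable is an assumption made only for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical toy free list; NOT Open MPI's ompi_free_list_t. */
typedef struct toy_item { struct toy_item *next; } toy_item_t;

typedef struct {
    toy_item_t     *head;      /* singly linked LIFO of free items */
    pthread_mutex_t lock;      /* protects head */
    pthread_cond_t  nonempty;  /* signalled when an item is returned */
} toy_list_t;

static void toy_init(toy_list_t *fl)
{
    fl->head = NULL;
    pthread_mutex_init(&fl->lock, NULL);
    pthread_cond_init(&fl->nonempty, NULL);
}

/* Give an item back to the list and wake one waiter, if any. */
static void toy_return(toy_list_t *fl, toy_item_t *it)
{
    pthread_mutex_lock(&fl->lock);
    it->next = fl->head;
    fl->head = it;
    pthread_cond_signal(&fl->nonempty);
    pthread_mutex_unlock(&fl->lock);
}

/* Non-blocking accessor: returns NULL when the list is empty,
 * so every caller has to handle the NULL case. */
static toy_item_t *toy_get(toy_list_t *fl)
{
    pthread_mutex_lock(&fl->lock);
    toy_item_t *it = fl->head;
    if (NULL != it) fl->head = it->next;
    pthread_mutex_unlock(&fl->lock);
    return it;
}

/* Blocking accessor: sleeps until an item is available,
 * so the caller never sees NULL. */
static toy_item_t *toy_wait(toy_list_t *fl)
{
    pthread_mutex_lock(&fl->lock);
    while (NULL == fl->head)
        pthread_cond_wait(&fl->nonempty, &fl->lock);
    toy_item_t *it = fl->head;
    fl->head = it->next;
    pthread_mutex_unlock(&fl->lock);
    return it;
}
```

With the non-blocking form the burden is on the caller to cope with an empty list; with the blocking form the caller trades that for a possible wait, which is only safe if some other thread will eventually return an item.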
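
The race Alexey describes, and the grow-and-retry loop he proposes, can be modelled without real threads. The sketch below is a deliberately simplified, single-threaded model under stated assumptions: the list is reduced to a counter, and the hypothetical `thefts_left` field stands in for other threads popping every new item before the growing thread gets one. None of the names (`toy_list_t`, `toy_grow`, `get_once`, `get_retry`) come from Open MPI.

```c
#include <assert.h>

/* Hypothetical toy model: the free list is just a count of items. */
typedef struct {
    int free_items;   /* items currently available */
    int per_alloc;    /* items added by each successful grow */
    int grows;        /* successful grow calls so far */
    int max_grows;    /* grow fails once this many have happened */
    int thefts_left;  /* times "other threads" drain the list after a grow */
} toy_list_t;

/* Pop one item: 1 on success, 0 if the list is empty. */
static int toy_pop(toy_list_t *fl)
{
    if (0 == fl->free_items) return 0;
    fl->free_items--;
    return 1;
}

/* Grow the list: 1 on success, 0 on failure. After a successful grow,
 * simulate concurrent threads taking all the new items. */
static int toy_grow(toy_list_t *fl)
{
    if (fl->grows >= fl->max_grows) return 0;
    fl->free_items += fl->per_alloc;
    fl->grows++;
    if (fl->thefts_left > 0) {
        fl->thefts_left--;
        fl->free_items = 0;
    }
    return 1;
}

/* Grow once, pop once: can come back empty-handed. */
static int get_once(toy_list_t *fl)
{
    if (toy_pop(fl)) return 1;
    toy_grow(fl);
    return toy_pop(fl);
}

/* Retry shape from the proposal: keep growing until a pop
 * succeeds or a grow fails. */
static int get_retry(toy_list_t *fl)
{
    int ok = toy_pop(fl);
    while (!ok) {
        if (!toy_grow(fl)) break;
        ok = toy_pop(fl);
    }
    return ok;
}
```

In this model, a list whose new items are taken twice makes `get_once` fail, while `get_retry` succeeds on the third grow. The retry loop still gives up when growing itself fails, which mirrors the `break` on `rc != OMPI_SUCCESS` in the proposed macro, so callers of the retrying form still need to handle a NULL item in that case.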
