iboffload and bfo are opal-ignored by default, so neither is ever built
(which is why the stale OMPI_FREE_LIST_GET_MT calls do not break the
compile). Neither exists in the release branch.

-Nathan

On Wed, Sep 16, 2015 at 12:02:29PM -0400, George Bosilca wrote:
>    While looking into a possible fix for this problem, we should also clean
>    up the leftovers from the old OMPI_FREE_LIST in the trunk:
>    $ find . -name "*.[ch]" -exec grep -Hn OMPI_FREE_LIST_GET_MT {} +
>    ./opal/mca/btl/usnic/btl_usnic_compat.h:161:    OMPI_FREE_LIST_GET_MT(list, (item))
>    ./ompi/mca/pml/bfo/pml_bfo_recvreq.h:89:    OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item);          \
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:149:    OMPI_FREE_LIST_GET_MT(&cm->tasks_free, item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_task.h:206:    OMPI_FREE_LIST_GET_MT(task_list, item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:107:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:146:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.c:208:    OMPI_FREE_LIST_GET_MT(&iboffload->device->frags_free[qp_index], item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_qp_info.c:156:    OMPI_FREE_LIST_GET_MT(&device->frags_free[qp_index], item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_collfrag.h:130:    OMPI_FREE_LIST_GET_MT(&cm->collfrags_free, item);
>    ./ompi/mca/bcol/iboffload/bcol_iboffload_frag.h:115:    OMPI_FREE_LIST_GET_MT(&cm->ml_frags_free, item);
>    I wonder how these are even compiling ...
>      George.
>    On Wed, Sep 16, 2015 at 11:59 AM, George Bosilca <bosi...@icl.utk.edu>
>    wrote:
> 
>      Alexey,
>      This is not necessarily the right fix for all cases. Most internal users
>      of the free_list can easily accommodate the fact that no more elements
>      are available. Based on your description of the problem, I would assume
>      you hit it once MCA_PML_OB1_RECV_REQUEST_ALLOC is called. In that
>      particular case the real issue is that we call OMPI_FREE_LIST_GET_MT
>      while the upper level is unable to deal correctly with a returned item
>      of NULL. The proper fix there is to use the blocking version of the
>      free_list accessor, OMPI_FREE_LIST_WAIT_MT (as is already done on the
>      send path).
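>      For the receive-request path that boils down to swapping the accessor
>      used for mca_pml_base_recv_requests (a sketch only; the exact call site
>      in pml_ob1_recvreq.h may differ):
> 
>          /* non-blocking: may legitimately return item == NULL */
>          OMPI_FREE_LIST_GET_MT(&mca_pml_base_recv_requests, item);
> 
>          /* blocking: waits until an item is available instead of returning NULL */
>          OMPI_FREE_LIST_WAIT_MT(&mca_pml_base_recv_requests, item);
> 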
>      It is also possible that I misunderstood your problem. If the solution
>      above doesn't work, can you describe exactly where the NULL return from
>      OMPI_FREE_LIST_GET_MT creates an issue?
>      George.
>      On Wed, Sep 16, 2015 at 9:03 AM, Aleksej Ryzhih
>      <avryzh...@compcenter.org> wrote:
> 
>        Hi all,
> 
>        We experimented with an MPI+OpenMP hybrid application (at the
>        MPI_THREAD_MULTIPLE support level) in which several threads submit a
>        lot of MPI_Irecv() requests simultaneously, and we hit an intermittent
>        OMPI_ERR_TEMP_OUT_OF_RESOURCE failure after
>        MCA_PML_OB1_RECV_REQUEST_ALLOC(), because OMPI_FREE_LIST_GET_MT()
>        returned NULL. Investigating this bug, we found that the thread
>        calling ompi_free_list_grow() sometimes has no free items left in the
>        LIFO list on exit, because other threads have already retrieved all of
>        the newly grown items via opal_atomic_lifo_pop().
> 
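>        The interleaving we believe is happening is roughly the following
>        (an illustrative trace only; thread numbering is an example):
> 
>            /* T1: opal_atomic_lifo_pop() returns NULL, so T1 calls            */
>            /*     ompi_free_list_grow() and new items land on the LIFO        */
>            /* T2..Tn: pop all of the freshly grown items for their own        */
>            /*     requests before T1 gets back to the LIFO                    */
>            /* T1: its next opal_atomic_lifo_pop() returns NULL again, so      */
>            /*     MCA_PML_OB1_RECV_REQUEST_ALLOC() fails with                 */
>            /*     OMPI_ERR_TEMP_OUT_OF_RESOURCE                               */
> 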
>        So we suggest changing OMPI_FREE_LIST_GET_MT() as below:
> 
>         
> 
>        #define OMPI_FREE_LIST_GET_MT(fl, item)                                       \
>        {                                                                             \
>            item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));     \
>            if( OPAL_UNLIKELY(NULL == item) ) {                                       \
>                if(opal_using_threads()) {                                            \
>                    int rc;                                                           \
>                    opal_mutex_lock(&((fl)->fl_lock));                                \
>                    do                                                                \
>                    {                                                                 \
>                        rc = ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);       \
>                        if( OPAL_UNLIKELY(rc != OMPI_SUCCESS)) break;                 \
>                        item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super)); \
>                    } while (!item);                                                  \
>                    opal_mutex_unlock(&((fl)->fl_lock));                              \
>                } else {                                                              \
>                    ompi_free_list_grow((fl), (fl)->fl_num_per_alloc);                \
>                    item = (ompi_free_list_item_t*) opal_atomic_lifo_pop(&((fl)->super));     \
>                } /* opal_using_threads() */                                          \
>            } /* NULL == item */                                                      \
>        }
> 
> 
>        Another workaround is to increase the value of the
>        pml_ob1_free_list_inc MCA parameter.
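>        For example (a sketch only; the increment value and the application
>        name are placeholders):
> 
>            mpirun --mca pml_ob1_free_list_inc 256 ./hybrid_app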
> 
>         
> 
>        Regards,
> 
>        Alexey
> 
>         
> 

