Hello Fab,
Recently, we got the following assert  at __ioc_async_cb:            CL_ASSERT( 
!gp_ioc_pnp->query_cnt ); (==-1)
We suspect the problem appeared because of:

1.       From __ioc_pnp_timer_cb(
                /* Pre-charge the ref count so that we don't toggle between 0 
and 1. */
                cl_atomic_inc( &p_mgr->query_cnt );
                /* Take a reference on the object for the duration of the sweep 
process. */
                ref_al_obj( &p_mgr->obj );
                for( p_item = cl_qlist_head( &p_mgr->obj.obj_list );
                                p_item != cl_qlist_end( &p_mgr->obj.obj_list );
                                p_item = cl_qlist_next( p_item ) )
                {
                                p_svc = PARENT_STRUCT( PARENT_STRUCT( p_item, 
al_obj_t, pool_item ),
                                                ioc_pnp_svc_t, obj );
                                cl_atomic_inc( &p_mgr->query_cnt );
                                status = __ioc_query_sa( p_svc );
                                if( status != IB_SUCCESS )
                                                cl_atomic_dec( 
&p_mgr->query_cnt );
                }
                /* Release the reference we took and see if we're done 
sweeping. */
                if( !cl_atomic_dec( &p_mgr->query_cnt ) )
                                cl_async_proc_queue( gp_async_pnp_mgr, 
&p_mgr->async_item )

2.       Here one assumes that either one of 2 situations is true:

·         __ioc_query_sa - returns success and later the reference will be 
removed by the appropriate callback exactly one time

·         It fails and the reference will be removed immediately
Occasionally, this is not the case

3.       __ioc_query_sa calls twice to ib_query: for Node and Path queries. 
Thus, 2 callbacks will be invoked thereafter: __node_rec_cb and __path_rec_cb

4.       Each of these callback may (and will) call to __process_query()

5.       If process_query will be called twice, that query_cnt will be 
decreased twice and we are in trouble.

6.       Theoretically, there is another counter that should avoid this case.

>From __ioc_query_sa():

........

cl_atomic_inc( &p_svc->query_cnt );<---Raise reference to ensure that 
__process_query will be called only once

status = ib_query( gh_al, &query, &p_svc->h_node_query );

        if( status != IB_SUCCESS )

        {

                        cl_atomic_dec( &p_svc->query_cnt );

                        deref_al_obj( &p_svc->obj );

                        AL_PRINT_EXIT( TRACE_LEVEL_WARNING, AL_DBG_PNP,

                                        ("ib_query returned %s\n", 
ib_get_err_str( status )) );

                        return status;

        }

..............

        /* Reference the service for the node record query. */

        ref_al_obj( &p_svc->obj );

        cl_atomic_inc( &p_svc->query_cnt );



        status = ib_query( gh_al, &query, &p_svc->h_path_query );

        if( status != IB_SUCCESS )

        {

                        cl_atomic_dec( &p_svc->query_cnt );

                        deref_al_obj( &p_svc->obj );

                        AL_PRINT_EXIT( TRACE_LEVEL_WARNING, AL_DBG_PNP,

                                        ("ib_query returned %s\n", 
ib_get_err_str( status )) );

        }

>From __path_rec_cb and node_rec_cb:



if( !cl_atomic_dec( &p_svc->query_cnt ) ) //should avoid calling 
__process_query twice

        {

                        /* The node query has already completed.  Process the 
results. */

                        __process_query( p_svc );

        }



The issues that we see with the following situation are this:

1)       When __ioc_query_sa is running, the first query will be executed and 
it's call back will be called before the second query starts. In this case, the 
__node_rec_cb will decrease the p_svc->query_cnt and bring it to 0. Thus, 
__process_query will be called twice !

2)       When looking at the function __process_query we expect it to call 
cl_atomic_dec( &gp_ioc_pnp->query_cnt)exactly one time. However, if the code 
executes till the end of the function, than we reach a case statement in which 
in the case of status == success the reference will not be removed.

3)       p_svc->query_cnt seems to be like a reference count that is protecting 
gp_ioc_pnp->query_cnt. However this only works if p_svc->query_cnt was 0 when 
__ioc_pnp_timer_cb was called and no one has touched it on the entire time. 
However, we see that p_svc->query_cnt is being touched in many places in the 
code, and we don't seem to understand why this assumptions are right.

4)       What is the consequences of calling __process_query more than once 
(this is an issue that we believe that might currently happen in the code).

We are still trying to understand the assumptions that this code is based on.

Thanks
Tzachi & Alexander (XaleX) Naslednikov

_______________________________________________
ofw mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ofw

Reply via email to