Just wanted to bring this to everyone's attention.  I don't
particularly understand the use of bmi_test_flag to protect
calls to BMI_testcontext.  I would have thought that the device
interface mutexes would have been sufficient.

The problem was that the loop in PINT_thread_mgr_bmi_cancel()
that tries to grab bmi_test_mutex when bmi_test_flag is zero
was never succeeding.  Added printfs showed it getting signaled
but when it went to retest bmi_test_flag, it was always 1 again,
showing that the bmi_thread_function() thread had gotten it
again.

So I patched around that, in a not very elegant but perfectly
serviceable way, but thought someone who understands this better
could double-check that we still really need this bmi_test_flag
construct.

Patch below already applied.

                -- Pete

[EMAIL PROTECTED] wrote on Tue, 19 Sep 2006 17:37 -0400:
> In directory parlweb1:/tmp/cvs-serv27961/src/io/job
> 
> Modified Files:
>       thread-mgr.c 
> Log Message:
> avoid starvation of a thread trying to BMI_cancel by adding another global 
> variable
> 
> 
> Index: thread-mgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2/src/io/job/thread-mgr.c,v
> diff -u -p -p -u -r1.31 -r1.32
> --- thread-mgr.c      28 Aug 2006 19:01:57 -0000      1.31
> +++ thread-mgr.c      19 Sep 2006 21:37:25 -0000      1.32
> @@ -61,6 +61,7 @@ static pthread_cond_t trove_test_cond = 
>   */
>  static gen_mutex_t bmi_test_mutex = GEN_MUTEX_INITIALIZER;
>  static int bmi_test_flag = 0;
> +static int bmi_test_cancel_waiter = 0;
>  static int bmi_test_count = 0;
>  static gen_mutex_t trove_test_mutex = GEN_MUTEX_INITIALIZER;
>  static int trove_test_flag = 0;
> @@ -218,6 +219,14 @@ static void *bmi_thread_function(void *p
>  
>       /* indicate that a test is in progress */
>       gen_mutex_lock(&bmi_test_mutex);
> +#ifdef __PVFS2_JOB_THREADED__
> +        /* wait politely for any cancel operations to run; else we're
> +         * too fast in regrabbing the test_mutex and it hangs waiting for
> +         * that small window where test_flag is zero. */
> +        while (bmi_test_cancel_waiter) {
> +            pthread_cond_wait(&bmi_test_cond, &bmi_test_mutex);
> +        }
> +#endif
>       bmi_test_flag = 1;
>       gen_mutex_unlock(&bmi_test_mutex);
>       
> @@ -509,6 +518,7 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
>       * progress
>       */
>      gen_mutex_lock(&bmi_test_mutex);
> +    ++bmi_test_cancel_waiter;
>      while(bmi_test_flag == 1)
>      {
>  #ifdef __PVFS2_JOB_THREADED__
> @@ -518,14 +528,14 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
>       assert(0);
>  #endif
>      }
> +    --bmi_test_cancel_waiter;
>  
>      /* iterate down list of pending completions, to see if the caller is
>       * trying to cancel one of them
>       */
> -#if 0
> -    gossip_err("THREAD MGR trying to cancel op: %llu, ptr: %p.\n",
> -     llu(id), user_ptr);
> -#endif
> +    gossip_debug(GOSSIP_JOB_DEBUG,
> +                 "%s: trying to cancel opid: %llu, ptr: %p.\n",
> +              __func__, llu(id), user_ptr);
>      for(i=0; i<bmi_test_count; i++)
>      {
>  #if 0
> @@ -549,6 +559,10 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
>      ret = BMI_cancel(id, global_bmi_context);
>      if(ret < 0)
>       gossip_err("WARNING: BMI cancel failed, proceeding anyway.\n");
> +#ifdef __PVFS2_JOB_THREADED__
> +    /* release waiting testcontext thread */
> +    pthread_cond_signal(&bmi_test_cond);
> +#endif
>      gen_mutex_unlock(&bmi_test_mutex);
>      return(ret);
>  }
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers

Reply via email to