Just wanted to bring this to everyone's attention. I don't
particularly understand the use of bmi_test_flag to protect
calls to BMI_testcontext. I would have thought that the device
interface mutexes would have been sufficient.
The problem was that the loop in PINT_thread_mgr_bmi_cancel()
that tries to grab bmi_test_mutex when bmi_test_flag is zero
was never succeeding. Added printfs showed it getting signaled,
but when it went to retest bmi_test_flag, the flag was always 1
again, showing that the bmi_thread_function() thread had grabbed
it again in the meantime.
So I patched around it, in a not-very-elegant but perfectly
serviceable way, but someone who understands this better should
double-check whether we still really need the bmi_test_flag
construct at all.
Patch below already applied.
-- Pete
[EMAIL PROTECTED] wrote on Tue, 19 Sep 2006 17:37 -0400:
> In directory parlweb1:/tmp/cvs-serv27961/src/io/job
>
> Modified Files:
> thread-mgr.c
> Log Message:
> avoid starvation of a thread trying to BMI_cancel by adding another global
> variable
>
>
> Index: thread-mgr.c
> ===================================================================
> RCS file: /projects/cvsroot/pvfs2/src/io/job/thread-mgr.c,v
> diff -u -p -p -u -r1.31 -r1.32
> --- thread-mgr.c 28 Aug 2006 19:01:57 -0000 1.31
> +++ thread-mgr.c 19 Sep 2006 21:37:25 -0000 1.32
> @@ -61,6 +61,7 @@ static pthread_cond_t trove_test_cond =
> */
> static gen_mutex_t bmi_test_mutex = GEN_MUTEX_INITIALIZER;
> static int bmi_test_flag = 0;
> +static int bmi_test_cancel_waiter = 0;
> static int bmi_test_count = 0;
> static gen_mutex_t trove_test_mutex = GEN_MUTEX_INITIALIZER;
> static int trove_test_flag = 0;
> @@ -218,6 +219,14 @@ static void *bmi_thread_function(void *p
>
> /* indicate that a test is in progress */
> gen_mutex_lock(&bmi_test_mutex);
> +#ifdef __PVFS2_JOB_THREADED__
> + /* wait politely for any cancel operations to run; else we're
> + * too fast in regrabbing the test_mutex and it hangs waiting for
> + * that small window where test_flag is zero. */
> + while (bmi_test_cancel_waiter) {
> + pthread_cond_wait(&bmi_test_cond, &bmi_test_mutex);
> + }
> +#endif
> bmi_test_flag = 1;
> gen_mutex_unlock(&bmi_test_mutex);
>
> @@ -509,6 +518,7 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
> * progress
> */
> gen_mutex_lock(&bmi_test_mutex);
> + ++bmi_test_cancel_waiter;
> while(bmi_test_flag == 1)
> {
> #ifdef __PVFS2_JOB_THREADED__
> @@ -518,14 +528,14 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
> assert(0);
> #endif
> }
> + --bmi_test_cancel_waiter;
>
> /* iterate down list of pending completions, to see if the caller is
> * trying to cancel one of them
> */
> -#if 0
> - gossip_err("THREAD MGR trying to cancel op: %llu, ptr: %p.\n",
> - llu(id), user_ptr);
> -#endif
> + gossip_debug(GOSSIP_JOB_DEBUG,
> + "%s: trying to cancel opid: %llu, ptr: %p.\n",
> + __func__, llu(id), user_ptr);
> for(i=0; i<bmi_test_count; i++)
> {
> #if 0
> @@ -549,6 +559,10 @@ int PINT_thread_mgr_bmi_cancel(PVFS_id_g
> ret = BMI_cancel(id, global_bmi_context);
> if(ret < 0)
> gossip_err("WARNING: BMI cancel failed, proceeding anyway.\n");
> +#ifdef __PVFS2_JOB_THREADED__
> + /* release waiting testcontext thread */
> + pthread_cond_signal(&bmi_test_cond);
> +#endif
> gen_mutex_unlock(&bmi_test_mutex);
> return(ret);
> }
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers