Hey Bart,

I haven't really tested this change yet, but can you try the attached patch and see if it solves the problem? I think this is a follow-on to the same bug you guys reported earlier; I just missed another race condition introduced by the last patch.
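
To be clear about what the patch changes: the problem is a decrement-and-test on the job's trove_pending counter that isn't covered by a single lock. Two trove completion callbacks can interleave between the read-modify-write and the zero check so that both see the counter reach zero, and both then complete (and free) the same job descriptor, which lines up with the double free in your backtrace. Here is a minimal standalone sketch of the broken pattern; the names are made up for illustration, not actual PVFS2 code:

/* race-sketch.c: compile with cc -pthread race-sketch.c
 * Illustrative only -- hypothetical names, not PVFS2 source.
 */
#include <pthread.h>
#include <stdlib.h>

struct job_desc
{
    int trove_pending;              /* trove ops still outstanding */
};

static pthread_mutex_t completion_mutex = PTHREAD_MUTEX_INITIALIZER;
static struct job_desc *jd;

/* completion callback shaped like the pre-patch code: the decrement
 * happens outside the mutex, so two callbacks can interleave and both
 * read the counter as zero below
 */
static void *trove_callback(void *unused)
{
    (void)unused;

    jd->trove_pending--;            /* unsynchronized read-modify-write */

    if (jd->trove_pending == 0)     /* both threads may see 0 here */
    {
        pthread_mutex_lock(&completion_mutex);
        free(jd);                   /* second arrival double frees */
        pthread_mutex_unlock(&completion_mutex);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[2];
    int i;

    jd = calloc(1, sizeof(*jd));
    jd->trove_pending = 2;          /* two outstanding trove ops */

    for (i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, trove_callback, NULL);
    for (i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    return 0;
}

Most runs will pass, but nothing prevents both callbacks from taking the "done" branch.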

-Phil

On 05/12/2010 05:18 PM, Bart Taylor wrote:
Hey guys,

I have a 3-node, local-disk file system that dumped core during some testing. It is a file system upgraded from 2.6 to 2.8.2. After the upgrade, I ran a couple of utilities like pvfs2-ping and pvfs2-statfs. After those succeeded, I attempted to create a new file of around 800K, and the first server died. There wasn't anything useful in the logs or dmesg. Below is a backtrace from the core file. I can supply the entire core file, but at 43M it is too big to email.

This may be related to the precreate-pool-race patch from a few days ago, since the backtrace indicates the crash happened in the vicinity of those code changes.

Let me know what else I can supply that will help.

Bart.

(gdb) bt
#0  0x009e37a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1  0x00a24825 in raise () from /lib/tls/libc.so.6
#2  0x00a26289 in abort () from /lib/tls/libc.so.6
#3  0x00a58d2a in __libc_message () from /lib/tls/libc.so.6
#4  0x00a5f72f in _int_free () from /lib/tls/libc.so.6
#5  0x00a5fbaa in free () from /lib/tls/libc.so.6
#6  0x0807d6e5 in precreate_pool_get_thread_mgr_callback_unlocked (data=0xb55d30f0, error_code=0) at ../pvfs2_src/src/io/job/job.c:4456
#7  0x0807fd3d in precreate_pool_get_handles_try_post (jd=0xb55d4110) at ../pvfs2_src/src/io/job/job.c:5930
#8  0x0807f5b9 in job_precreate_pool_get_handles (fsid=140299291, count=2, servers=0x0, handle_array=0xb55d41f0, flags=0, user_ptr=0xb5507c98, status_user_tag=0, out_status_p=0x9c23348, id=0xbffc11b0, context_id=0, hints=0xb5506a88) at ../pvfs2_src/src/io/job/job.c:5718
#9  0x0806c3cc in get_handles (smcb=0xb5507c98, js_p=0x9c23348) at ../pvfs2_src/src/server/unstuff.sm:267
#10 0x08095e06 in PINT_state_machine_invoke (smcb=0xb5507c98, r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
#11 0x080961c4 in PINT_state_machine_next (smcb=0xb5507c98, r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
#12 0x08096200 in PINT_state_machine_continue (smcb=0xb5507c98, r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
#13 0x0805667c in main (argc=6, argv=0xbffc1334) at ../pvfs2_src/src/server/pvfs2-server.c:413

---------------------
PatchSet 7929 
Date: 2010/05/13 14:10:06
Author: pcarns
Branch: HEAD
Tag: (none) 
Log:
experimental update of fix for precreate bug reported by Bart Taylor (original
fix led to a seg fault).

Members: 
	src/io/job/job.c:1.191->1.192 

Index: pvfs2-1/src/io/job/job.c
diff -u pvfs2-1/src/io/job/job.c:1.191 pvfs2-1/src/io/job/job.c:1.192
--- pvfs2-1/src/io/job/job.c:1.191	Thu May  6 15:42:34 2010
+++ pvfs2-1/src/io/job/job.c	Thu May 13 14:10:06 2010
@@ -4428,20 +4428,24 @@
     trove_pending_count--;
     assert(trove_pending_count >= 0);
 
-    tmp_trove->jd->u.precreate_pool.trove_pending--;
-    assert(tmp_trove->jd->u.precreate_pool.trove_pending >= 0);
-
     /* don't overwrite error codes from other trove ops */
     if(tmp_trove->jd->u.precreate_pool.error_code == 0)
     {
         tmp_trove->jd->u.precreate_pool.error_code = error_code;
     }
 
+    /* acquiring this mutex a little early so that it can also serve to
+     * prevent multiple trove operations from racing between decrementing 
+     * and then reading the pool.trove_pending counter
+     */
+    gen_mutex_lock(&completion_mutex);
+
+    tmp_trove->jd->u.precreate_pool.trove_pending--;
+    assert(tmp_trove->jd->u.precreate_pool.trove_pending >= 0);
+
     /* is this job done? */
     if(tmp_trove->jd->u.precreate_pool.trove_pending == 0)
     {
-        gen_mutex_lock(&completion_mutex);
-
         /* set job descriptor fields and put into completion queue */
         tmp_trove->jd->u.precreate_pool.error_code = 0;
         job_desc_q_add(completion_queue_array[tmp_trove->jd->context_id], 
@@ -4458,6 +4462,7 @@
         return;
     }
 
+    gen_mutex_unlock(&completion_mutex);
     return;
 }
 
@@ -5704,8 +5709,8 @@
     jd->u.precreate_pool.precreate_handle_index = 0;
     jd->u.precreate_pool.fsid = fsid;
     jd->u.precreate_pool.servers = servers;
-    jd->u.precreate_pool.trove_pending = 0;
     jd->u.precreate_pool.flags = flags;
+    jd->u.precreate_pool.trove_pending = 0;
 
     /* rotate to use a different starting server in the pool next time */
     gen_mutex_lock(&precreate_pool_mutex);
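
The fixed callback shape, reduced to the same sketch as in the first message of this thread (again with hypothetical names, not PVFS2 source): taking completion_mutex before the decrement makes the read-modify-write and the zero test one critical section, so exactly one callback can observe zero and complete the job.

/* continuation of the earlier race-sketch.c: same jd and
 * completion_mutex declarations; only the callback changes
 */
static void *trove_callback_fixed(void *unused)
{
    (void)unused;

    /* lock early, mirroring the hunk above, so the decrement and the
     * zero test can't be interleaved by another completing trove op
     */
    pthread_mutex_lock(&completion_mutex);

    jd->trove_pending--;

    if (jd->trove_pending == 0)
    {
        /* job is done: hand it to the completion queue (freed here
         * for the sketch) while still holding the mutex
         */
        free(jd);
        pthread_mutex_unlock(&completion_mutex);
        return NULL;
    }

    pthread_mutex_unlock(&completion_mutex);
    return NULL;
}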