Hey Bart,
I haven't really tested this change yet, but can you try the attached
patch and see if it seems to solve the problem? I think this is a
follow-on to the same bug you guys reported earlier; I just missed
another race condition introduced by the last patch.
-Phil
On 05/12/2010 05:18 PM, Bart Taylor wrote:
Hey guys,
I have a 3 node local disk file system that had a core dump during
some testing. It is a file system upgraded from 2.6 to 2.8.2. After
the upgrade, I ran a couple of utilities like pvfs2-ping and
pvfs2-statfs.
After those succeeded, I attempted to create a new file of around
800K, and the first server died. There wasn't anything useful in the
logs or dmesg. Below is a backtrace from the core file. I can supply
the entire file, but I can't email it at 43M.
This may be related to the precreate-pool-race patch from a few days
ago since the backtrace indicates it was in the vicinity of those code
changes.
Let me know what else I can supply that will help.
Bart.
(gdb) bt
#0 0x009e37a2 in _dl_sysinfo_int80 () from /lib/ld-linux.so.2
#1 0x00a24825 in raise () from /lib/tls/libc.so.6
#2 0x00a26289 in abort () from /lib/tls/libc.so.6
#3 0x00a58d2a in __libc_message () from /lib/tls/libc.so.6
#4 0x00a5f72f in _int_free () from /lib/tls/libc.so.6
#5 0x00a5fbaa in free () from /lib/tls/libc.so.6
#6 0x0807d6e5 in precreate_pool_get_thread_mgr_callback_unlocked
(data=0xb55d30f0, error_code=0) at ../pvfs2_src/src/io/job/job.c:4456
#7 0x0807fd3d in precreate_pool_get_handles_try_post (jd=0xb55d4110)
at ../pvfs2_src/src/io/job/job.c:5930
#8 0x0807f5b9 in job_precreate_pool_get_handles (fsid=140299291,
count=2, servers=0x0, handle_array=0xb55d41f0, flags=0,
user_ptr=0xb5507c98,
status_user_tag=0, out_status_p=0x9c23348, id=0xbffc11b0,
context_id=0, hints=0xb5506a88) at ../pvfs2_src/src/io/job/job.c:5718
#9 0x0806c3cc in get_handles (smcb=0xb5507c98, js_p=0x9c23348) at
../pvfs2_src/src/server/unstuff.sm:267
#10 0x08095e06 in PINT_state_machine_invoke (smcb=0xb5507c98,
r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:132
#11 0x080961c4 in PINT_state_machine_next (smcb=0xb5507c98,
r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:309
#12 0x08096200 in PINT_state_machine_continue (smcb=0xb5507c98,
r=0x9c23348) at ../pvfs2_src/src/common/misc/state-machine-fns.c:327
#13 0x0805667c in main (argc=6, argv=0xbffc1334) at
../pvfs2_src/src/server/pvfs2-server.c:413
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
---------------------
PatchSet 7929
Date: 2010/05/13 14:10:06
Author: pcarns
Branch: HEAD
Tag: (none)
Log:
experimental update of fix for precreate bug reported by Bart Taylor (original
fix led to a seg fault).
Members:
src/io/job/job.c:1.191->1.192
Index: pvfs2-1/src/io/job/job.c
diff -u pvfs2-1/src/io/job/job.c:1.191 pvfs2-1/src/io/job/job.c:1.192
--- pvfs2-1/src/io/job/job.c:1.191 Thu May 6 15:42:34 2010
+++ pvfs2-1/src/io/job/job.c Thu May 13 14:10:06 2010
@@ -4428,20 +4428,24 @@
trove_pending_count--;
assert(trove_pending_count >= 0);
- tmp_trove->jd->u.precreate_pool.trove_pending--;
- assert(tmp_trove->jd->u.precreate_pool.trove_pending >= 0);
-
/* don't overwrite error codes from other trove ops */
if(tmp_trove->jd->u.precreate_pool.error_code == 0)
{
tmp_trove->jd->u.precreate_pool.error_code = error_code;
}
+ /* acquiring this mutex a little early so that it can also serve to
+ * prevent multiple trove operations from racing between decrementing
+ * and then reading the pool.trove_pending counter
+ */
+ gen_mutex_lock(&completion_mutex);
+
+ tmp_trove->jd->u.precreate_pool.trove_pending--;
+ assert(tmp_trove->jd->u.precreate_pool.trove_pending >= 0);
+
/* is this job done? */
if(tmp_trove->jd->u.precreate_pool.trove_pending == 0)
{
- gen_mutex_lock(&completion_mutex);
-
/* set job descriptor fields and put into completion queue */
tmp_trove->jd->u.precreate_pool.error_code = 0;
job_desc_q_add(completion_queue_array[tmp_trove->jd->context_id],
@@ -4458,6 +4462,7 @@
return;
}
+ gen_mutex_unlock(&completion_mutex);
return;
}
@@ -5704,8 +5709,8 @@
jd->u.precreate_pool.precreate_handle_index = 0;
jd->u.precreate_pool.fsid = fsid;
jd->u.precreate_pool.servers = servers;
- jd->u.precreate_pool.trove_pending = 0;
jd->u.precreate_pool.flags = flags;
+ jd->u.precreate_pool.trove_pending = 0;
/* rotate to use a different starting server in the pool next time */
gen_mutex_lock(&precreate_pool_mutex);