OK, we narrowed it down to the lookup state machine. It looks like one
of the states was returning SM_ACTION_COMPLETE (1) right after posting
a job, whether or not the job had actually finished. As a result, the
state machine was being freed while the job was still in progress.
We changed the return value from SM_ACTION_COMPLETE to the return
value of the job-posting call, and the server stopped crashing in all
of our previous test cases. A patch against HEAD is attached.
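
For anyone who wants the failure mode in miniature, here is a rough toy
model of it. This is only a sketch and not the PVFS2 code: every name in
it (toy_sm, toy_post_job, toy_run and so on) is made up for illustration,
and it assumes a driver that tears a state machine down as soon as a
state reports completion.

```c
#include <assert.h>
#include <stdio.h>

/* Toy model only: all names here are hypothetical, not the PVFS2 API. */
typedef enum { TOY_ACTION_COMPLETE = 1, TOY_ACTION_DEFERRED = 2 } toy_action;

typedef struct {
    int job_pending; /* nonzero while an asynchronous job is outstanding */
    int torn_down;   /* set where a real driver would free the machine */
} toy_sm;

/* Pretend to post an asynchronous job; it is still running on return. */
toy_action toy_post_job(toy_sm *sm)
{
    sm->job_pending = 1;
    return TOY_ACTION_DEFERRED;
}

/* Buggy state: posts a job, then claims the state is already complete. */
toy_action buggy_state(toy_sm *sm)
{
    toy_post_job(sm);
    return TOY_ACTION_COMPLETE;
}

/* Fixed state: passes the job's own return value back to the driver. */
toy_action fixed_state(toy_sm *sm)
{
    toy_action ret = toy_post_job(sm);
    return ret;
}

/*
 * Toy driver: tears the machine down when a state reports completion.
 * Returns 1 if the machine was torn down with a job still pending,
 * which is the hazard described above, and 0 otherwise.
 */
int toy_run(toy_action (*state)(toy_sm *))
{
    toy_sm sm = {0, 0};

    if (state(&sm) == TOY_ACTION_COMPLETE)
        sm.torn_down = 1;

    return sm.torn_down && sm.job_pending;
}
```

Calling toy_run(buggy_state) reports the hazard, while
toy_run(fixed_state) does not, because the deferred result keeps the
driver from tearing the machine down early.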
--Nick
On Wed, May 28, 2008 at 2:47 PM, David Bonnie <[EMAIL PROTECTED]> wrote:
> Hey all -
>
> Nick and I seem to have found a fairly hefty bug with the server crashing
> when copying to/from a directory. Obviously this could cause some serious
> problems if someone were to crash the server in the middle of writing
> files.
>
> Here's what we've got so far:
>
> Copying to a PVFS folder (using pvfs2-cp) from both local and pvfs2 share
> space:
> Permissions (of destination folder) / Result / Error
>
> 000 / Failure / server crashes on an assert(0)
> 100 / Success / NA
> 200 / Failure / server crashes with a "double free or corruption" error
> 300 / Success / NA
> 400 / Failure / server crashes on an assert(0)
> 500 / Success / NA
> 600 / Failure / server crashes on an assert(0)
> 700 / Success / NA
>
> For 400 and 600, the server debug log says the following:
> "SM current state or trtbl is invalid"
> "state-machine-fns.c:241 PINT_state_machine_next assertion(0)"
>
> As you can see, any write to a folder without execute permissions will
> crash the server.
>
>
> We checked the same things for reading from a PVFS folder (using pvfs2-cp):
> Permissions (of source folder) / Result / Error
>
> 000 / Failure / server crashes on an assert(0)
> 100 / Success / NA
> 200 / Failure / server crashes on the same assertion on line 241 as above
> 300 / Failure / server doesn't crash, but client will segfault
> 400 / Failure / server crashes on the same assertion on line 241 as above
> 500 / Success / NA
> 600 / Failure / server crashes on the same assertion on line 241 as above
> 700 / Success / NA
>
> pvfs2-ls -l completes as normal for any combination of permissions.
>
> It seems like one (or more) of the state machines is dumping out early
> and throwing the whole thing out of whack. We recreated the storage space
> between each run that failed to ensure that we weren't working with a
> corrupted filespace (since the server was aborting). Any ideas?
>
> This is happening with the code from HEAD on Red Hat Enterprise 5.
>
> - Dave
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
Index: src/server/lookup.sm
===================================================================
RCS file: /anoncvs/pvfs2/src/server/lookup.sm,v
retrieving revision 1.57
diff -u -p -r1.57 lookup.sm
--- src/server/lookup.sm 11 Feb 2008 17:25:29 -0000 1.57
+++ src/server/lookup.sm 28 May 2008 20:23:49 -0000
@@ -412,7 +412,8 @@ static PINT_sm_action lookup_check_acls_
js_p,
&i,
server_job_context);
- return SM_ACTION_COMPLETE;
+
+ return ret;
}
/*