Great find, guys. It looks like this was introduced with the state
machine (SM) changes a while back -- maybe no one removes the execute
bit from their directories, or we would hopefully have seen this sooner.
Another motivating instance for getting a good unit testing framework
and code coverage analysis set up.
Can you commit the fix to HEAD?
Thanks,
-sam
On May 28, 2008, at 3:29 PM, Nicholas Mills wrote:
Ok we narrowed it down to the lookup state machine. It seems like one
of the states was returning complete (1) after posting a job. As a
result the state machine was being freed while the job was still in
progress.
We changed the return value from SM_ACTION_COMPLETE to the return
value of the job and the server stopped crashing in all of our
previous test cases. A patch against HEAD is attached.
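To illustrate the failure mode described above, here is a minimal,
self-contained sketch. It is not the PVFS code and not the attached
patch. Only SM_ACTION_COMPLETE (with value 1) comes from this thread;
SM_ACTION_DEFERRED, its value, the toy_* names, and the simplified
driver that releases the state machine as soon as an action reports
completion are all assumptions made for the example.

```c
#include <stdbool.h>

/* SM_ACTION_COMPLETE (1) is named in the thread. SM_ACTION_DEFERRED and
 * its value are hypothetical stand-ins for "job still in progress". */
enum toy_sm_action { SM_ACTION_DEFERRED = 0, SM_ACTION_COMPLETE = 1 };

/* Hypothetical, heavily simplified state machine control block. */
struct toy_smcb {
    bool freed;        /* set when the toy driver releases the state machine */
    bool job_pending;  /* a posted job still refers to this state machine */
};

typedef enum toy_sm_action (*toy_action_fn)(struct toy_smcb *, bool);

/* Hypothetical job post: reports COMPLETE if the job finished right away,
 * DEFERRED if it will finish later. */
enum toy_sm_action toy_post_job(struct toy_smcb *smcb, bool immediate)
{
    smcb->job_pending = !immediate;
    return immediate ? SM_ACTION_COMPLETE : SM_ACTION_DEFERRED;
}

/* Buggy pattern: ignore what the job post said and always claim completion. */
enum toy_sm_action lookup_action_buggy(struct toy_smcb *smcb, bool immediate)
{
    toy_post_job(smcb, immediate);
    return SM_ACTION_COMPLETE;
}

/* Fixed pattern: hand the job's return value back to the driver. */
enum toy_sm_action lookup_action_fixed(struct toy_smcb *smcb, bool immediate)
{
    return toy_post_job(smcb, immediate);
}

/* Toy driver: releases the state machine when the action reports COMPLETE.
 * Returns true if that happened while a job was still pending. */
bool toy_freed_too_early(toy_action_fn action, bool immediate)
{
    struct toy_smcb smcb = { false, false };
    if (action(&smcb, immediate) == SM_ACTION_COMPLETE) {
        smcb.freed = true;
    }
    return smcb.freed && smcb.job_pending;
}
```

In this sketch, toy_freed_too_early(lookup_action_buggy, false) reports
the hazard, while the fixed action never does, whether or not the job
completes immediately.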
--Nick
On Wed, May 28, 2008 at 2:47 PM, David Bonnie <[EMAIL PROTECTED]> wrote:
Hey all -
Nick and I seem to have found a fairly hefty bug with the server
crashing when copying to/from a directory. Obviously this could cause
some serious problems if someone were to crash the server in the middle
of writing files.
Here's what we've got so far:
Copying to a PVFS folder (using pvfs2-cp) from both local and pvfs2
share space:
Permissions (of destination folder) / Result / Error
000 / Failure / server crashes on an assert(0)
100 / Success / NA
200 / Failure / server crashes with a "double free or corruption" error
300 / Success / NA
400 / Failure / server crashes on an assert(0)
500 / Success / NA
600 / Failure / server crashes on an assert(0)
700 / Success / NA
For 400 and 600, the server debug log says the following:
"SM current state or trtbl is invalid"
"state-machine-fns.c:241 PINT_state_machine_next assertion(0)"
As you can see, any write to a folder without execute permissions will
crash the server.
We checked the same things for reading from a PVFS folder (using
pvfs2-cp):
Permissions (of source folder) / Result / Error
000 / Failure / server crashes on an assert(0)
100 / Success / NA
200 / Failure / server crashes on the same assertion on line 241 as above
300 / Failure / server doesn't crash, but client will segfault
400 / Failure / server crashes on the same assertion on line 241 as above
500 / Success / NA
600 / Failure / server crashes on the same assertion on line 241 as above
700 / Success / NA
pvfs2-ls -l completes as normal for any combination of permissions.
It seems like one (or more) of the state machines is dumping out early
and throwing the whole thing out of whack. We recreated the storage
space between each run that failed to ensure that we weren't working
with a corrupted filespace (since the server was aborting). Any ideas?
This is happening with the code from HEAD on Red Hat Enterprise 5.
- Dave
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
<lookup.patch>