I tried out Pete's suggestion and all of the nodes ran overnight without
any trouble, about 7 million scans per server so far.
The only modification was to the DBPF_COMPLETION_START macro: getting
the queue mutex first and also doing a queue add before touching the state.
I think the change is safe to commit if it looks ok on your end...
-Phil
Phil Carns wrote:
It ended up taking a little work to get another environment to trigger
this reliably, but I think I have something now.
I modified the iterate_handles() function a bit so that it keeps
scanning over and over again indefinitely rather than letting the server
start up. This forces the code path in question without having to
restart the servers. Using this setup I'm able to trigger it on an
empty 8 node file system, but I have to leave all of the servers running
on it anywhere from a few minutes to half an hour before one of them
crashes. Oddly enough, with this environment it crashes faster on an
empty file system than on one with 500,000 files.
I repeated this test with the latest HEAD version from trunk, and that
didn't seem to make any difference.
I'll try the mutex suggestion next.
-Phil
Sam Lang wrote:
On Feb 20, 2007, at 11:32 AM, Pete Wyckoff wrote:
[EMAIL PROTECTED] wrote on Tue, 20 Feb 2007 07:29 -0500:
dbpf-dspace.c:1371
assert(!dbpf_op_queue_empty(dbpf_completion_queue_array[context_id]));
According to the stack trace, this test() call followed a
trove_dspace_iterate_handles() call within the
trove_check_handle_ranges() function. This is part of the logic on
startup that scans all of the handles in the storage space to
update the
list of available/used handles in trove-handle-mgmt.
Another thought for Sam, who knows this code better.
(1) DBPF_COMPLETION_START modifies cur_op->op.state without holding the
dbpf_completion_queue_array_mutex[cid] mutex. Then it grabs the
mutex and puts the op on the completion array.
(2) dbpf_dspace_test grabs that mutex, looks at op.state, then asserts
that the queue must not be empty.
Perhaps (1) modifies the state but doesn't get around to putting the op
on the completion array, possibly because the lock is held by (2).
Good point Pete. Given that this seems to be the race Phil is seeing,
your theory seems more likely.
Maybe (1) should put the op on the array before modifying its state,
and hold the array mutex the whole time. I'm not sure what kind of
locking rules are involved between the mutex in the op and the mutex
on the completion array, though. Or what else might break with such
a change.
I don't think there should be any problems with doing this. It
probably doesn't matter when the op is added to the completion queue
(before or after its state gets changed), just that the completion
queue's mutex gets locked before either (at the top of
DBPF_COMPLETION_START). I wonder if Phil could make this change and
run his tests again.
-- Pete
_______________________________________________
Pvfs2-developers mailing list
Pvfs2-developers@beowulf-underground.org
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers