Hi guys,
We have run into a problem recently with a configuration that looks like
this:
- x86_64 architecture
- 16 servers
- SAN based storage
- approximately 1.4 million files on PVFS
Everything works fine, except when we stop and then later restart one of
the pvfs2-server daemons. At least one of them usually (but not quite
always) crashes before the file system is ready to be mounted.
We captured a core file and can see that it died on this assertion in
the dbpf_dspace_test() function:
dbpf-dspace.c:1371
assert(!dbpf_op_queue_empty(dbpf_completion_queue_array[context_id]));
According to the stack trace, this test() call followed a
trove_dspace_iterate_handles() call within the
trove_check_handle_ranges() function. This is part of the logic on
startup that scans all of the handles in the storage space to update the
list of available/used handles in trove-handle-mgmt.
We found that we can completely work around the problem by manually
setting the coll_p->immediate_completion flag during the
trove_check_handle_ranges() function. That forces the iterate_handles()
function to do all of its processing up front without using a test
function. There seems to be some sort of bad interaction when
iterate_handles() and the test function are used together.
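To make the shape of that workaround concrete, here is a rough sketch. It
is a toy model, not the actual change: every toy_* name is hypothetical,
and only the immediate_completion flag and the non-empty-queue assertion
come from the real code. Restoring the flag after the scan is an
assumption in this sketch.

```c
#include <assert.h>

/* Hypothetical stand-ins; these are NOT the real trove/dbpf types. */
struct toy_coll {
    int immediate_completion; /* models coll_p->immediate_completion */
    int completion_queue_len; /* models the per-context completion queue */
    int test_calls;           /* how many times the test path was taken */
};

/* Models an iterate call: returns 1 if the work finished up front,
 * 0 if the caller has to call toy_test() to collect the completion. */
static int toy_iterate_handles(struct toy_coll *c)
{
    if (c->immediate_completion)
        return 1;               /* all processing done before returning */
    c->completion_queue_len++;  /* deferred: completion lands on the queue */
    return 0;
}

/* Models the test call, including the non-empty-queue assertion. */
static void toy_test(struct toy_coll *c)
{
    assert(c->completion_queue_len > 0);
    c->completion_queue_len--;
    c->test_calls++;
}

/* Models the startup scan with the workaround applied: force immediate
 * completion for the scan only, then restore the configured value
 * (the restore step is an assumption of this sketch). */
static void toy_check_handle_ranges(struct toy_coll *c, int batches)
{
    int saved = c->immediate_completion;
    c->immediate_completion = 1;
    for (int i = 0; i < batches; i++) {
        if (!toy_iterate_handles(c))
            toy_test(c);        /* never reached while the flag is forced */
    }
    c->immediate_completion = saved;
}
```

With the flag forced on, the scan never enters the test path, so the
assertion cannot be reached from there. This says nothing about why the
queue is empty in the real server.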
As a side note, setting the "ImmediateCompletion" config file option
does not work around the problem, because that flag does not take effect
until after this assertion occurs. The set_info calls in pvfs2-server
just happen to be in the wrong order. We would probably not have used
this approach anyway, because we haven't fully tested the performance
impact of enabling immediate completion for everything.
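The ordering issue can be illustrated with another toy model. Again, all
toy_* names are hypothetical and this is not pvfs2-server code or a
proposed patch. It only shows why a flag applied after the startup scan
cannot influence that scan.

```c
#include <assert.h>

/* Hypothetical model of server startup state. */
struct toy_server {
    int immediate_completion; /* runtime flag the scan consults */
    int scan_saw_immediate;   /* what the startup scan actually observed */
};

/* Models applying the config file option to the running server. */
static void toy_apply_config(struct toy_server *s, int cfg_immediate)
{
    s->immediate_completion = cfg_immediate;
}

/* Models the startup handle scan reading the flag at the time it runs. */
static void toy_startup_scan(struct toy_server *s)
{
    s->scan_saw_immediate = s->immediate_completion;
}

/* Order described in this email: the scan runs before the option is set. */
static int toy_startup_current_order(int cfg_immediate)
{
    struct toy_server s = {0, 0};
    toy_startup_scan(&s);
    toy_apply_config(&s, cfg_immediate);
    return s.scan_saw_immediate;
}

/* Hypothetical swapped order: the option is set before the scan runs. */
static int toy_startup_swapped_order(int cfg_immediate)
{
    struct toy_server s = {0, 0};
    toy_apply_config(&s, cfg_immediate);
    toy_startup_scan(&s);
    return s.scan_saw_immediate;
}
```

In this model, toy_startup_current_order(1) returns 0 because the scan
ran before the option was applied, while toy_startup_swapped_order(1)
returns 1.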
Does anyone have any suggestions about what the real problem is here? While
the workaround is fine to keep us running for now, it seems like there
is an underlying issue to be addressed.
I apologize that I don't have an exact stack dump to paste in the email,
but if we need any further information from the core file I think I can
still get it loaded up on another machine to look at.
Oh, and one other detail: the memory usage of the servers looks fine
during startup, so this doesn't appear to be a memory leak. There is
quite a bit of CPU work, but I am guessing that is just Berkeley DB
keeping busy in the iteration function.
thanks,
-Phil