Hi guys,

We have run into a problem recently with a configuration that looks like this:

- x86_64 architecture
- 16 servers
- SAN based storage
- approximately 1.4 million files on PVFS

Everything works fine, except when we stop and then later restart one of the pvfs2-server daemons. At least one of them usually (but not quite always) crashes before the file system is ready to be mounted.

We captured a core file and can see that it died on this assertion in the dbpf_dspace_test() function:

dbpf-dspace.c:1371
assert(!dbpf_op_queue_empty(dbpf_completion_queue_array[context_id]));

According to the stack trace, this test() call followed a trove_dspace_iterate_handles() call within the trove_check_handle_ranges() function. This is part of the logic on startup that scans all of the handles in the storage space to update the list of available/used handles in trove-handle-mgmt.

We found that we can completely work around the problem by manually setting the coll_p->immediate_completion flag during the trove_check_handle_ranges() function. That forces the iterate_handles() function to do all of its processing up front, so no test function is needed. The bad interaction appears to occur only when iterate_handles() and the test function are used together.
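To make the two completion styles concrete, here is a toy model in C. It is only a sketch. Every toy_* name is hypothetical and none of this is the real Trove/DBPF code; it assumes the workaround amounts to saving the flag, forcing it on for the scan, and restoring it afterwards. It does not reproduce the crash or explain the root cause.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a collection; only models the one flag. */
typedef struct {
    bool immediate_completion;
} toy_collection;

/* Hypothetical stand-in for a per-context operation queue pair. */
typedef struct {
    int queued;     /* posted, not yet serviced */
    int completed;  /* serviced, waiting to be reaped by toy_test() */
} toy_context;

/* Returns 1 if the work finished inline, 0 if the caller must
 * call toy_test() later to reap the completion. */
static int toy_iterate_handles(const toy_collection *coll, toy_context *ctx)
{
    if (coll->immediate_completion) {
        return 1;           /* all processing done up front */
    }
    ctx->queued++;          /* deferred: completion arrives later */
    return 0;
}

/* Simulates a background worker moving queued ops to the completion queue. */
static void toy_service(toy_context *ctx)
{
    ctx->completed += ctx->queued;
    ctx->queued = 0;
}

/* Returns 1 if a completed op was reaped, 0 if nothing is complete yet. */
static int toy_test(toy_context *ctx)
{
    if (ctx->completed == 0) {
        return 0;
    }
    ctx->completed--;
    return 1;
}

/* Workaround sketch: force inline completion only for the startup scan,
 * then restore whatever the flag was before. */
static int toy_check_handle_ranges(toy_collection *coll, toy_context *ctx)
{
    bool saved = coll->immediate_completion;
    coll->immediate_completion = true;
    int rc = toy_iterate_handles(coll, ctx);
    coll->immediate_completion = saved;
    assert(rc == 1);        /* the inline path never needs toy_test() */
    return rc;
}
```

In the deferred path the caller depends on toy_test() finding something in the completion queue; in the forced-inline path that dependency goes away entirely, which is the property the workaround relies on.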

As a side note, setting the "ImmediateCompletion" config file option does not work around the problem, because that flag does not take effect until after this assertion occurs. The set_info calls in pvfs2-server just happen to be in the wrong order. We would probably not have used this approach anyway, because we haven't fully tested the performance impact of enabling immediate completion for everything.

Anyone have any suggestions about what the real problem is here? While the workaround is fine to keep us running for now, it seems like there is an underlying issue to be addressed.

I apologize that I don't have an exact stack dump to paste in the email, but if we need any further information from the core file I think I can still get it loaded up on another machine to look at.

Oh, and one other detail: the memory usage of the servers looks fine during startup, so this doesn't appear to be a memory leak. There is quite a bit of CPU work, but I am guessing that is just Berkeley DB keeping busy in the iteration function.

thanks,
-Phil