Sam Lang wrote:
Hi All,
I think Nawab has found a bug (or untested code path) in the BMI tcp
method. He's running a daemon that both receives unexpected requests
(as a server), and receives expected responses (as a client).
In the BMI_testcontext call, if there aren't any completed (expected)
operations, and there are completed unexpected receives, we return
immediately, assuming that BMI_testunexpected will be called in turn. I
think the idea here is that we want to keep our latency down for
unexpected messages, instead of doing work on expected messages while
unexpected messages are waiting in the hopper. But the daemon is
single-threaded and makes blocking PVFS_sys_* calls, so nothing calls
BMI_testunexpected to drain the queue, and we essentially spin forever
calling BMI_testcontext over and over.
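To make the failure mode concrete, here is a toy model of the early return described above. It is only a sketch and not the actual bmi_tcp code: every name in it (toy_method, toy_do_work, toy_testcontext, toy_blocking_wait) is hypothetical, and the counters stand in for the real completion queues.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the state of a BMI method; all names are hypothetical. */
struct toy_method {
    int completed_expected;   /* expected ops ready to hand back */
    int completed_unexpected; /* unexpected receives waiting to be collected */
    int in_flight_expected;   /* expected ops that still need network progress */
};

/* Stand-in for tcp_do_work: completes at most one in-flight expected op. */
static void toy_do_work(struct toy_method *m)
{
    if (m->in_flight_expected > 0) {
        m->in_flight_expected--;
        m->completed_expected++;
    }
}

/* Models the testcontext logic: returns 1 if an expected op was handed back. */
static int toy_testcontext(struct toy_method *m)
{
    assert(m != NULL);
    if (m->completed_expected == 0) {
        if (m->completed_unexpected > 0)
            return 0; /* early return: assumes testunexpected is called next */
        toy_do_work(m);
    }
    if (m->completed_expected > 0) {
        m->completed_expected--;
        return 1;
    }
    return 0;
}

/* Models a blocking caller that only ever calls testcontext.
 * Returns the number of calls needed, or -1 if max_calls was not enough. */
static int toy_blocking_wait(struct toy_method *m, int max_calls)
{
    for (int i = 1; i <= max_calls; i++) {
        if (toy_testcontext(m) == 1)
            return i;
    }
    return -1;
}
```

With one queued unexpected receive and one in-flight expected response, toy_blocking_wait never makes progress no matter how large max_calls is, because toy_do_work is never reached.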
I'm not sure of the best way to fix this. Easy fixes would be to remove
the check for completed unexpected receives, and/or do tcp_do_work for a
shorter timeout.
It seems like we have a special case for blocking PVFS_sys_* calls. We
want to ignore unexpected receives just in that case, and actually call
tcp_do_work. In other contexts, I think we want the behavior that we
have now, where we assume that a BMI_testunexpected call will follow a
BMI_testcontext call. We could modify the testcontext call to take a
separate parameter, but that seems messy. We might also be able to
handle this with separate BMI contexts somehow...
I haven't dug into the code yet to see whether there is a more elegant
way to handle it, but I wanted to mention that if you want to add a
special flag to toggle the behavior, it might be better to just set it
globally with the set_info() function rather than modifying the
testcontext() API. That way you don't have to change any of the other
BMI methods.
There are already a couple of similar set_info() calls to toggle BMI
behavior for different use cases.
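A rough sketch of what I mean, using a toy model rather than the real BMI code. The option name DEMO_IGNORE_UNEXPECTED, the demo_set_info() entry point, and the demo_method struct are all hypothetical, made up for illustration; they are not existing BMI options or functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical option id; not an existing BMI set_info option. */
enum demo_option { DEMO_IGNORE_UNEXPECTED = 1 };

/* Toy stand-in for the method state; counters replace the real queues. */
struct demo_method {
    int completed_expected;
    int completed_unexpected;
    int in_flight_expected;
};

/* Method-wide toggle, off by default so existing callers see no change. */
static int demo_ignore_unexpected = 0;

/* set_info-style entry point: 0 on success, -1 on bad option or parameter. */
static int demo_set_info(int option, void *inout_parameter)
{
    if (inout_parameter == NULL)
        return -1;
    switch (option) {
    case DEMO_IGNORE_UNEXPECTED:
        demo_ignore_unexpected = (*(int *)inout_parameter != 0);
        return 0;
    default:
        return -1;
    }
}

/* testcontext keeps its signature; only the early-return check changes. */
static int demo_testcontext(struct demo_method *m)
{
    assert(m != NULL);
    if (m->completed_expected == 0) {
        if (m->completed_unexpected > 0 && !demo_ignore_unexpected)
            return 0; /* current behavior: leave room for testunexpected */
        /* stand-in for tcp_do_work: complete one in-flight expected op */
        if (m->in_flight_expected > 0) {
            m->in_flight_expected--;
            m->completed_expected++;
        }
    }
    if (m->completed_expected > 0) {
        m->completed_expected--;
        return 1;
    }
    return 0;
}
```

The blocking PVFS_sys_* path would turn the toggle on once at startup; everyone else keeps the default. The queued unexpected receive is left untouched either way.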
-Phil