Hi Phil, others,

I think I am seeing the same problem (or some variant of it) as the NFSv4 bug that I reported a few days ago. Here is a very simple way that I am able to reproduce it reliably. Mount a pvfs2 volume and run a few instances of:

dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100

All the dd instances simply wait for something to happen, client-core and the servers are also waiting for something to happen, and no process makes any progress. It is perhaps related to the pipe changes to bmi-tcp with poll? Let me revert those changes and try that out as well.

Thanks,
Murali
On Tue, 13 Jun 2006, Phil Carns wrote:

> I went back and added some much more specific debugging messages, put
> some special prefixes on flow, bmi, and request processor messages so I
> could group them a little more easily, and got rid of the extra mutexes.
>
> After running a few more tests and double checking the logs, this is
> what I am seeing in every failure case:
>
> - it is a write flow
> - it gets broken into two buffers within the flow protocol
> - the first is 256K
> - the second is 64K
> - both sides _do_ get the request processing correct
> - both sides post the correct BMI operations
> - the server posts 2 receives
> - the client posts 2 sends
> - the second data message never shows up at the server
>
> I must have missed part of the flow/request processing on the client
> side when I looked before. Sorry about the wild goose chase with the
> request processing question. The request processing definitely looks
> correct in the logs that I am seeing now (or at least the client and
> server agree on it).
>
> BMI has logging that shows when incoming messages appear on a socket,
> and it includes the tag number. The 64K message never shows up on a
> socket at the server side. The BMI operation therefore never completes,
> and the callback never gets triggered to complete the flow.
>
> I don't know if this is a bmi_tcp problem, or if there is some sort of
> general network wackiness in my environment. I'll try to revert to some
> older versions of bmi_tcp tomorrow and see if the problem persists. I
> already tried backing out just the changes to the poll notification in
> socket-collection.[ch] and (assuming I got that right) it didn't make
> any difference.
>
> -Phil
>
> Phil Carns wrote:
> > Hi Sam,
> >
> > I may be wrong, but so far it looks like the problem is a little
> > different this time.
> > In the scenario that I am seeing now it doesn't
> > look like the flow finished everything and then just failed to mark
> > completion; it looks like one side actually posted more BMI messages
> > than the other. The server posted two receives of 256K, while the
> > client only posted one send of 256K. I don't see any zero-byte results
> > from the request processor or any cases where the flow should have
> > marked completion.
> >
> > I'm still trying to get a better log. After adding mutexes in
> > PINT_process_request() on the server side, the problem has gotten much
> > harder to reproduce. I don't know if there is any correlation or if I
> > am just perturbing the timing enough to hide the problem. So far, out
> > of 9 runs I have had only 1 failure, and the logs from that failure
> > were inconclusive.
> >
> > I definitely agree with both of your points about reworking the flow
> > code; it is difficult to debug right now.
> >
> > -Phil
> >
> > Sam Lang wrote:
> >
> >> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
> >>
> >>> Hi all,
> >>>
> >>> I am looking at an I/O problem that I don't completely understand.
> >>> The setup is that there are 15 servers and 20 clients (all RHEL3
> >>> SMP). The clients are running a proprietary application. At the
> >>> end of the run they each write their share of a data set into a 36
> >>> GB file, so each is doing a contiguous write of 36/20 GB of data.
> >>>
> >>> Roughly 80-90% of the time, one or more of the clients and servers
> >>> hang, though it isn't predictable which one will do it. After quite
> >>> a bit of digging through logs, it turns out that there is a flow
> >>> that never completes on the server side. There are two problems
> >>> contributing here:
> >>>
> >>> 1) The flow timeout mechanism doesn't seem to be working right for
> >>> me (the timer for the flow pops and checks progress one time, then
> >>> never gets triggered again).
> >>> This isn't the core problem, but it is
> >>> causing the flow to get stuck forever and hold up the request
> >>> scheduler rather than giving up eventually.
> >>>
> >>> 2) For this particular flow, the client and server appear to
> >>> disagree on how much data to exchange. The client sends one 256K
> >>> buffer and completes the flow. The server receives that 256K buffer
> >>> but still expects at least one more to arrive (which never does).
> >>
> >> Hi Phil, these problems sound very similar to the bug I was seeing
> >> about a month ago in the flow code for certain request types. The
> >> particular case I was looking at was with reads, so given that
> >> there's a lot of code duplication in flow, I could imagine the same
> >> problem existing (but not getting fixed) for writes. It may not be
> >> the same as the problem you're seeing, but I'll walk through the bug
> >> and how I fixed it, in case it helps.
> >>
> >> Basically, the request was ending right at the boundary of a flow
> >> buffer (so bytes == bytemax). This caused the request processing code
> >> to finish and bmi_send_callback_fn to return, but it would get
> >> called again by the post code for the next buffer, except this time
> >> the request processing would return 0 bytes, and this queue item
> >> would get marked as the last one. That last queue item was the
> >> problem: since trove_bstream_read_list was never getting posted, the
> >> queue item whose last field was set never made it into
> >> trove_read_callback_fn, so the dest_last_posted field for the flow
> >> data wasn't getting set either.
> >>
> >> The fix was to add some extra checking in bmi_send_callback_fn. In
> >> the case that the request processing returns 0 bytes processed, I can
> >> set the dest_last_posted field to 1 immediately.
> >> The only caveat
> >> is the case where other queue items in the source list haven't made
> >> it to the dest list yet, so we need to check for that. The diff of
> >> the changes I made might help:
> >>
> >> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
> >>
> >> In any case, I wasn't able to make use of the debug messages from
> >> the request processing code; they were too verbose to make any sense
> >> of. I think I ended up just adding debug statements for the entry
> >> and exit points of the flow callbacks, along with all the different
> >> side-state fields as they changed. I'm not sure those debug
> >> statements made it into the trunk though. :-(
> >>
> >> Since we're here, I think the flow code could benefit from some
> >> reworking in two ways:
> >>
> >> 1. Fewer side-state variables. Each queue item has fields that
> >> change the behavior of when things should complete. It's hard to
> >> keep track of all that.
> >>
> >> 2. More code abstraction. We end up doing very similar things
> >> between reads and writes on the server (and on the client too),
> >> especially with respect to the request processing. It might make
> >> sense to abstract some of that duplicated code into separate
> >> functions.
> >>
> >> -sam
> >>
> >>> I am pretty sure I can get to the bottom of the first problem. The
> >>> second one is trickier.
> >>>
> >>> I attached a few log file excerpts:
> >>>
> >>> client-1st.txt: the request processing log output from the client
> >>> side when it is working on the flow in question
> >>>
> >>> server-1st.txt: request processing log output from the first
> >>> iteration of the server-side flow
> >>>
> >>> server-2nd.txt: request processing log output from the second
> >>> iteration of the server-side flow
> >>>
> >>> Anyone have any suggestions, or see any indication from those log
> >>> portions about why the server and client aren't on the same page?
> >>>
> >>> Some of this log output (server-2nd.txt in particular?) looks a
> >>> little jumbled, as if maybe there are two flows doing request
> >>> processing at once, but I don't really know how to interpret it.
> >>>
> >>> I may try adding a mutex around PINT_process_request() next, just
> >>> to see if I can make the logs clearer by making sure only one
> >>> invocation is going at once.
> >>>
> >>> -Phil
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
