Hi Phil, others,
I think I am seeing the same problem (or some variant of it) as the
NFSv4 bug that I reported a few days ago.

Here is a very simple way that I am able to reproduce it reliably:
mount a pvfs2 volume and run a few instances of

dd if=/dev/zero of=/mnt/pvfs2/file bs=2M count=100

All the dd instances simply wait for something to happen; client-core and
the servers are also waiting for something to happen, and no process
makes any progress.

It is perhaps related to the pipe changes to bmi-tcp with poll?  Let me
revert those changes and try it out as well.
Thanks,
Murali


On Tue, 13 Jun 2006, Phil Carns wrote:

> I went back and added some much more specific debugging messages and put
> some special prefixes on flow, bmi, and request processor messages so I
> could group them a little easier, and got rid of the extra mutexes.
>
> After running a few more tests and double checking the logs, this is
> what I am seeing in every failure case:
>
> - it is a write flow
> - it gets broken into two buffers within the flow protocol
>    - the first is 256K
>    - the second is 64K
> - both sides _do_ get the request processing correct
> - both sides post the correct BMI operations
>    - the server posts 2 receives
>    - the client posts 2 sends
> - the second data message never shows up at the server
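For reference, the 256K + 64K split above is exactly what you would expect from carving a 320K transfer into fixed 256K flow buffers. A tiny illustrative helper (hypothetical names, not the actual PVFS2 flow code):

```c
/* Hypothetical helper (not the actual PVFS2 flow code) showing how a
 * transfer splits into fixed-size flow buffers.  A 320K write with a
 * 256K buffer size yields the 256K + 64K pair seen in the logs. */
#include <stddef.h>

#define FLOW_BUFFER_SIZE (256 * 1024)

/* Fills sizes[] with per-buffer byte counts; returns buffer count. */
size_t split_into_buffers(size_t total, size_t *sizes, size_t max_bufs)
{
    size_t n = 0;
    while (total > 0 && n < max_bufs) {
        size_t chunk = total < FLOW_BUFFER_SIZE ? total : FLOW_BUFFER_SIZE;
        sizes[n++] = chunk;
        total -= chunk;
    }
    return n;
}
```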
>
> I must have missed part of the flow/request processing on the client
> side when I looked before.  Sorry about the wild goose chase with the
> request processing question.  The request processing definitely looks
> correct in the logs that I am seeing now (or at least both the client
> and server agree on it anyway).
>
> BMI has logging that shows when incoming messages appear on a socket,
> and it includes the tag number.  The 64K message never shows up on a
> socket at the server side.  The BMI operation therefore never completes,
> and the callback never gets triggered to complete the flow.
>
> I don't know if this is a bmi_tcp problem, or if there is some sort of
> general network wackiness in my environment.  I'll try to revert to some
> older versions of bmi_tcp tomorrow and see if the problem persists.  I
> already tried just backing out changes to the poll notification in
> socket-collection.[ch] and (assuming I got that right) it didn't make
> any difference.
>
> -Phil
>
> Phil Carns wrote:
> > Hi Sam,
> >
> > I may be wrong, but so far it looks like the problem is a little
> > different this time.  In the scenario that I am seeing now it doesn't
> > look like the flow has finished everything and then just failed to mark
> > completion; it looks like one side actually posted more BMI messages
> > than the other.  The server posted two receives of 256K, while the
> > client only posted one send of 256K.  I don't see any zero byte results
> > from the request processor or any cases where the flow should have
> > marked completion.
> >
> > I'm still trying to get a better log.  After adding mutexes in
> > PINT_process_request() on the server side, the problem has gotten much
> > harder to reproduce.  I don't know if there is any correlation or if I
> > am just perturbing the timing enough to hide the problem.  So far out of
> > 9 runs I have had only 1 failure, and the logs from that failure were
> > inconclusive.
> >
> > I definitely agree with both of your points about reworking the flow
> > code; it is difficult to debug right now.
> >
> > -Phil
> >
> > Sam Lang wrote:
> >
> >>
> >> On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
> >>
> >>> Hi all,
> >>>
> >>> I am looking at an I/O problem that I don't completely understand.
> >>> The setup is that there are 15 servers and 20 clients (all RHEL3
> >>> SMP).  The clients are running a proprietary application.  At the
> >>> end of the run they each write their share of a data set into a 36
> >>> GB file.  So each is doing a contiguous write of 36/20 GB of data.
> >>>
> >>> Roughly 80-90% of the time, one or more of the clients and servers
> >>> hang, though it isn't predictable which one will do it.  After quite
> >>> a bit of digging through logs, it turns out that there is a flow
> >>> that never completes on the server side.  There are two problems
> >>> contributing here:
> >>>
> >>> 1) The flow timeout mechanism doesn't seem to be working right for
> >>> me (the timer for the flow pops and checks progress one time, then
> >>> never gets triggered again).  This isn't the core problem, but it is
> >>> causing the flow to get stuck forever and hold up the request
> >>> scheduler rather than giving up eventually.
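The timeout behavior described here (the progress check fires once and never again) matches a one-shot timer that is not re-armed after expiration. A minimal sketch of the pattern, with invented names (not PVFS2 code):

```c
/* Minimal sketch (invented names, not PVFS2 code) of the re-arming a
 * periodic progress check needs: the handler must re-post itself after
 * each expiration, or it fires exactly once -- the symptom described
 * in the thread. */
struct progress_timer {
    int armed;        /* 1 if the timer is currently posted */
    int fire_count;   /* how many times the check has run */
};

void timer_fire(struct progress_timer *t)
{
    if (!t->armed)
        return;
    t->armed = 0;     /* expiration consumes the one-shot timer */
    t->fire_count++;  /* ... flow progress check would happen here ... */
    t->armed = 1;     /* re-arm; dropping this line reproduces the
                         "pops once, never triggered again" behavior */
}
```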
> >>>
> >>> 2) For this particular flow, the client and server appear to
> >>> disagree on how much data to exchange.  The client sends one 256K
> >>> buffer and completes the flow.  The server receives that 256K buffer
> >>> but still expects at least one more to arrive (which never does).
> >>>
> >>
> >> Hi Phil, these problems sound very similar to the bug I was seeing
> >> about a month ago with the flow code for certain request types.  The
> >> particular case I was looking at was with reads, so given that
> >> there's a lot of code duplication in flow I could imagine the same
> >> problem existing (but not getting fixed) for writes.  It may not be
> >> the same as the problem you're seeing, but I'll walk through the bug
> >> and how I fixed it, in case it helps.
> >>
> >> Basically, the request was ending right at the boundary of a flow
> >> buffer (so bytes == bytemax).  This caused the request processing
> >> code to finish and bmi_send_callback_fn to return, but it would get
> >> called again by the post code for the next buffer; this time the
> >> request processing would return 0 bytes, and that queue item would
> >> get marked as the last one.  That last queue item was the problem:
> >> since trove_bstream_read_list was never getting posted, the queue
> >> item whose last field was set never made it into
> >> trove_read_callback_fn, so the dest_last_posted field for the flow
> >> data wasn't getting set either.
> >>
> >> The fix was to add some extra checking in bmi_send_callback_fn.  In
> >> the case that the request processing returns 0 bytes processed, I can
> >> set the dest_last_posted field to 1 immediately.  The only caveat is
> >> that other queue items in the source list may not have made it to
> >> the dest list yet, so we need to check for that.  The diff of the
> >> changes I made might help:
> >>
> >> http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/
> >> flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
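A condensed sketch of the logic described above, with invented field and function names (the real change lives in flowproto-multiqueue.c; see the diff link):

```c
/* Condensed sketch of the described fix, with invented field and
 * function names (not the actual flowproto-multiqueue.c code).  When
 * request processing consumes 0 bytes, the queue item carrying the
 * "last" flag never reaches the trove read callback, so the send
 * callback must mark the destination side finished itself -- unless
 * earlier source-list items are still in flight. */
#include <stddef.h>

struct flow_state {
    int src_items_pending;  /* source items not yet on the dest list */
    int dest_last_posted;   /* set once the final dest op is posted */
};

void on_send_callback(struct flow_state *flow, size_t bytes_processed)
{
    if (bytes_processed == 0) {
        /* nothing more to post for this flow */
        if (flow->src_items_pending == 0)
            flow->dest_last_posted = 1;
        /* otherwise the last pending item's completion sets it */
    }
}
```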
> >>
> >> In any case, I wasn't able to make use of the debug messages for the
> >> request processing code; they were too verbose to make any sense of.
> >> I think I ended up just adding debug statements for the entry and
> >> exit points of the flow callbacks, along with all the different
> >> side-state fields as they were changed.  I'm not sure those debug
> >> statements made it into the trunk though. :-(
> >>
> >> Since we're here, I think the flow code could benefit from some
> >> reworking in two ways:
> >>
> >> 1.  Fewer side-state variables to set.  Each queue item has fields
> >> that change the behavior of when things should complete.  It's hard
> >> to keep track of all that.
> >>
> >> 2.  More code abstraction.  We end up doing very similar things
> >> between reads and writes on the server (and on the client too),
> >> especially with respect to the request processing.  It might make
> >> sense to abstract out some of that code duplication into separate
> >> functions.
> >>
> >> -sam
> >>
> >>> I am pretty sure I can get to the bottom of the first problem.  The
> >>> second one is trickier.
> >>>
> >>> I attached a few log file excerpts:
> >>>
> >>> client-1st.txt: This is the request processing log output from the
> >>> client side when it is working on the flow in question
> >>>
> >>> server-1st.txt: request processing log output from the first
> >>> iteration of the server side flow
> >>>
> >>> server-2nd.txt: request processing log output from the second
> >>> iteration of the server side flow
> >>>
> >>> Anyone have any suggestions, or see any indication from those log
> >>> portions about why the server and client aren't on the same page?
> >>>
> >>> Some of this log output (the server-2nd.txt in particular?) looks a
> >>> little jumbled, like maybe there are two flows doing request
> >>> processing at once, but I don't really know how to interpret it.
> >>>
> >>> I may try adding a mutex around PINT_process_request() next, just to
> >>> see if I can make the logs clearer by making sure only one
> >>> invocation is going at once.
> >>>
> >>> -Phil
>
> _______________________________________________
> Pvfs2-developers mailing list
> [email protected]
> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
>
>