On Jun 12, 2006, at 8:34 AM, Phil Carns wrote:
Hi all,
I am looking at an I/O problem that I don't completely understand.
The setup is that there are 15 servers and 20 clients (all RHEL3
SMP). The clients are running a proprietary application. At the
end of the run they each write their share of a data set into a 36
GB file. So each is doing a contiguous write of 36/20 GB of data.
Roughly 80-90% of the time, one or more of the clients and servers
hang, though it isn't predictable which one will do it. After quite
a bit of digging through logs, it turns out that there is a flow
that never completes on the server side. There are two problems
contributing here:
1) The flow timeout mechanism doesn't seem to be working right for
me (the timer for the flow pops and checks progress one time, then
never gets triggered again). This isn't the core problem, but it is
causing the flow to get stuck forever and hold up the request
scheduler rather than giving up eventually.
2) For this particular flow, the client and server appear to
disagree on how much data to exchange. The client sends one 256K
buffer and completes the flow. The server receives that 256K buffer
but still expects at least one more to arrive (which never does).
Hi Phil, these problems sound very similar to the bug I was seeing
about a month ago with the flow code for certain request types. The
particular case I was looking at was with reads, so given that
there's a lot of code duplication in flow I could imagine the same
problem existing (but not getting fixed) for writes. It may not be
the same as the problem you're seeing, but I'll walk through the bug
and how I fixed it, in case it helps.
Basically, the request was ending right at the boundary of a flow
buffer (so bytes == bytemax). This caused the request processing code
to finish and bmi_send_callback_fn to return, but it would get called
again by the post code for the next buffer; this time the request
processing would return 0 bytes, and that queue item would get marked
as the last one. That last queue item was the problem: since
trove_bstream_read_list was never getting posted for it, the queue
item whose last field was set never made it into
trove_read_callback_fn, so the dest_last_posted field for the flow
data wasn't getting set either.
The fix was to add some extra checking in bmi_send_callback_fn: in
the case that the request processing returns 0 bytes processed, I can
set the dest_last_posted field to 1 immediately. The only caveat is
that other queue items in the source list may not have made it to the
dest list yet, so we need to check for that case. The diff of the
changes I made might help:
http://www.pvfs.org/cgi-bin/pvfs2/viewcvs/viewcvs.cgi/pvfs2/src/io/flow/flowproto-bmi-trove/flowproto-multiqueue.c.diff?r1=1.105&r2=1.106
In any case, I wasn't able to make use of the debug messages from the
request processing code; they were too verbose to make any sense of.
I think I ended up just adding debug statements for the entry and
exit points of the flow callbacks along with all the different
side-state fields as they were changed. I'm not sure those debug
statements made it into the trunk though. :-(
Since we're here, I think the flow code could benefit from some
reworking in two ways:
1. Fewer side-state variables to be set. Each queue item has fields
that change the behavior of when things should complete. It's hard to
keep track of all that.
2. More code abstraction. We end up doing very similar things
between reads and writes on the server (and on the client too),
especially with respect to the request processing. It might make
sense to abstract out some of that code duplication into separate
functions.
-sam
I am pretty sure I can get to the bottom of the first problem. The
second one is trickier.
I attached a few log file excerpts:
client-1st.txt: This is the request processing log output from the
client side when it is working on the flow in question
server-1st.txt: request processing log output from the first
iteration of the server side flow
server-2nd.txt: request processing log output from the second
iteration of the server side flow
Anyone have any suggestions, or see any indication from those log
portions about why the server and client aren't on the same page?
Some of this log output (the server-2nd.txt in particular?) looks a
little jumbled, like maybe there are two flows doing request
processing at once, but I don't really know how to interpret it.
I may try adding a mutex around PINT_process_request() next, just to
see if I can make the logs clearer by making sure only one
invocation is going at once.
-Phil