The app does a sequential write, using a maximum of NUMBUF (currently
16) buffers for async writes: it posts them one at a time with
pvfs2_aio_flush, and then waits on each buffer with pvfs2_aio_check
before re-using it.
Then it does 'rewindfile' and immediately starts the read. So if the
file were small enough to fit in NUMBUF*BUFFER_SIZE, I could issue a
read for the end of the file before the write for that data had been
retired. But I'm using a big enough file that the PVFS_sys_wait() call
for the write is guaranteed to have completed first. Unless of course I
have some weird logic error in my read-ahead in get_next_full_readbuf.
A very simple test harness with which I just reproduced the problem
(along with the IO shim) is at:
http://www.scl.ameslab.gov/~troy/pvfs/corruption/io.tar.gz
So *if* I have enough printfs that the printing takes longer than the IO
(probably including the RTT to push it over my DSL link from home), I
get this:
R 65010 READFILE 65010
R 65011 READFILE 65011
R 65012 READFILE 65012
get_next_full_readbuf: enter userFilePos 1073741824, read_ahead_offset
1107296256
-- buffer 0x100de290 curbuf 2 buffer_ahead_offset 15 size 0
read_ahead_offset 1107296256
get_next_full_readbuf at end of file, no more bufs to fill
pvfs2_aio_check id 0 buf 2 aio_req 0x10104e78 b->offset 1073741824
b->size 33554432
ERROR pvfs2_aio_check called to PVFS_sys_wait for op_id 321 got error 0
ERROR pvfs2_aio_check: buf 2 offset 1073741824 b->size (33554432) !=
total_completed (29360128)
run 'gdb -p 6888' to debug
run 'gdb -p 6888' to debug
If the file already exists and is big enough, it works just fine.
So there is some race in there that is notoriously timing-sensitive, and
thus has been really erratic for us to reproduce. I suspect it will only
happen on BMI layers that are fully hardware-asynchronous as well.
(aka, !tcp)
Phil Carns wrote:
Could you break down what the app is doing at a slightly higher level
in this time frame? (i.e. how many writes it is posting, how many
reads it is posting, which are concurrent, and when it calls wait for
each.)
From what I can tell, it looks like there are 30 total isys_io's
posted; the first 15 are writes (triggered by pvfs2_aio_flush) and the
last 15 are reads (triggered by pvfs2_aio_fill). It doesn't look like
there are any waits in between the two, though, if I am reading it right.
Are you calling wait() (or some other variant) between the writes and
the reads? The PVFS2 system interface doesn't order any of the
operations, so it might just be that some of your reads happen to be
hitting the server before your writes have put the data there.
This is different from standard POSIX AIO; I think that API
automatically orders every I/O operation, at least at the
file-descriptor level. The PVFS system interface doesn't do anything
like that to prevent I/O operations from getting out of order once they
are posted.
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers