I was at least able to identify one bug, I think.  The small
I/O path is used when the amount of data going to the
I/O servers is below the max_unexep_payload.  However, when the request
gets to the server, small-io.sm calls PINT_Process_request() once,
calls job_trove_bstream_write_list() once, and then returns.  If the
number of stream offset-length pairs generated is greater than
SMALL_IO_MAX_REGIONS, the operation won't finish.  This won't show
up in the list I/O path, since we break requests up at 64 ol-pairs; it
shows up in the datatype I/O path, which doesn't get broken up.  You
could probably trigger it in list I/O by just making
SMALL_IO_MAX_REGIONS smaller.

Suggestions for fixing:

1) (Preferred) Loop around the job_trove_bstream_write_list() and
job_trove_bstream_read_list() calls, continuing to move data until the
entire datatype has been satisfied.

2) (Alternative) Make the offset-length pair limit part of the
criteria for using small I/O.

Avery

On Tue, 2006-06-13 at 17:44 -0500, Avery Ching wrote:
> I was able to repeat the bug on the 4 server, 20 client setup you had.  I 
> also made it happen with 1 client and 2 servers.  It seems to work fine with 
> 1 server and 1 client, or 1 server and 20 clients, so this is probably 
> a multi-server issue.  I'll investigate further and let you know the 
> progress.  I hope it's not another one of those PINT_Process_req() or 
> flow-type problems!  
> 
> Avery
> 
> On Mon, 12 Jun 2006, Avery Ching wrote:
> 
> > Yeah, I have.  I am not sure exactly what the problem is, to be honest.  
> > Basically, that error message is just reporting what it got from the 
> > PVFS_sys_write() call, so it could be a lot of things.  The odd 
> > thing, though, is that it seems to happen at random places.  The test works 
> > fine for other sizes and just fails on certain ones.  I'm wondering whether 
> > it's related to the flow or PINT_process_request() problems we've 
> > been seeing on the listserv.  Oddly enough, when I did my IPDPS testing, I 
> > never ran into that issue for writes, only for reads (just sometimes - 
> > hence I had no read results =) ).
> > 
> > Unfortunately, debugging the flow and PINT_process_req() areas is quite 
> > difficult.  I'll try and look into it a bit though.  At least see if I can 
> > repeat the bug.
> > 
> > Avery
> >  
> > I suspect that the write call is not returning the correct amount of data 
> > processed.
> > 
> > On Mon, 12 Jun 2006, Robert Latham wrote:
> > 
> > > Hi Avery
> > > I've got another hpio bug:
> > > 
> > > with 4 servers, 20 clients, hpio ran for a long long time and then
> > > died like this:
> > > write | region_count | c-nc | datatype
> > > ----------------time (seconds)--------------|-bandwidth (MB/s)|---test type---
> > >   open  |   io   |  sync  | close  | total  |   IO   |  IOsyn | region_count
> > >   0.062 |  8.160 |  0.208 |  0.000 |  8.429 |  0.031 |  0.030 | 2048
> > > ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071138032 bytes.
> > > ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071081488 bytes.
> > > 
> > > Seen anything like this before?
> > > ==rob
> > > 
> > > -- 
> > > Rob Latham
> > > Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> > > Argonne National Labs, IL USA                B29D F333 664A 4280 315B
> > > 
> > 

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
