I think I was at least able to identify one bug. The small I/O path is chosen based on whether the amount of data going to the I/O servers is below max_unexp_payload. However, when the request gets to the server, small-io.sm calls PINT_Process_request() once, calls job_trove_bstream_write_list() once, and then returns. If the number of stream offset-length pairs generated is greater than SMALL_IO_MAX_REGIONS, the operation never finishes. This won't show up in the list I/O path, since we break requests up at 64 ol-pairs, but it does show up in the datatype I/O path, which doesn't get broken up. You could probably trigger it in list I/O by just making SMALL_IO_MAX_REGIONS smaller.
Suggestions for fixing:

1) (Preferred) Loop around the job_trove_bstream_write_list and
   job_trove_bstream_read_list calls to keep moving data until the
   entire datatype has been satisfied.
2) (Alternative) Make the offset-length pair limit part of the
   eligibility check for the small I/O path.

Avery

On Tue, 2006-06-13 at 17:44 -0500, Avery Ching wrote:
> I was able to repeat the bug on the 4 server, 20 client setup you had. I
> also made it happen with 1 client and 2 servers. It seems to work fine with
> 1 server and 1 client, or 1 server and 20 clients; therefore, this is
> probably a multi-server issue. I'll investigate further and let you know
> the progress. I hope it's not another one of those PINT_Process_req() or
> flow type problems!
>
> Avery
>
> On Mon, 12 Jun 2006, Avery Ching wrote:
>
> > Yeah, I have. I'm not sure exactly what the problem is, to be honest.
> > Basically, that error message is just reporting what it got from the
> > PVFS_sys_write() call, so it could be a lot of things. The odd thing,
> > though, is that it seems to happen at random places. The test works
> > fine for other sizes and just fails on certain ones. I'm wondering
> > whether it's related to the flow or PINT_process_request() problems
> > we've been seeing on the listserv. Oddly enough, when I did my IPDPS
> > testing, I never ran into that issue for write, only for read (just
> > sometimes - hence I had no read results =) ).
> >
> > Unfortunately, debugging the flow and PINT_process_req() areas is quite
> > difficult. I'll try to look into it a bit, though, and at least see if
> > I can repeat the bug.
> >
> > Avery
> >
> > I suspect that the write call is not returning the correct amount of
> > data processed.
> >
> > On Mon, 12 Jun 2006, Robert Latham wrote:
> >
> > > Hi Avery
> > > I've got another hpio bug:
> > >
> > > with 4 servers, 20 clients, hpio ran for a long long time and then
> > > died like this:
> > >
> > > write | region_count | c-nc | datatype
> > > ----------------time (seconds)--------------|-bandwidth (MB/s)|---test type---
> > >  open  |   io   |  sync  | close  | total  |   IO   | IOsyn  | region_count
> > >  0.062 |  8.160 |  0.208 |  0.000 |  8.429 |  0.031 |  0.030 | 2048
> > >
> > > ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned
> > > -1610612737 and completed -4611717612071138032 bytes.
> > > ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned
> > > -1610612737 and completed -4611717612071081488 bytes.
> > >
> > > Seen anything like this before?
> > > ==rob
> > >
> > > --
> > > Rob Latham
> > > Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
> > > Argonne National Labs, IL USA                B29D F333 664A 4280 315B

_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
