On Jun 14, 2006, at 5:05 PM, Avery Ching wrote:
Certainly. I think I was able to identify at least one bug. The small I/O path is selected based on whether the amount of data going to the I/O servers is below max_unexp_payload. However, when the request gets to the server, small-io.sm calls PINT_Process_request() once, then calls job_trove_bstream_write_list() once, and returns. If the number of stream offset-length pairs generated is greater than SMALL_IO_MAX_REGIONS, the operation won't finish. This won't show up in the list I/O path, since we break requests up at 64 ol-pairs, but it does show up in the datatype I/O path, which doesn't get broken up. You could probably trigger it in list I/O by just making SMALL_IO_MAX_REGIONS smaller.

Suggestions for fixing:

1) (Preferred) Loop around the job_trove_bstream_write_list() and job_trove_bstream_read_list() calls to keep moving data until the entire datatype has been satisfied.

2) (Alternative) Make the offset-length pair limit part of the eligibility check for small I/O.
Thanks for debugging this, Avery. For now I went with option #2 since it's easier :-). If you find that small I/O is a big improvement for list I/O, then we can change it to option #1. Can you let me know if this patch fixes the problem for you?
Thanks, -sam
smallio.patch
Description: Binary data
Avery

On Tue, 2006-06-13 at 17:44 -0500, Avery Ching wrote:
I was able to repeat the bug on the 4-server, 20-client setup you had. I also made it happen with 1 client and 2 servers. It seems to work fine with 1 server and 1 client, or 1 server and 20 clients; therefore, this is probably a multi-server issue. I'll investigate further and let you know the progress. I hope it's not another one of those PINT_Process_req() or flow-type problems!

Avery

On Mon, 12 Jun 2006, Avery Ching wrote:
Yeah, I have. I am not sure exactly what the problem is, to be honest. Basically, that error message is just reporting what it got from the PVFS_sys_write() call, so it could be a lot of things. The odd thing, though, is that it seems to happen at random places. The test works fine for other sizes and just fails on certain ones. I'm wondering whether it's related to the flow or PINT_process_request() problems we've been seeing on the listserv. Oddly enough, when I did my IPDPS testing, I never ran into that issue for writes, only for reads (and only sometimes; hence I had no read results =) ). Unfortunately, debugging the flow and PINT_process_req() areas is quite difficult. I'll try to look into it a bit, though, and at least see if I can repeat the bug. I suspect that the write call is not returning the correct amount of data processed.

Avery

On Mon, 12 Jun 2006, Robert Latham wrote:
Hi Avery. I've got another hpio bug: with 4 servers and 20 clients, hpio ran for a long, long time and then died like this:

write | region_count | c-nc | datatype
----------------time (seconds)--------------|-bandwidth (MB/s)|---test type---
 open  |   io   |  sync  | close  | total  |   IO   | IOsyn  | region_count
 0.062 |  8.160 |  0.208 |  0.000 |  8.429 |  0.031 |  0.030 |    2048

ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071138032 bytes.
ADIOI_PVFS2_StridedDtypeIO: Warning - PVFS_sys_read/write returned -1610612737 and completed -4611717612071081488 bytes.

Seen anything like this before?
==rob
--
Rob Latham
Mathematics and Computer Science Division    A215 0178 EA2D B059 8CDF
Argonne National Labs, IL USA                B29D F333 664A 4280 315B
_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
