Hi Hagai,

Sorry I didn't pick up on this before. It looks like you're right. I've attached a patch that I think should fix this race problem. Can you try it out and let me know if it works for you?

Thanks,
-sam
sys-io-race-fix.patch
Description: Binary data
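
The report quoted below describes a receive-posting call that returns 1 (immediate completion) while the caller asserts that the return value is 0. The attached patch is not reproduced in this archive, so what follows is only a generic C sketch of a caller that tolerates both outcomes. It is not the contents of the patch. The return convention (0 = pending, 1 = completed immediately, negative = error) is taken from the description in the thread, and every name here (post_result, ack_ctx, fake_post_recv, post_write_ack_recv) is hypothetical, not a PVFS2 API.

```c
#include <assert.h>

/* Assumed return convention, modeled on the thread's description:
 * 0 = receive posted, completion arrives later;
 * 1 = receive completed immediately; negative = error. */
enum post_result { POST_PENDING = 0, POST_IMMEDIATE = 1 };

/* Hypothetical per-server context tracking the write acknowledgement. */
struct ack_ctx {
    int recv_posted;   /* set once the receive has been posted */
    int recv_complete; /* set once the acknowledgement has been received */
};

/* Hypothetical stand-in for a receive-posting call (not a PVFS2 function).
 * The flag simulates the ack having already arrived when the post is made. */
static int fake_post_recv(int ack_already_arrived)
{
    return ack_already_arrived ? POST_IMMEDIATE : POST_PENDING;
}

/* Post the receive and record the outcome instead of asserting ret == 0. */
static int post_write_ack_recv(struct ack_ctx *ctx, int ack_already_arrived)
{
    int ret = fake_post_recv(ack_already_arrived);
    if (ret < 0) {
        return ret; /* propagate errors unchanged */
    }
    ctx->recv_posted = 1;
    if (ret == POST_IMMEDIATE) {
        /* The ack arrived before the post returned: mark it complete now,
         * because no later completion event will be delivered for it. */
        ctx->recv_complete = 1;
    }
    return ret;
}
```

A caller would check the return value and skip waiting for a completion event when it is POST_IMMEDIATE. Whether this matches the actual fix depends on the patch, which should be treated as the authoritative change.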
On Jul 18, 2007, at 7:50 AM, Hagai Avrahami wrote:
Hi Sam

When running my application I sometimes get this assert:

"src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv: Assertion `ret == 0' failed:"

I tried to reproduce this assert by running one of the tests included in the pvfs2 package. I built this setup: 4 pvfs servers with storage space on the same machine, and the client runs on this machine as well.

I made small modifications to io-stress.c:
1. I linked the test module with pvfs-threaded instead of pvfs only.
2. I changed the write block size to 1MB.
3. I am running the test in a loop and writing 50MB in each iteration.

Running this test reproduces the problem.

I realized that:
1. This problem seems like a race between the thread responsible for sending the write buffer and the BMI thread.
2. Because everything runs on the same machine and message latency is very low, the reply returns very fast.
3. To make this assumption more clear to me, I ran all 4 pvfs servers on a RAM disk, and the assert reproduced in every run.

Appreciate any help
Thanx
Hagai

-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 03, 2007 3:37 PM
To: Hagai Avrahami
Subject: Re: Question

On Jul 2, 2007, at 12:38 PM, Hagai Avrahami wrote:

> Hi Sam
> I took io-stress, changed it a little bit, and then tried to write blocks after EOF, and it all went well. So I guess it is something I did wrong in my code....
> Maybe you have any suggestions about what can lead to this assert?

Not without knowing more about the code you've written. My best guess would be that you have memory corruption somewhere else and it's causing an erroneous error in your code. That's unlikely though. You could try running it in valgrind and see if you get any errors.

> Does PVFS2 fill the gaps with 0 (zero)?

Yes.

> When I am trying to write to an offset after EOF, how does PVFS2 know it has passed EOF (how does it know the size of the file)? As I understood, Get Size is an operation involving a query of all IO servers in the collection.

Yes, each IO server's stripe size is maintained on the IO server.
The client doesn't need to *know* that you're writing to an offset past EOF; it just determines the writes that need to be made to each server. Determining the size of the file and the end of file are only important for a read operation.

-sam

> Thanx
> Hagai

-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 01, 2007 8:24 PM
To: Hagai Avrahami
Subject: Re: Question

Hi Hagai,

What you're trying to do should work (writing past EOF). Do you have a test program that I could run to reproduce the problem?

-sam

On Jul 1, 2007, at 1:56 PM, Hagai Avrahami wrote:

Hi Sam

I use PVFS_isys_io(....) to write data, and I use PVFS_sys_testsome(...) to fetch all finished operations.

I am trying to write a file with a size of 512MB and a stripe size of 1MB. After every 3 contiguous blocks I write the next one with a gap of 1 MB:

Write 1(Offset - 0), Write 2(Offset 1MB), Write 2(Offset 2MB), Write 2(Offset 4MB), Write 2(Offset 5MB), Write 2(Offset 6MB), Write 2(Offset 8MB),

If I am writing with no gaps I don't get this assert, and with the gaps it happens every time during the write.

Can't I write to an offset bigger than the size of the file? I assumed that if I do so, PVFS2 will fill the gap with 0 (zero). Is this true?

Appreciate your help
Thanx
Hagai

-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 28, 2007 6:42 PM
To: Hagai Avrahami
Subject: Re: Question

Hagai,

It's not possible to get immediate completion from that particular bmi_recv, because it's a post of a receive for write completion, but the post occurs before the write request has been made, so getting a receive immediately isn't going to happen. How are you using the IO state machine?
-sam

On Jun 28, 2007, at 7:36 AM, Hagai Avrahami wrote:

Hi Sam

I am getting this error when I am running the PVFS2 client:

"src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv: Assertion `ret == 0' failed:"

When I debugged it I found that

    ret = job_bmi_recv(
        cur_ctx->msg.svr_addr,
        cur_ctx->write_ack.encoded_resp_p,
        cur_ctx->write_ack.max_resp_sz,
        cur_ctx->session_tag,
        BMI_PRE_ALLOC,
        sm_p,
        status_user_tag,
        &cur_ctx->write_ack.recv_status,
        &cur_ctx->write_ack.recv_id,
        pint_client_sm_context,
        JOB_TIMEOUT_INF);

returns 1 for immediate completion, but in io_post_write_ack_recv there is a check of assert(ret == 0).

Do I understand the situation correctly?

Thanx a lot
Hagai

__________ NOD32 2366 (20070701) Information __________
This message was checked by NOD32 antivirus system.
http://www.eset.com
_______________________________________________ Pvfs2-developers mailing list [email protected] http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
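
For reference, the write pattern described in the thread (1 MB blocks, three contiguous writes followed by a skipped block) and the remark that the client "just determines the writes that need to be made to each server" can be sketched in C as below. This assumes a plain round-robin layout in which strip k lives on server k mod N; the thread does not confirm which distribution was in use, and the helper names (write_offset, server_for_offset) are hypothetical, not PVFS2 functions.

```c
#include <assert.h>
#include <stdint.h>

static const uint64_t MIB = 1024u * 1024u;

/* Offset of the i-th write in the pattern from the thread:
 * three contiguous blocks, then one block skipped.
 * i = 0,1,2,3,4,5,6 gives blocks 0,1,2,4,5,6,8. */
static uint64_t write_offset(unsigned i, uint64_t block_size)
{
    return ((uint64_t)(i / 3) * 4 + (i % 3)) * block_size;
}

/* ASSUMPTION: round-robin striping, strip k stored on server k % nservers.
 * This is an illustrative model, not taken from the PVFS2 sources. */
static unsigned server_for_offset(uint64_t offset, uint64_t strip_size,
                                  unsigned nservers)
{
    assert(strip_size > 0 && nservers > 0);
    return (unsigned)((offset / strip_size) % nservers);
}
```

Under these assumptions, with a 1 MB strip and the 4 servers mentioned for the io-stress setup, the written blocks map to server indices 0, 1, 2, 0, 1, 2, and so on, while every skipped block (3, 7, 11, ...) maps to index 3. This is only arithmetic on the assumed layout and says nothing by itself about the cause of the assert.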
