Hi Hagai,

Sorry I didn't pick up on this before. It looks like you're right. I've attached a patch that I think should fix this race. Can you try it out and let me know if it works for you?

Thanks,

-sam

Attachment: sys-io-race-fix.patch
Description: Binary data



On Jul 18, 2007, at 7:50 AM, Hagai Avrahami wrote:

Hi Sam

When running my application I sometimes hit this assert:

"src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv:
Assertion `ret == 0' failed:"

I tried to reproduce this assert by running one of the tests included in
the pvfs2 package.

I built this setup:

4 pvfs servers with storage space on the same machine, with the client
running on that machine as well.

I made small modifications to io-stress.c:

1. I linked the test module with pvfs-threaded instead of just pvfs.
2. I changed the write block size to 1MB.
3. I run the test in a loop, writing 50MB in each iteration.

Running this test reproduces the problem.

I realized that:

1. This problem seems to be a race between the thread responsible for
   sending the write buffer and the BMI thread.

2. Because everything runs on the same machine, message latency is very
   low and the reply returns very fast.

3. To check this assumption, I ran all 4 pvfs servers on a RAM disk, and
   the assert reproduced in every run.
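
To illustrate the ordering issue in the list above, here is a toy, single-threaded model in C. It is only a sketch: toy_endpoint, toy_deliver_ack and toy_post_recv are hypothetical names, not BMI or job API calls, and the model makes no claim about what the PVFS2 client does internally or about what the attached patch changes. It only shows how a post-style receive could report immediate completion (1) when the reply has already arrived, and deferred completion (0) otherwise.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an endpoint that can hold one acknowledgement. */
struct toy_endpoint {
    bool ack_arrived;   /* an ack was delivered before any receive was posted */
    bool recv_posted;   /* a receive has been posted */
    bool recv_done;     /* the posted receive has completed */
};

/* Called when the (simulated) network delivers the write ack. */
void toy_deliver_ack(struct toy_endpoint *ep)
{
    assert(ep != NULL);
    if (ep->recv_posted && !ep->recv_done) {
        ep->recv_done = true;    /* usual case: completes the waiting receive */
    } else {
        ep->ack_arrived = true;  /* early ack: remember it for a later post */
    }
}

/* Post a receive. Returns 1 on immediate completion, 0 if deferred. */
int toy_post_recv(struct toy_endpoint *ep)
{
    assert(ep != NULL);
    ep->recv_posted = true;
    if (ep->ack_arrived) {
        ep->ack_arrived = false;
        ep->recv_done = true;
        return 1;
    }
    return 0;
}
```

In this model, posting first and delivering afterwards returns 0, the order an assert(ret == 0) expects. Delivering first and posting afterwards returns 1, which is the value that would trip such an assert when replies come back very quickly.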

Appreciate any help
Thanx
Hagai


-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Tuesday, July 03, 2007 3:37 PM
To: Hagai Avrahami
Subject: Re: Question


On Jul 2, 2007, at 12:38 PM, Hagai Avrahami wrote:

Hi Sam

I took io-stress, changed it a little bit, and then tried to write
blocks after EOF, and it all went fine.
So I guess it is something I did wrong in my code....

Do you have any suggestions about what could lead to this assert?

Not without knowing more about the code you've written.  My best
guess would be that you have memory corruption somewhere else and it's
causing a spurious error in your code.  That's unlikely though.
You could try running it in valgrind and see if you get any errors.


Does PVFS2 fill the gaps with 0 (zero)?

Yes.


When I try to write to an offset after EOF, how does PVFS2 know it
has passed EOF (how does it know the size of the file)? As I
understand it, Get Size is an operation that involves querying all IO
servers in the collection.

Yes, each IO server's stripe size is maintained on the IO server.  The
client doesn't need to *know* that you're writing to an offset past
EOF, it just determines the writes that need to be made to each
server.  Determining the size of the file and the end of file are
only important for a read operation.
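
As a rough illustration of how a client can decide where a write goes without knowing the file size, the sketch below maps a logical file offset to a server index and an offset within that server's portion of the file. It assumes plain round-robin striping with a fixed stripe size. The names stripe_target and map_offset are hypothetical, and this is not taken from the PVFS2 distribution code.

```c
#include <assert.h>
#include <stdint.h>

/* Where one logical byte lands under round-robin striping. */
struct stripe_target {
    uint32_t server;        /* index of the IO server, 0 .. nservers-1 */
    uint64_t local_offset;  /* offset within that server's portion of the file */
};

struct stripe_target map_offset(uint64_t offset, uint64_t stripe_size,
                                uint32_t nservers)
{
    assert(stripe_size > 0 && nservers > 0);

    uint64_t stripe_index = offset / stripe_size;  /* which stripe overall */
    uint64_t within = offset % stripe_size;        /* position inside it */

    struct stripe_target t;
    t.server = (uint32_t)(stripe_index % nservers);
    /* Each full round over all servers adds one stripe to every server. */
    t.local_offset = (stripe_index / nservers) * stripe_size + within;
    return t;
}
```

With a 1MB stripe and 4 servers, an offset of 5MB + 10 maps to server 1 at local offset 1MB + 10, whether or not anything was ever written at lower offsets. The mapping needs only the offset, the stripe size and the server count.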

-sam


Thanx
Hagai


-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Sunday, July 01, 2007 8:24 PM
To: Hagai Avrahami
Subject: Re: Question


Hi Hagai,

What you're trying to do should work (writing past EOF).  Do you have
a test program that I could run to reproduce the problem?

-sam

On Jul 1, 2007, at 1:56 PM, Hagai Avrahami wrote:

Hi Sam

I use PVFS_isys_io(....) to write data,
and I use PVFS_sys_testsome(...) to fetch all finished operations.

I am trying to write a file of size 512MB with a stripe size of 1MB.

After every 3 contiguous blocks I write the next one with a gap of 1MB:
Write 1 (offset 0), Write 2 (offset 1MB), Write 3 (offset 2MB),
Write 4 (offset 4MB), Write 5 (offset 5MB), Write 6 (offset 6MB),
Write 7 (offset 8MB), ...
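
The write pattern above (three contiguous 1MB blocks, then a 1MB hole) can be generated with a small helper. This is a sketch for building a reproducer, not code from the original application. gapped_offset is a hypothetical name, and the group size of 3 and the one-block gap are taken from the description above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Offset of the i-th write (0-based) when every `group` contiguous
 * blocks are followed by a hole of one block.
 */
uint64_t gapped_offset(uint64_t i, uint64_t group, uint64_t block_size)
{
    assert(group > 0);
    /* i / group holes have been skipped before write i. */
    return (i + i / group) * block_size;
}
```

With group = 3 and block_size = 1MB, writes 0 through 6 land at 0, 1, 2, 4, 5, 6 and 8 MB, matching the sequence listed above.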

If I write with no gaps I don't get this assert, and with the gaps it
happens every time during the write.

Can't I write to an offset bigger than the size of the file?
I assumed that if I do so, PVFS2 will fill the gap with 0 (zero).
Is this true?

Appreciate your help
Thanx Hagai

-----Original Message-----
From: Sam Lang [mailto:[EMAIL PROTECTED]
Sent: Thursday, June 28, 2007 6:42 PM
To: Hagai Avrahami
Subject: Re: Question


Hagai,

It's not possible to get immediate completion from that particular
bmi_recv, because it's a post of a receive for write completion, but
the post occurs before the write request has been made, so getting a
receive immediately isn't going to happen.

How are you using the IO state machine?

-sam

On Jun 28, 2007, at 7:36 AM, Hagai Avrahami wrote:

Hi Sam



I am getting this error when running the PVFS2 client:

"src/client/sysint/sys-io.sm:1860: io_post_write_ack_recv:
Assertion `ret == 0' failed:"

When I debugged it I found that

ret = job_bmi_recv(
        cur_ctx->msg.svr_addr, cur_ctx->write_ack.encoded_resp_p,
        cur_ctx->write_ack.max_resp_sz, cur_ctx->session_tag,
        BMI_PRE_ALLOC, sm_p, status_user_tag,
        &cur_ctx->write_ack.recv_status, &cur_ctx->write_ack.recv_id,
        pint_client_sm_context, JOB_TIMEOUT_INF);

returns 1 for immediate completion, but io_post_write_ack_recv checks
assert(ret == 0).

Do I understand the situation correctly?



Thanx a lot

Hagai








__________ NOD32 2366 (20070701) Information __________

This message was checked by NOD32 antivirus system.
http://www.eset.com







_______________________________________________
Pvfs2-developers mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-developers
