I've done some more digging around and I think the issue is storing the 
offset back into the aio kiocb struct when doing synchronous I/O. In 
newer kernels, sychronous I/O goes through aio code path unlike older 
kernels which had separate paths. In do_aio_read_write() synchronous I/O 
is basically a call to do_readv_writev and a return. I assume the aio 
code path handles the kiocb->ki_pos but it's not getting updated when doing 
sync I/O. So the patch just stores the value of the offset into the 
ki_pos. 

Does the attached patch look like the appropriate place to update the 
ki_pos value? It fixes the corrupted output issue and so far seems like 
the best place to update the ki_pos value but I'm open to other 
solutions.

Thanks,
Michael

On Wed, Jan 12, 2011 at 11:54:10AM -0500, Michael Moore wrote:
> Hi,
> 
> I've been looking at this issue more but haven't been able to come up 
> with a fix. The issue appears to me to be centered around the move from 
> a 2.6.18 kernel to a 2.6.19 kernel or newer. Based on archived pvfs2 
> list messages I saw that the pvfs2_file_writev calls went away in 2.6.19 
> in favor of wrapping them in the aio version of the call 
> (pvfs2_file_aio_write_iovec).
> 
> Just a refresher about the test case:
> The difference in the strace between test1.txt (a 1023 byte write plus a 
> 'last line' string ) and test2.txt (a 1024 byte write plus a 'last line' 
> string) is that test1.txt is a single write() of 1056 bytes while 
> text2.txt is a writev() of 1024 bytes plus a write() of 33 bytes.
> The issue that occurs in kernel 2.6.35 (Fedora 14) is that the offset 
> after the writev call is not 1024, it's 0. So, the write dumps 33 bytes 
> at the beginning of the file.
> 
> Attached are the kernel outputs, with augmented debugging statements, 
> from running the same code on a 2.6.18 (CentOS 5.5 kernel) and a 2.6.35 
> (Fedora 14) kernel.
> 
> Any ideas on what may be keeping the file pointer from being correct? 
> Does it appear that the change between 2.6.18 to 2.6.19 and newer 
> kernels is the culprit?
> 
> Thanks for the help,
> Michael
> 
> On Mon, Dec 20, 2010 at 12:18:02PM -0500, Chris Poultney wrote:
> > Hi Sam-
> > 
> > I've tested on Fedora 12 (kernel 2.6.32.26-175) and Ubuntu 10.10 (kernel 
> > 2.6.35-24) with the same result on both.
> > 
> > Michael Moore has also been helping with this - I haven't been copying 
> > the group as it's been a lot of details. But I will when we figure out 
> > the issue!
> > 
> > -crispy
> > 
> > 
> > On 12/20/2010 11:31 AM, Sam Lang wrote:
> > >
> > > Hi Chris,
> > >
> > > This looks like it might be an issue with writev in the kernel.  Are you 
> > > running RHEL5 by chance?
> > > -sam
> > >
> > > On Dec 15, 2010, at 8:39 AM, Chris Poultney wrote:
> > >
> > >> Sure - it's attached. I ran exactly the code I posted earlier.
> > >>
> > >> -crispy
> > >>
> > >>
> > >> On 12/14/2010 07:25 PM, Sam Lang wrote:
> > >>>
> > >>> Hi Chris,
> > >>>
> > >>> Would you be willing to run your test with strace and post the output?  
> > >>> That will help us figure out what read/write system calls PVFS is 
> > >>> getting from your app.
> > >>>
> > >>> Thanks,
> > >>> -sam
> > >>>
> > >>> On Dec 14, 2010, at 4:29 PM, Chris Poultney wrote:
> > >>>
> > >>>> All-
> > >>>>
> > >>>> I'm having a problem where files written to a PVFS volume overwrite 
> > >>>> the beginning of the file after some output has already been 
> > >>>> generated, as if lseek() had been called mid-write. The overwriting 
> > >>>> behavior does not show up on a regular ext3 volume. Sample code is 
> > >>>> attached. I'd like to know if anyone can duplicate this behavior, 
> > >>>> because it looks like a bug.
> > >>>>
> > >>>> Essentially what happens is this: given a large (1024+ character) 
> > >>>> string, I write some portion of the string using a C++ ostream, then 
> > >>>> write another line:
> > >>>>
> > >>>> out<<   st.substr(0, N);
> > >>>> out<<   "done";
> > >>>>
> > >>>> If N<   1024, everything is fine. If N>= 1024, "done" is written at 
> > >>>> the beginning of the file, overwriting what was there before.
> > >>>>
> > >>>> I'd love to figure this out!
> > >>>>
> > >>>> Cheers,
> > >>>> -crispy
> > >>>>
> > >>>> <pvfstest.cpp>_______________________________________________
> > >>>> Pvfs2-users mailing list
> > >>>> [email protected]
> > >>>> http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
> > >>>
> > >> <strace.out>
> > >
> > _______________________________________________
> > Pvfs2-users mailing list
> > [email protected]
> > http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users

> pvfs2_file_open: called on test2.txt (inode is 768614336404564646)
> pvfs2_file_open returning normally: 0
> do_readv_writev: passed file->f_pos: 0(ffff810124fed8f8), offset 
> 0(ffff810049985f50)
> pvfs2_file_writev: proceeding with offset : 0, size 1024
> pvfs2_file_writev 1...@0
> pvfs2_file_writev: new_nr_segs: 2, seg_count: 1
> Alloced OP (ffff810049a30238: 204 OP_FILE_IO)
> GET op ffff810049a30238 -> buffer_index 0
> pvfs2_file_writev: copy_to_user 1 nr_segs 2, offset: 0 total_size: 1024
> pvfs_bufmap_copy_iovec_from_user: index 0, size 1024
> First character (integer value) in pvfs_bufmap_copy_from_user: 48
> pvfs2: service_operation: pvfs2_file_writev ffff810049a30238
> pvfs2: operation posted by process: a.out, pid: 31936
> client-core: reading op tag 204 OP_FILE_IO
> (get) Alloced OP (ffff810049a30238:204)
> pvfs2: service_operation pvfs2_file_writev returning: 0 for ffff810049a30238.
> wait_for_io returning 1024
> PUT buffer_index 0
> Releasing OP (ffff810049a30238: 204)
> do_readv_writev: offset updated to 1024(ffff810049985f50)
> pvfs2_dirty_inode: 768614336404564646
> do_readv_writev: before return file->f_pos: 1024(ffff810124fed8f8), offset 
> 1024(ffff810049985f50)
> do_readv_writev: passed file->f_pos: 1024(ffff810124fed8f8), offset 
> 1024(ffff810049985f50)
> pvfs2_file_write: proceeding with offset : 1024, size 33
> pvfs2_file_write 3...@1024
> pvfs2_file_write: new_nr_segs: 1, seg_count: 1
> Alloced OP (ffff8100499881f8: 205 OP_FILE_IO)
> GET op ffff8100499881f8 -> buffer_index 0
> pvfs2_file_write: copy_to_user 1 nr_segs 1, offset: 1024 total_size: 33
> pvfs_bufmap_copy_iovec_from_user: index 0, size 33
> First character (integer value) in pvfs_bufmap_copy_from_user: 10
> pvfs2: service_operation: pvfs2_file_write ffff8100499881f8
> pvfs2: operation posted by process: a.out, pid: 31936
> client-core: reading op tag 205 OP_FILE_IO
> (get) Alloced OP (ffff8100499881f8:205)
> pvfs2: service_operation pvfs2_file_write returning: 0 for ffff8100499881f8.
> wait_for_io returning 33
> PUT buffer_index 0
> Releasing OP (ffff8100499881f8: 205)
> do_readv_writev: offset updated to 1057(ffff810049985f50)
> pvfs2_dirty_inode: 768614336404564646
> do_readv_writev: before return file->f_pos: 1057(ffff810124fed8f8), offset 
> 1057(ffff810049985f50)
> pvfs2_file_release: called on test2.txt
> 

> [74675.476920] pvfs2_file_open: called on test2.txt (inode is 1048573)
> [74675.476922] pvfs2_file_open returning normally: 0
> [74675.476964] pvfs2_file_aio_write_iovec: passed iocb->ki_pos: 
> 0(ffff88003debbda0), storing offset: 0(ffff88003debbc70)
> [74675.476968] do_aio_read_write: passed iocb->ki_pos: 0(ffff88003debbda0), 
> filp->f_pos: 0(ffff88003b69cdc0), *offset: 0(ffff88003debbc70)
> [74675.476972] do_readv_writev: passed file->f_pos: 0(ffff88003b69cdc0), 
> offset 0(ffff88003debbc70)
> [74675.476975] pvfs2_file_aio_write_iovec: proceeding with offset : 0, size 
> 1024
> [74675.476978] pvfs2_file_aio_write_iovec 1...@0
> [74675.476980] pvfs2_file_aio_write_iovec: new_nr_segs: 2, seg_count: 1
> [74675.476984] Alloced OP (ffff880035118000: 154 OP_FILE_IO)
> [74675.476988] GET op ffff880035118000 -> buffer_index 0
> [74675.476990] pvfs2_file_aio_write_iovec: copy_to_user 1 nr_segs 2, offset: 
> 0 total_size: 1024
> [74675.476993] pvfs_bufmap_copy_iovec_from_user: index 0, size 1024
> [74675.476997] First character (integer value) in pvfs_bufmap_copy_from_user: 
> 48
> [74675.477000] pvfs2: service_operation: pvfs2_file_aio_write_iovec 
> ffff880035118000
> [74675.477002] pvfs2: operation posted by process: a.out, pid: 5556
> [74675.477772] client-core: reading op tag 154 OP_FILE_IO
> [74675.482042] (get) Alloced OP (ffff880035118000:154)
> [74675.482603] pvfs2: service_operation pvfs2_file_aio_write_iovec returning: 
> 0 for ffff880035118000.
> [74675.482631] wait_for_io returning 1024
> [74675.482681] PUT buffer_index 0
> [74675.482704] NULL pointer in op_release
> [74675.482727] do_readv_writev: offset updated to 1024(ffff88003debbc70)
> [74675.482749] pvfs2_dirty_inode: 1048573
> [74675.482771] do_readv_writev: before return file->f_pos: 
> 1024(ffff88003b69cdc0), offset 1024(ffff88003debbc70)
> [74675.482793] do_aio_read_write: returning iocb->ki_pos: 
> 0(ffff88003debbda0), filp->f_pos: 1024(ffff88003b69cdc0), *offset: 
> 1024(ffff88003debbc70)
> [74675.482840] do_readv_writev: passed file->f_pos: 0(ffff88003b69cdc0), 
> offset 0(ffff88003debbf58)
> [74675.482861] pvfs2_file_write: proceeding with offset : 0, size 33
> [74675.482881] pvfs2_file_write 3...@0
> [74675.482899] pvfs2_file_write: new_nr_segs: 1, seg_count: 1
> [74675.482938] Alloced OP (ffff88003b800000: 155 OP_FILE_IO)
> [74675.482955] GET op ffff88003b800000 -> buffer_index 0
> [74675.482983] pvfs2_file_write: copy_to_user 1 nr_segs 1, offset: 0 
> total_size: 33
> [74675.482999] pvfs_bufmap_copy_iovec_from_user: index 0, size 33
> [74675.483015] First character (integer value) in pvfs_bufmap_copy_from_user: 
> 10
> [74675.483028] pvfs2: service_operation: pvfs2_file_write ffff88003b800000
> [74675.483074] pvfs2: operation posted by process: a.out, pid: 5556
> [74675.483439] Releasing OP (ffff880035118000: 154)
> [74675.483623] client-core: reading op tag 155 OP_FILE_IO
> [74675.488036] (get) Alloced OP (ffff88003b800000:155)
> [74675.488694] pvfs2: service_operation pvfs2_file_write returning: 0 for 
> ffff88003b800000.
> [74675.488727] wait_for_io returning 33
> [74675.488788] PUT buffer_index 0
> [74675.488816] NULL pointer in op_release
> [74675.488844] do_readv_writev: offset updated to 33(ffff88003debbf58)
> [74675.488872] pvfs2_dirty_inode: 1048573
> [74675.488898] do_readv_writev: before return file->f_pos: 
> 33(ffff88003b69cdc0), offset 33(ffff88003debbf58)
> [74675.488930] pvfs2_file_release: called on test2.txt

Index: src/kernel/linux-2.6/file.c
===================================================================
RCS file: /projects/cvsroot/pvfs2/src/kernel/linux-2.6/file.c,v
retrieving revision 1.151.10.4
diff -a -u -p -r1.151.10.4 file.c
--- src/kernel/linux-2.6/file.c 12 Nov 2010 16:07:57 -0000      1.151.10.4
+++ src/kernel/linux-2.6/file.c 13 Jan 2011 02:02:21 -0000
@@ -2684,6 +2684,8 @@ static ssize_t do_aio_read_write(struct 
     if (!rw->async)
     {
         error = do_readv_writev(rw);
+        /* store the offset from the read/write into the kiocb struct */
+        iocb->ki_pos = *offset;
         goto out_error;
     }
     /* Asynchronous I/O */
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users

Reply via email to