After some further digging and off-list help from Phil, we determined that the pvfs2_inode structures were not being properly initialized.
PVFS2 uses a slab cache for the pvfs2_inode structures, which allocates and initializes a big chunk of them at once. There is a constructor function (pvfs2_inode_cache_ctor) passed to the kmem_cache_create call that does the initialization up front to cut down on expensive setup time for semaphores and the like later. The constructor only gets called when the cache is created, or again later if the cache needs to be grown. When memory is released back into the cache, none of its contents are cleared before it is handed out again.

In the pvfs2_inode_alloc function there is a call to kmem_cache_alloc to get a pvfs2_inode structure from the slab cache, but the structure is never initialized. It looks like PVFS2 expects the constructor to be called every time a kmem_cache_alloc call is made, because the constructor is the only place the pvfs2_inode structures are cleared. Since they are not initialized, some of the fields, including pinode_flags, are never reset from their previous use. If a pvfs2_inode structure has leftover pinode_flags that indicate an mtime update is required, and that structure is handed out by the cache again, pvfs2_flush_inode does a setattr when the file is released and updates the mtime on the file whether or not it actually needs it.

The same constructor/initialization situation exists for dev_req_alloc and kiocb_alloc; the initializations are made once at cache creation time instead of each time a structure is allocated.

The attached patches clear more fields in the pvfs2_inode structure and directly call the pvfs2_inode_initialize function on each alloc. They also remove the constructor functions for kiocb and dev_req and initialize those structures in their respective alloc functions. There is a patch for 2.6 and another for 2.8.2. The patch for 2.6 makes a few small modifications to be more like 2.8.

Bart.
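For anyone who wants to see the reuse behavior in isolation, here is a minimal userspace sketch of it. This is not kernel code and not the attached patch: the toy_* names, the one-slot cache, and the flag values are all made up for illustration, and the real P_INIT_FLAG / P_MTIME_FLAG definitions in PVFS2 may differ. It only mimics the idea that the constructor runs when an object is first created and not on later allocations.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical flag values for illustration; not the real PVFS2 definitions. */
#define P_INIT_FLAG  0x1u
#define P_MTIME_FLAG 0x2u

/* Stand-in for pvfs2_inode; only the field that matters here. */
struct toy_inode {
    unsigned int pinode_flags;
};

/* One-slot toy cache standing in for a slab cache (assumption, not kernel API). */
struct toy_cache {
    struct toy_inode slot;
    int constructed;                       /* has ctor run on the slot yet? */
    int in_use;
    void (*ctor)(struct toy_inode *);
};

void toy_inode_ctor(struct toy_inode *ino)
{
    ino->pinode_flags = P_INIT_FLAG;
}

struct toy_inode *toy_cache_alloc(struct toy_cache *c)
{
    assert(!c->in_use);
    if (!c->constructed) {                 /* ctor runs only on first creation */
        c->ctor(&c->slot);
        c->constructed = 1;
    }
    c->in_use = 1;
    return &c->slot;
}

void toy_cache_free(struct toy_cache *c, struct toy_inode *ino)
{
    assert(ino == &c->slot);
    c->in_use = 0;                         /* contents are left untouched */
}

/* Alloc path that trusts the constructor, as described in the message. */
struct toy_inode *inode_alloc_buggy(struct toy_cache *c)
{
    return toy_cache_alloc(c);
}

/* Alloc path that re-initializes the object on every allocation. */
struct toy_inode *inode_alloc_fixed(struct toy_cache *c)
{
    struct toy_inode *ino = toy_cache_alloc(c);
    memset(ino, 0, sizeof(*ino));
    ino->pinode_flags = P_INIT_FLAG;
    return ino;
}

/* Flags seen by a second user after the first user set the mtime flag. */
unsigned int flags_after_reuse(struct toy_inode *(*alloc)(struct toy_cache *))
{
    struct toy_cache c = { .ctor = toy_inode_ctor };
    struct toy_inode *first = alloc(&c);
    first->pinode_flags |= P_MTIME_FLAG;
    toy_cache_free(&c, first);

    struct toy_inode *second = alloc(&c);
    unsigned int flags = second->pinode_flags;
    toy_cache_free(&c, second);
    return flags;
}
```

Calling flags_after_reuse(inode_alloc_buggy) returns the stale mtime flag along with the init flag, while flags_after_reuse(inode_alloc_fixed) returns only the init flag, which is the difference the patches are after.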
On Fri, Mar 5, 2010 at 9:20 AM, Bart Taylor <[email protected]> wrote:
>
> After some more digging, I found that pvfs2_clear_inode is being called on
> the inode before the timestamp changes. That call destroys the pvfs2_inode,
> so the next time getattr is called on it, the inode has to be reallocated.
> When the inode gets reallocated, it is initialized with pinode_flags that
> indicate the mtime needs to be set, so when that file is finally released,
> pvfs2_flush_inode calls setattr and updates the mtime.
>
> I modified the pinode_flags to unset the mtime flag in pvfs2_inode_alloc.
> That took care of the problem, but I am not sure what else that will affect.
> I do not see any code in pvfs that is assigning the flags, so I assume it is
> coming from the kernel during the kmem_cache_alloc.
>
> That alloc function returns with just the P_INIT_FLAG every time except for
> the instance where the mtime is getting updated. In that case it also has
> the P_ATIME_FLAG and P_MTIME_FLAG set. Does anyone know why this function
> would sometimes return with more flags set? Could it have something to do
> with a make_bad_inode call?
>
> I should also mention that we have only seen this on 2.4 kernels.
>
> Bart.
>
> On Wed, Feb 24, 2010 at 9:15 AM, Bart Taylor <[email protected]> wrote:
>
>> Actually I managed to trigger the same timestamp change on 2.8.2 this
>> morning. I attached a copy to the job running against that file system and
>> triggering the timestamp change; acache and ncache logging are disabled, but
>> all other logging is enabled.
>>
>> Bart.
>>
>> A reference for looking through the log file:
>>
>> File/Directory                      Handle
>> ==============================================
>> /                                   1048576
>> /small-job/                         715624920
>> /small-job/data_file                2147280687
>> /small-job/temp/                    1431452804
>> /small-job/temp/output_file         1431452797
>> /small-job/temp/output_file.ctl     2147280686
>>
>> On Tue, Feb 23, 2010 at 11:11 PM, Bart Taylor <[email protected]> wrote:
>>
>>> Hey guys,
>>>
>>> We are running into a scenario where modify timestamps are getting
>>> updated when we do not think they should be. We have a single client
>>> accessing a single node file system that is reading an input file (357
>>> bytes) and writing to two output files (~500 bytes) in a subdirectory.
>>> The timestamp is sporadically (one time in 10 or 20 runs) updated on the
>>> input file, but only if the write occurs (on the output file). I tried
>>> removing the portion of the job that writes to the output file and the
>>> timestamps never changes. I also moved the job off of PVFS2 and the
>>> timestamps never changes.
>>>
>>> The file system is a heavily patched version of the 2.6 tree. I ran the
>>> same test on the latest 2.8.2 code and could not replicate the timestamp
>>> change. Unfortunately we cannot upgrade everything to 2.8.2 yet. Does anyone
>>> recall running into this particular problem, or have an idea of what might
>>> be causing it? I have attached a log file from the job with some
>>> explanations below.
>>>
>>> Thanks,
>>>
>>> Bart.
>>>
>>> I turned on verbose client logging and "32767" kernel logging and
>>> captured a run of the job failing. Acache and ncache are disabled. There are
>>> a few extra log messages that log when the SetMtimeFlag() call is made, but
>>> they do not show the flag being set for data_file.
>>>
>>> A reference for looking through the log file:
>>>
>>> File/Directory                      Handle
>>> ===========================================
>>> /                                   1048576
>>> /small-job/                         1047532
>>> /small-job/data_file                1047520
>>> /small-job/temp/                    1047531
>>> /small-job/temp/output_file         1047501
>>> /small-job/temp/output_file.ctl     1047518
kmem-cache-28.patch
Description: Binary data
kmem-cache-26.patch
Description: Binary data
_______________________________________________
Pvfs2-users mailing list
[email protected]
http://www.beowulf-underground.org/mailman/listinfo/pvfs2-users
