After some further digging and off-list help from Phil, we determined that
the pvfs2_inode structures were not being properly initialized.



PVFS2 uses a slab cache for the pvfs2_inode structures, which allocates and
initializes a big chunk of them at once. A constructor function
(pvfs2_inode_cache_ctor) is passed to the kmem_cache_create call and does the
initialization up front, to cut down on expensive setup (semaphores and the
like) later. The constructor is only called when the cache is created, or
again later if the cache needs to be grown. When memory is released back into
the cache, none of its contents are cleared before it is handed out again.
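
To make the reuse behavior concrete, here is a minimal userspace model of a
cache with a constructor. It is only a sketch and not kernel code: every
toy_* name, the pool size, and the flag value are made up for illustration,
and a fixed array stands in for the slab. It shows a flag set by one user of
an object still being set when the same object is handed out again.

```c
/* Userspace model of a slab cache with a constructor; NOT kernel code.
 * All toy_* names are hypothetical and exist only for this illustration. */
#include <assert.h>
#include <stddef.h>

#define TOY_MTIME_FLAG 0x2u   /* stand-in for P_MTIME_FLAG; value is arbitrary */
#define TOY_POOL_SIZE  4

struct toy_inode {
    unsigned int flags;
    int in_use;
};

static struct toy_inode toy_pool[TOY_POOL_SIZE];
static int toy_ctor_calls;

/* Constructor: runs once per object when the pool is populated. */
static void toy_ctor(struct toy_inode *obj)
{
    obj->flags = 0;
    toy_ctor_calls++;
}

/* Models cache creation: constructs every object up front. */
static void toy_cache_create(void)
{
    toy_ctor_calls = 0;
    for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
        toy_pool[i].in_use = 0;
        toy_ctor(&toy_pool[i]);
    }
}

/* Models allocation: hands out a free object without touching its fields. */
static struct toy_inode *toy_cache_alloc(void)
{
    for (size_t i = 0; i < TOY_POOL_SIZE; i++) {
        if (!toy_pool[i].in_use) {
            toy_pool[i].in_use = 1;
            return &toy_pool[i];
        }
    }
    return NULL; /* this toy pool never grows */
}

/* Models free: returns the object to the pool with its contents intact. */
static void toy_cache_free(struct toy_inode *obj)
{
    assert(obj != NULL && obj->in_use);
    obj->in_use = 0;
}

/* Returns the flags seen on a freshly allocated object after a previous
 * user left TOY_MTIME_FLAG set on it. */
static unsigned int toy_stale_flags_demo(void)
{
    toy_cache_create();
    struct toy_inode *a = toy_cache_alloc();
    a->flags |= TOY_MTIME_FLAG;              /* first user marks mtime dirty */
    toy_cache_free(a);                       /* contents are not cleared */
    struct toy_inode *b = toy_cache_alloc(); /* same slot, ctor not re-run */
    return b->flags;
}
```

In this model toy_stale_flags_demo() returns the flag left behind by the
first user, even though the constructor zeroed it when the pool was built.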



In the pvfs2_inode_alloc function there is a call to kmem_cache_alloc to get
a pvfs2_inode structure from the slab cache, but it is never initialized. It
looks like PVFS2 expects the constructor to be called every time a
kmem_cache_alloc call is made, because that is the only place the
pvfs2_inode structures are cleared. Since they are not initialized, some of
the fields - including the pinode_flags - are never reset from their
previous use. If a pvfs2_inode structure has leftover pinode_flags that
indicate an mtime update is required, and that structure is handed out by the
cache again, pvfs2_flush_inode does a setattr when the file is released,
updating the mtime on a file that may or may not actually need it.



The same constructor/initialization situation exists for dev_req_alloc and
kiocb_alloc; the initializations are made once at cache creation time
instead of each time a structure is allocated.



The attached patches clear more fields in the pvfs2_inode structure and
directly call the pvfs2_inode_initialize function for each alloc. They also
remove the constructor functions for kiocb and dev_req and initialize those
structures in their respective alloc functions instead. There is a patch for
2.6 and another for 2.8.2. The patch for 2.6 makes a few small modifications
to be more like 2.8.
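
For anyone who does not want to open the patches, the general pattern is
roughly the following. This is a userspace sketch of "initialize on every
alloc", not the patch itself: the fix_* names, the one-slot cache, and the
flag values are all hypothetical stand-ins.

```c
/* Userspace sketch of "initialize on every alloc"; NOT the attached patch.
 * All fix_* names are hypothetical. */
#include <stddef.h>
#include <string.h>

#define FIX_INIT_FLAG  0x1u  /* stand-ins for P_INIT_FLAG and P_MTIME_FLAG; */
#define FIX_MTIME_FLAG 0x4u  /* the values are chosen arbitrarily */

struct fix_inode {
    unsigned int flags;
    long         handle;
};

/* One-slot "cache" that recycles memory without clearing it. */
static struct fix_inode fix_slot;
static int fix_slot_busy;

static struct fix_inode *fix_raw_alloc(void)
{
    if (fix_slot_busy)
        return NULL;
    fix_slot_busy = 1;
    return &fix_slot;
}

static void fix_free(struct fix_inode *obj)
{
    if (obj == &fix_slot)
        fix_slot_busy = 0;   /* contents left as-is, like a slab cache */
}

/* Per-allocation initialization: reset every field to a known state. */
static void fix_inode_initialize(struct fix_inode *obj)
{
    memset(obj, 0, sizeof(*obj));
    obj->flags = FIX_INIT_FLAG;
}

/* Alloc wrapper: never hands out an object it has not just initialized. */
static struct fix_inode *fix_inode_alloc(void)
{
    struct fix_inode *obj = fix_raw_alloc();
    if (obj != NULL)
        fix_inode_initialize(obj);
    return obj;
}

/* Returns the flags seen on a recycled object. */
static unsigned int fix_recycled_flags(void)
{
    struct fix_inode *a = fix_inode_alloc();
    a->flags |= FIX_MTIME_FLAG;   /* first user marks mtime dirty */
    fix_free(a);
    struct fix_inode *b = fix_inode_alloc();
    unsigned int seen = b->flags; /* reset by fix_inode_initialize */
    fix_free(b);
    return seen;
}
```

With the initialization moved into the alloc path, the recycled object comes
back with only the init flag set, regardless of what its previous user left
behind.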

Bart.




On Fri, Mar 5, 2010 at 9:20 AM, Bart Taylor <[email protected]> wrote:

>
> After some more digging, I found that pvfs2_clear_inode is being called on
> the inode before the timestamp changes. That call destroys the pvfs2_inode,
> so the next time getattr is called on it, the inode has to be reallocated.
> When the inode gets reallocated, it is initialized with pinode_flags that
> indicate the mtime needs to be set, so when that file is finally released,
> pvfs2_flush_inode calls setattr and updates the mtime.
>
> I modified the pinode_flags to unset the mtime flag in pvfs2_inode_alloc.
> That took care of the problem, but I am not sure what else that will affect.
> I do not see any code in pvfs that is assigning the flags, so I assume it is
> coming from the kernel during the kmem_cache_alloc.
>
> That alloc function returns with just the P_INIT_FLAG every time except for
> the instance where the mtime is getting updated. In that case it also has
> the P_ATIME_FLAG and P_MTIME_FLAG set. Does anyone know why this function
> would sometimes return with more flags set? Could it have something to do
> with a make_bad_inode call?
>
> I should also mention that we have only seen this on 2.4 kernels.
>
> Bart.
>
>
>
>
>
>
>
>
>
>
> On Wed, Feb 24, 2010 at 9:15 AM, Bart Taylor <[email protected]> wrote:
>
>> Actually, I managed to trigger the same timestamp change on 2.8.2 this
>> morning. I attached a copy of the log from the job running against that
>> file system and triggering the timestamp change; acache and ncache logging
>> are disabled, but all other logging is enabled.
>>
>> Bart.
>>
>>
>>
>>
>> A reference for looking through the log file:
>>
>> File/Directory                    Handle
>> ===========================================
>> /                                 1048576
>> /small-job/                       715624920
>> /small-job/data_file              2147280687
>> /small-job/temp/                  1431452804
>> /small-job/temp/output_file       1431452797
>> /small-job/temp/output_file.ctl   2147280686
>>
>>
>>
>>
>>
>> On Tue, Feb 23, 2010 at 11:11 PM, Bart Taylor <[email protected]> wrote:
>>
>>> Hey guys,
>>>
>>>
>>>
>>> We are running into a scenario where modify timestamps are getting
>>> updated when we do not think they should be. We have a single client
>>> accessing a single-node file system; the job reads an input file (357
>>> bytes) and writes to two output files (~500 bytes) in a subdirectory.
>>> The timestamp is sporadically (one time in 10 or 20 runs) updated on the
>>> input file, but only if the write occurs (on the output file). I tried
>>> removing the portion of the job that writes to the output file, and the
>>> timestamp never changed. I also moved the job off of PVFS2, and the
>>> timestamp never changed.
>>>
>>>
>>>
>>> The file system is a heavily patched version of the 2.6 tree. I ran the
>>> same test on the latest 2.8.2 code and could not replicate the timestamp
>>> change. Unfortunately we cannot upgrade everything to 2.8.2 yet. Does anyone
>>> recall running into this particular problem, or have an idea of what might
>>> be causing it? I have attached a log file from the job with some
>>> explanations below.
>>>
>>>
>>>
>>> Thanks,
>>>
>>> Bart.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> I turned on verbose client logging and "32767" kernel logging and
>>> captured a run of the job failing. Acache and ncache are disabled. There are
>>> a few extra log messages that log when the SetMtimeFlag() call is made, but
>>> they do not show the flag being set for data_file.
>>>
>>>
>>>
>>> A reference for looking through the log file:
>>>
>>>
>>>
>>> File/Directory                    Handle
>>> ========================================
>>> /                                 1048576
>>> /small-job/                       1047532
>>> /small-job/data_file              1047520
>>> /small-job/temp/                  1047531
>>> /small-job/temp/output_file       1047501
>>> /small-job/temp/output_file.ctl   1047518
>>>
>>
>>
>

Attachment: kmem-cache-28.patch
Description: Binary data

Attachment: kmem-cache-26.patch
Description: Binary data

