On Fri, Apr 27, 2018 at 11:49 PM, Oliver Freyermuth
<freyerm...@physik.uni-bonn.de> wrote:
> Dear Yan Zheng,
>
> Am 27.04.2018 um 15:32 schrieb Yan, Zheng:
>> On Fri, Apr 27, 2018 at 7:10 PM, Oliver Freyermuth
>> <freyerm...@physik.uni-bonn.de> wrote:
>>> Dear Yan Zheng,
>>>
>>> Am 27.04.2018 um 02:58 schrieb Yan, Zheng:
>>>> On Thu, Apr 26, 2018 at 10:00 PM, Oliver Freyermuth
>>>> <freyerm...@physik.uni-bonn.de> wrote:
>>>>> Dear Cephalopodians,
>>>>>
>>>>> just now, while our Ceph cluster is under high I/O load, we are getting
>>>>> user reports of files not being seen on some clients, but somehow showing
>>>>> up after forcing a stat() syscall.
>>>>>
>>>>> For example, one user had added several files to a directory via an NFS
>>>>> client attached to nfs-ganesha (which uses libcephfs).
>>>>> Afterwards, all other nfs-ganesha servers and 44 of our FUSE clients saw
>>>>> them - but one single client still saw the old contents of the directory,
>>>>> i.e. the files seemed to be missing(!).
>>>>> This happened both when using "ls" on the directory and when trying to
>>>>> access the seemingly non-existent files directly.
>>>>>
>>>>> I could also confirm this observation in a fresh login shell on the
>>>>> machine.
>>>>>
>>>>> Then, on the "broken" client, I entered the directory which seemed to
>>>>> contain only the "old" content, and I created a new file in there.
>>>>> This worked fine, and all other clients saw the file immediately.
>>>>> Also on the broken client, metadata was now updated and all other files 
>>>>> appeared - i.e. everything was "in sync" again.
>>>>>
>>>>> There's nothing in the ceph-logs of our MDS, or in the syslogs of the 
>>>>> client machine / MDS.
>>>>>
>>>>>
>>>>> Another user observed the same issue, though not limited to one
>>>>> particular machine (it seems random).
>>>>> As a workaround, he now runs "stat" on the file he expects to exist (but
>>>>> which is not shown by "ls").
>>>>> The stat returns "No such file", but a subsequent "ls" then lists the
>>>>> file, and it can be accessed normally.
>>>>>
>>>>> This feels like something is messed up concerning the client caps - these 
>>>>> are all 12.2.4 Fuse clients.
>>>>>
>>>>> Any ideas on how to find the cause?
>>>>> It has only started happening recently, and only under high I/O load with
>>>>> many metadata operations.
>>>>>
>>>>
>>>> Sounds like a bug in the readdir cache. Could you try the attached patch?
>>>
>>> Many thanks for the quick response and patch!
>>> The problem is trying it out. We only observe this issue on our production
>>> cluster, randomly, especially during high load, and only after it has been
>>> running for a few days.
>>> We don't have a test Ceph cluster of similar size and under similar load
>>> available, and I would not like to try out the patch on our production
>>> system.
>>>
>>> Can you extrapolate from the bugfix / patch what's the minimal setup needed 
>>> to reproduce / trigger the issue?
>>> Then we may look into setting up a minimal test setup to check whether the 
>>> issue is resolved.
>>>
>>> All the best and many thanks,
>>>         Oliver
>>>
>>
>> I think this is the libcephfs version of
>> http://tracker.ceph.com/issues/20467. I forgot to write the patch for
>> libcephfs, sorry. To reproduce this, write a program that calls
>> getdents(2) in a loop. Add an artificial delay to the loop so that the
>> program iterates over the whole directory in about ten seconds. Run
>> several instances of the program simultaneously on a large directory.
>> Also make client_cache_size a little smaller than the size of the
>> directory.
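
To illustrate, a minimal reproducer along those lines could be sketched as
follows. The mount path /mnt/cephfs/bigdir, the 4 KiB buffer and the 10 ms
per-batch delay are placeholders to be tuned so that one full pass over the
directory takes roughly ten seconds; the sketch rescans the directory in an
endless loop so that several parallel instances keep overlapping.

/*
 * Hypothetical reproducer sketch: repeatedly walk a large CephFS directory
 * with raw getdents64(2) calls, sleeping a little after every batch so that
 * one full pass takes on the order of ten seconds.  Run several instances
 * in parallel on the same directory.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    /* placeholder path; pass the real test directory as the first argument */
    const char *dir = (argc > 1) ? argv[1] : "/mnt/cephfs/bigdir";
    char buf[4096];

    for (;;) {
        int fd = open(dir, O_RDONLY | O_DIRECTORY);
        if (fd < 0) {
            perror("open");
            return 1;
        }
        for (;;) {
            /* raw syscall: glibc does not always provide a getdents wrapper */
            long n = syscall(SYS_getdents64, fd, buf, sizeof(buf));
            if (n <= 0)
                break;          /* end of directory (0) or error (<0) */
            usleep(10 * 1000);  /* artificial per-batch delay, tune as needed */
        }
        close(fd);
    }
}

The client_cache_size option mentioned above would then be set in the [client]
section of ceph.conf on the test clients to a value somewhat smaller than the
number of entries in the directory, e.g. client_cache_size = 1000 for a
directory with a few thousand entries.
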
>
> This is strange - in case 1, where our users observed the issue,
> the affected directory contained exactly 1 file, which some clients saw and
> others did not.
> In case 2, the affected directory contained only about 5 files.
>
> Of course, we also have directories with many (thousands of) files in our
> CephFS, and they may be accessed in parallel.
> Also, we run a massive number of parallel programs (about 2000) accessing the
> FS via about 40 clients.
>
> 1. Could this still be the same issue?
> 2. Many thanks for the repro instructions. It seems, however, that this would
>    require quite some time, since we don't have a separate "test" instance at
>    hand (yet) and are not experts in the field.
>    We could try, but it won't be fast... And maybe it would be nicer to have
>    something like this in the test suite, if possible.
>
> Potentially, it's even faster to get the fix into the next patch release, if
> it's clear that it cannot have bad side effects.
>
> Also, should we transfer this information to a ticket?
>
> Cheers and many thanks,
>         Oliver
>

I found an issue in the code that handles session stale messages. Steps
to reproduce are at http://tracker.ceph.com/issues/23894.

Regards
Yan, Zheng

>>
>> Regards
>> Yan, Zheng
>>
>>>
>>>>
>>>> Regards
>>>> Yan, Zheng
>>>>
>>>>
>>>>> Cheers,
>>>>>         Oliver
>>>>>
>>>>>
>>>>
>>>
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
