Thanks John. I will back the test down to the simple case of 1 client
without the kernel driver and only running NFS Ganesha, and work forward
till I trip the problem and report my findings.

Eric

On Mon, Jul 13, 2015 at 2:18 AM, John Spray <[email protected]> wrote:

>
>
> On 13/07/2015 04:02, Eric Eastman wrote:
>
>> Hi John,
>>
>> I am seeing this problem with Ceph v9.0.1 with the v4.1 kernel on all
>> nodes.  This system is using 4 Ceph FS client systems. They all have
>> the kernel driver version of CephFS loaded, but none are mounting the
>> file system. All 4 clients are using the libcephfs VFS interface to
>> Ganesha NFS (V2.2.0-2) and Samba (Version 4.3.0pre1-GIT-0791bb0) to
>> share out the Ceph file system.
>>
>> # ceph -s
>>      cluster 6d8aae1e-1125-11e5-a708-001b78e265be
>>       health HEALTH_WARN
>>              4 near full osd(s)
>>              mds0: Client ede-c2-gw01 failing to respond to cache pressure
>>              mds0: Client ede-c2-gw02:cephfs failing to respond to cache
>> pressure
>>              mds0: Client ede-c2-gw03:cephfs failing to respond to cache
>> pressure
>>       monmap e1: 3 mons at
>> {ede-c2-mon01=
>> 10.15.2.121:6789/0,ede-c2-mon02=10.15.2.122:6789/0,ede-c2-mon03=10.15.2.123:6789/0
>> }
>>              election epoch 8, quorum 0,1,2
>> ede-c2-mon01,ede-c2-mon02,ede-c2-mon03
>>       mdsmap e912: 1/1/1 up {0=ede-c2-mds03=up:active}, 2 up:standby
>>       osdmap e272: 8 osds: 8 up, 8 in
>>        pgmap v225264: 832 pgs, 4 pools, 188 GB data, 5173 kobjects
>>              212 GB used, 48715 MB / 263 GB avail
>>                   832 active+clean
>>    client io 1379 kB/s rd, 20653 B/s wr, 98 op/s
>>
>
> It would help if we knew whether it's the kernel clients or the userspace
> clients that are generating the warnings here.  You've probably already
> done this, but I'd get rid of any unused kernel client mounts to simplify
> the situation.
>
> We haven't tested the cache limit enforcement with NFS Ganesha, so there
> is a decent chance that it is broken.  The Ganesha FSAL is doing
> ll_get/ll_put reference counting on inodes, so it seems quite possible that
> its cache is pinning things that we would otherwise be evicting in response
> to cache pressure.  You mention Samba as well; its CephFS VFS module could
> be holding references in the same way.
>
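To make the pinning theory concrete, here is a minimal sketch (not Ceph code, and the names are illustrative) of an LRU-style cache where entries holding an external reference, analogous to Ganesha's ll_get, cannot be evicted, so the cache can sit above its target size no matter how hard it trims:

```python
# Sketch: an LRU cache whose pinned entries survive trimming.
# ll_get/ll_put analogues pin and unpin inodes; trim() can only
# evict entries whose pin refcount has dropped to zero.
from collections import OrderedDict

class PinnedLRU:
    def __init__(self, max_size):
        self.max_size = max_size
        self.entries = OrderedDict()   # ino -> pin refcount

    def get(self, ino):
        # ll_get analogue: look up the inode and take a reference.
        self.entries[ino] = self.entries.get(ino, 0) + 1
        self.entries.move_to_end(ino)

    def put(self, ino):
        # ll_put analogue: drop one reference.
        self.entries[ino] -= 1

    def trim(self):
        # Evict unpinned entries, oldest first, until under the limit.
        for ino in list(self.entries):
            if len(self.entries) <= self.max_size:
                break
            if self.entries[ino] == 0:
                del self.entries[ino]
        return len(self.entries)

cache = PinnedLRU(max_size=2)
for ino in range(5):
    cache.get(ino)           # all five inodes pinned by the "FSAL"
print(cache.trim())          # 5 -- nothing can be evicted
for ino in range(5):
    cache.put(ino)           # references released
print(cache.trim())          # 2 -- now trims down to the limit
```

If Ganesha's own cache holds ll_get references on millions of inodes, the MDS client would be in the first situation: asked to shed cache, but unable to.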
> You can see if the MDS cache is indeed exceeding its limit by looking at
> the output of:
> ceph daemon mds.<daemon id> perf dump mds
>
> ...where the "inodes" value tells you how many are in the cache, vs.
> inode_max.
>
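A small sketch of reading that check programmatically; the JSON sample below is illustrative (not from a real cluster), standing in for what `ceph daemon mds.<daemon id> perf dump mds` would emit:

```python
# Compare the "inodes" counter against "inode_max" from an MDS
# perf dump.  The sample string stands in for real command output.
import json

sample = '{"mds": {"inodes": 523144, "inode_max": 100000}}'

perf = json.loads(sample)["mds"]
over = perf["inodes"] - perf["inode_max"]
if over > 0:
    print("cache over limit by %d inodes" % over)
```

A sustained positive gap here, while the warning is present, points at clients failing to release capabilities.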
> If you can, it would be useful to boil this down to a straightforward test
> case: if you start with a healthy cluster, mount a single ganesha client,
> and do your 5 million file procedure, do you get the warning?  Same for
> samba/kernel mounts -- this is likely to be a client side issue, so we need
> to confirm which client is misbehaving.
>
> Cheers,
> John
>
>
>
>> # cat /proc/version
>> Linux version 4.1.0-040100-generic (kernel@gomeisa) (gcc version 4.6.3
>> (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #201506220235 SMP Mon Jun 22 06:36:19
>> UTC 2015
>>
>> # ceph -v
>> ceph version 9.0.1 (997b3f998d565a744bfefaaf34b08b891f8dbf64)
>>
>> The systems are all running Ubuntu Trusty that has been upgraded to
>> the 4.1 kernel. This is all physical machines and no VMs.  The test
>> run that caused the problem was create and verifying 5 million small
>> files.
>>
>> We have some tools that flag when Ceph is in a WARN state so it would
>> be nice to get rid of this warning.
>>
>> Please let me know what additional information you need.
>>
>> Thanks,
>>
>> Eric
>>
>> On Fri, Jul 10, 2015 at 4:19 AM, 谷枫 <[email protected]> wrote:
>>
>>> Thank you John,
>>> All my servers are Ubuntu 14.04 with the 3.16 kernel.
>>> Not all of the clients show this problem, and the cluster seems to be
>>> functioning well now.
>>> As you say, I will change mds_cache_size from 100000 to 500000 and
>>> test. Thanks again!
>>>
>>> 2015-07-10 17:00 GMT+08:00 John Spray <[email protected]>:
>>>
>>>>
>>>> This is usually caused by use of older kernel clients.  I don't remember
>>>> exactly what version it was fixed in, but iirc we've seen the problem
>>>> with
>>>> 3.14 and seen it go away with 3.18.
>>>>
>>>> If your system is otherwise functioning well, this is not a critical
>>>> error
>>>> -- it just means that the MDS might not be able to fully control its
>>>> memory
>>>> usage (i.e. it can exceed mds_cache_size).
>>>>
>>>> John
>>>>
>>>>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
