On Tue, Jan 16, 2018 at 8:50 PM, Andras Pataki
<[email protected]> wrote:
> Dear Cephers,
>
> We've upgraded the back end of our cluster from Jewel (10.2.10) to Luminous
> (12.2.2). The upgrade went smoothly for the most part, except we seem to be
> hitting an issue with cephfs. After a day or two of use, the MDS starts
> complaining about clients failing to respond to cache pressure:
What's the OS, kernel version and fuse version on the hosts where the
clients are running?
There have been some issues with ceph-fuse losing the ability to
properly invalidate cached items when certain updated OS packages were
installed.
Specifically, ceph-fuse checks the kernel version against 3.18.0 to
decide which invalidation method to use, and if your OS has backported
new behaviour to a low-version-numbered kernel, that can confuse it.
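
To see which side of that comparison a given client host falls on, something
like the following could be run on each node. This is only a rough sketch in
Python, not the actual ceph-fuse code: the helper names are made up for
illustration, and it does a purely numeric comparison of the reported release
string against 3.18.0. It cannot tell you whether your distribution has
backported newer behaviour into an older-numbered kernel, which is exactly the
case that causes trouble.

```python
import platform
import re

# Version that ceph-fuse is described as comparing against (from this thread).
THRESHOLD = (3, 18, 0)


def parse_kernel_release(release):
    """Return (major, minor, patch) from a string like '3.10.0-693.el7.x86_64'.

    Hypothetical helper for illustration; the patch level defaults to 0 when
    the release string only carries major.minor.
    """
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = match.groups()
    return int(major), int(minor), int(patch or 0)


def reports_at_least_threshold(release, threshold=THRESHOLD):
    """True if the reported version number is >= threshold (numeric check only)."""
    return parse_kernel_release(release) >= threshold


if __name__ == "__main__":
    release = platform.release()  # same string that `uname -r` prints
    side = ">=" if reports_at_least_threshold(release) else "<"
    print(f"kernel {release} reports a version {side} 3.18.0")
```

If a host reports a version below 3.18.0 but runs a vendor kernel with many
backports, that is a host worth looking at first.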
John
>
> [root@cephmon00 ~]# ceph -s
>   cluster:
>     id:     d7b33135-0940-4e48-8aa6-1d2026597c2f
>     health: HEALTH_WARN
>             1 MDSs have many clients failing to respond to cache pressure
>             noout flag(s) set
>             1 osds down
>
>   services:
>     mon: 3 daemons, quorum cephmon00,cephmon01,cephmon02
>     mgr: cephmon00(active), standbys: cephmon01, cephmon02
>     mds: cephfs-1/1/1 up {0=cephmon00=up:active}, 2 up:standby
>     osd: 2208 osds: 2207 up, 2208 in
>          flags noout
>
>   data:
>     pools:   6 pools, 42496 pgs
>     objects: 919M objects, 3062 TB
>     usage:   9203 TB used, 4618 TB / 13822 TB avail
>     pgs:     42470 active+clean
>              22 active+clean+scrubbing+deep
>              4 active+clean+scrubbing
>
>   io:
>     client: 56122 kB/s rd, 18397 kB/s wr, 84 op/s rd, 101 op/s wr
>
> [root@cephmon00 ~]# ceph health detail
> HEALTH_WARN 1 MDSs have many clients failing to respond to cache pressure;
> noout flag(s) set; 1 osds down
> MDS_CLIENT_RECALL_MANY 1 MDSs have many clients failing to respond to cache
> pressure
> mdscephmon00(mds.0): Many clients (103) failing to respond to cache
> pressureclient_count: 103
> OSDMAP_FLAGS noout flag(s) set
> OSD_DOWN 1 osds down
> osd.1296 (root=root-disk,pod=pod0-disk,host=cephosd008-disk) is down
>
>
> We are using the 12.2.2 fuse client exclusively, on about 350 nodes (of
> which roughly 100 appear not to be responding to cache pressure in this
> log). When this happens, clients also appear quite sluggish (listing
> directories, etc.). After bouncing the MDS, everything returns to normal
> for a while after the failover. Ignore the message about 1 OSD down; that
> corresponds to a failed drive, and all data has been re-replicated since.
>
> We were also using the 12.2.2 fuse client with the Jewel back end before the
> upgrade, and did not see this issue then.
>
> We are running with a larger MDS cache than usual: mds_cache_size is set
> to 4 million. All other MDS configs are the defaults.
>
> Is this a known issue? If not, any hints on how to further diagnose the
> problem?
>
> Andras
>
>
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>