Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
Original Message:
>
>
> On 7/25/19 7:49 AM, Sangwhan Moon wrote:
> > Hello,
> >
> > Original Message:
> >>
> >>
> >> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> >>> Hello,
> >>>
> >>> I've inherited a Ceph cluster from someone who has left zero
> >>> documentation or any handover. A couple days ago it decided to show the
> >>> entire company what it is capable of..
> >>>
> >>> The health report looks like this:
> >>>
> >>> [root@host mnt]# ceph -s
> >>>   cluster:
> >>>     id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
> >>>     health: HEALTH_ERR
> >>>             1 MDSs report damaged metadata
> >>>             1 MDSs are read only
> >>>             2 MDSs report slow requests
> >>>             6 MDSs behind on trimming
> >>>             Reduced data availability: 2 pgs stale
> >>>             Degraded data redundancy: 2593/186803520 objects degraded
> >>>             (0.001%), 2 pgs degraded, 2 pgs undersized
> >>>             1 slow requests are blocked > 32 sec. Implicated osds
> >>>             716 stuck requests are blocked > 4096 sec. Implicated osds
> >>>             25,31,38
> >>
> >> I would start here:
> >>
> >>>
> >>>   services:
> >>>     mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
> >>>     mgr: a(active)
> >>>     mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
> >>>     osd: 39 osds: 39 up, 38 in
> >>>
> >>>   data:
> >>>     pools:   5 pools, 706 pgs
> >>>     objects: 91212k objects, 4415 GB
> >>>     usage:   10415 GB used, 13024 GB / 23439 GB avail
> >>>     pgs:     2593/186803520 objects degraded (0.001%)
> >>>              703 active+clean
> >>>              2   stale+active+undersized+degraded
> >>
> >> This is a problem! Can you check:
> >>
> >> $ ceph pg dump_stuck
> >>
> >> The PGs will start with a number like 8.1a where '8' it the pool ID.
> >>
> >> Then check:
> >>
> >> $ ceph df
> >>
> >> To which pools to those PGs belong?
> >>
> >> Then check:
> >>
> >> $ ceph pg query
> >>
> >> And the bottom somewhere should show why these PGs are not active. You
> >> might even want to try a restart of these OSDs involved with those two PGs.
> >
> > Thanks a lot for the suggestions - I just checked and it says that the
> > problematic PGs are 4.4f and 4.59 - but querying those seem result in the
> > following error:
> >
> > Error ENOENT: i don't have pgid 4.4f
> >
> > (same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)
> >
> > In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY ACTING,
> > ACTING_PRIMARY all only have one OSD associated with it. (24, 13 - although
> > both the PG ID mentioned above and these numbers probably don't help much
> > with the diagnosis) Should restarting be a safe thing to try first?
> >
> > ceph health detail says the following:
> >
> > MDS_DAMAGE 1 MDSs report damaged metadata
> >     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
> > MDS_READ_ONLY 1 MDSs are read only
> >     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
> > MDS_SLOW_REQUEST 2 MDSs report slow requests
> >     mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
> >     mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 sec
> > MDS_TRIM 6 MDSs behind on trimming
> >     mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
> >     mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
> >     mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
> >     mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
> >     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
> >     mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
> > PG_AVAILABILITY Reduced data availability: 2 pgs stale
> >     pg 4.4f is stuck stale for 171783.855465, current state stale+active+undersized+degraded, last acting [24]
> >     pg 4.59 is stuck stale for 171751.961506, current state stale+active+undersized+degraded, last acting [13]
> > PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded (0.001%), 2 pgs degraded, 2 pgs undersized
> >     pg 4.4f is stuck undersized for 171797.245359, current state stale+active+undersized+degraded, last acting [24]
> >     pg 4.59 is stuck undersized for 171797.257707, current state stale+active+undersized+degraded, last acting [13]
>
> So where are osd.24 and osd.13?
>
> To which pool do these PGs belong?
>
> But these PGs are probably the root-cause of all the issues you are seeing.
> Both
Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
On 7/25/19 7:49 AM, Sangwhan Moon wrote:
> Hello,
>
> Original Message:
>>
>>
>> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
>>> Hello,
>>>
>>> I've inherited a Ceph cluster from someone who has left zero documentation
>>> or any handover. A couple days ago it decided to show the entire company
>>> what it is capable of..
>>>
>>> The health report looks like this:
>>>
>>> [root@host mnt]# ceph -s
>>>   cluster:
>>>     id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
>>>     health: HEALTH_ERR
>>>             1 MDSs report damaged metadata
>>>             1 MDSs are read only
>>>             2 MDSs report slow requests
>>>             6 MDSs behind on trimming
>>>             Reduced data availability: 2 pgs stale
>>>             Degraded data redundancy: 2593/186803520 objects degraded
>>>             (0.001%), 2 pgs degraded, 2 pgs undersized
>>>             1 slow requests are blocked > 32 sec. Implicated osds
>>>             716 stuck requests are blocked > 4096 sec. Implicated osds
>>>             25,31,38
>>
>> I would start here:
>>
>>>
>>>   services:
>>>     mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
>>>     mgr: a(active)
>>>     mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
>>>     osd: 39 osds: 39 up, 38 in
>>>
>>>   data:
>>>     pools:   5 pools, 706 pgs
>>>     objects: 91212k objects, 4415 GB
>>>     usage:   10415 GB used, 13024 GB / 23439 GB avail
>>>     pgs:     2593/186803520 objects degraded (0.001%)
>>>              703 active+clean
>>>              2   stale+active+undersized+degraded
>>
>> This is a problem! Can you check:
>>
>> $ ceph pg dump_stuck
>>
>> The PGs will start with a number like 8.1a where '8' it the pool ID.
>>
>> Then check:
>>
>> $ ceph df
>>
>> To which pools to those PGs belong?
>>
>> Then check:
>>
>> $ ceph pg query
>>
>> And the bottom somewhere should show why these PGs are not active. You
>> might even want to try a restart of these OSDs involved with those two PGs.
>
> Thanks a lot for the suggestions - I just checked and it says that the
> problematic PGs are 4.4f and 4.59 - but querying those seem result in the
> following error:
>
> Error ENOENT: i don't have pgid 4.4f
>
> (same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)
>
> In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY ACTING,
> ACTING_PRIMARY all only have one OSD associated with it. (24, 13 - although
> both the PG ID mentioned above and these numbers probably don't help much
> with the diagnosis) Should restarting be a safe thing to try first?
>
> ceph health detail says the following:
>
> MDS_DAMAGE 1 MDSs report damaged metadata
>     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
> MDS_READ_ONLY 1 MDSs are read only
>     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
> MDS_SLOW_REQUEST 2 MDSs report slow requests
>     mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
>     mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 sec
> MDS_TRIM 6 MDSs behind on trimming
>     mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
>     mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
>     mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
>     mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
>     mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
>     mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
> PG_AVAILABILITY Reduced data availability: 2 pgs stale
>     pg 4.4f is stuck stale for 171783.855465, current state stale+active+undersized+degraded, last acting [24]
>     pg 4.59 is stuck stale for 171751.961506, current state stale+active+undersized+degraded, last acting [13]
> PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded (0.001%), 2 pgs degraded, 2 pgs undersized
>     pg 4.4f is stuck undersized for 171797.245359, current state stale+active+undersized+degraded, last acting [24]
>     pg 4.59 is stuck undersized for 171797.257707, current state stale+active+undersized+degraded, last acting [13]

So where are osd.24 and osd.13?

To which pool do these PGs belong?

But these PGs are probably the root-cause of all the issues you are seeing.

Wido

> REQUEST_SLOW 3 slow requests are blocked > 32 sec. Implicated osds
>     3 ops are blocked > 2097.15 sec
> REQUEST_STUCK 717 stuck requests are blocked > 4096 sec. Implicated osds
> 25,31,38
>     286 ops are blocked > 268435 sec
>     211 ops are blocked > 134218 sec
>
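A sketch of how the two questions above can be answered with the standard CLI, assuming a Luminous-era ceph client (the OSD and PG IDs are the ones from this thread, nothing here is from the original mail):

$ ceph osd find 24            # reports the host and CRUSH location of osd.24
$ ceph osd find 13            # same for osd.13
$ ceph pg map 4.4f            # the up/acting sets the monitors currently expect for this PG
$ ceph osd pool ls detail     # lists every pool with its numeric ID; the pool with ID 4 owns PGs 4.4f and 4.59
$ ceph df                     # per-pool usage, to see what actually lives in that pool

If "ceph osd find" points at hosts that are gone, or at OSDs that no longer exist, that would explain why "ceph pg query" returns ENOENT for these PGs.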
Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
Original Message:
> On Thu, 25 Jul 2019 13:49:22 +0900 Sangwhan Moon wrote:
>
>>     osd: 39 osds: 39 up, 38 in
>
> You might want to find that out OSD.

Thanks, I've identified the OSD and put it back in - doesn't seem to change
anything though. :(

Sangwhan
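For reference, marking an out OSD back in and checking whether the stale PGs react typically looks something like the sketch below; the <osd-id> placeholder stands for whichever OSD had been marked out, and whether this helps at all depends on whether that OSD ever held the two PGs in question:

$ ceph osd in <osd-id>        # put the OSD back into the data distribution
$ ceph -s                     # should now report "39 osds: 39 up, 39 in"
$ ceph pg dump_stuck stale    # check whether pg 4.4f and 4.59 are still listed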
Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
On Thu, 25 Jul 2019 13:49:22 +0900 Sangwhan Moon wrote:

>     osd: 39 osds: 39 up, 38 in

You might want to find that "out" OSD.

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Mobile Inc.
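A sketch of how to spot which of the 39 OSDs is the one marked out (standard CLI, not from the original mail):

$ ceph osd tree               # an OSD that has been marked out shows 0 in the REWEIGHT column
$ ceph osd dump | grep osd.   # the osdmap also lists each OSD's up/down and in/out state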
Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
Hello,

Original Message:
>
>
> On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> > Hello,
> >
> > I've inherited a Ceph cluster from someone who has left zero documentation
> > or any handover. A couple days ago it decided to show the entire company
> > what it is capable of..
> >
> > The health report looks like this:
> >
> > [root@host mnt]# ceph -s
> >   cluster:
> >     id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
> >     health: HEALTH_ERR
> >             1 MDSs report damaged metadata
> >             1 MDSs are read only
> >             2 MDSs report slow requests
> >             6 MDSs behind on trimming
> >             Reduced data availability: 2 pgs stale
> >             Degraded data redundancy: 2593/186803520 objects degraded
> >             (0.001%), 2 pgs degraded, 2 pgs undersized
> >             1 slow requests are blocked > 32 sec. Implicated osds
> >             716 stuck requests are blocked > 4096 sec. Implicated osds
> >             25,31,38
>
> I would start here:
>
> >
> >   services:
> >     mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
> >     mgr: a(active)
> >     mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
> >     osd: 39 osds: 39 up, 38 in
> >
> >   data:
> >     pools:   5 pools, 706 pgs
> >     objects: 91212k objects, 4415 GB
> >     usage:   10415 GB used, 13024 GB / 23439 GB avail
> >     pgs:     2593/186803520 objects degraded (0.001%)
> >              703 active+clean
> >              2   stale+active+undersized+degraded
>
> This is a problem! Can you check:
>
> $ ceph pg dump_stuck
>
> The PGs will start with a number like 8.1a where '8' it the pool ID.
>
> Then check:
>
> $ ceph df
>
> To which pools to those PGs belong?
>
> Then check:
>
> $ ceph pg query
>
> And the bottom somewhere should show why these PGs are not active. You
> might even want to try a restart of these OSDs involved with those two PGs.

Thanks a lot for the suggestions - I just checked and it says that the
problematic PGs are 4.4f and 4.59 - but querying those seems to result in the
following error:

Error ENOENT: i don't have pgid 4.4f

(same applies for 4.59 - they do seem to show up in "ceph pg ls" though.)

In ceph pg ls, it shows that for these PGs UP, UP_PRIMARY, ACTING, and
ACTING_PRIMARY all only have one OSD associated with it. (24, 13 - although
both the PG ID mentioned above and these numbers probably don't help much
with the diagnosis) Should restarting be a safe thing to try first?

ceph health detail says the following:

MDS_DAMAGE 1 MDSs report damaged metadata
    mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
    mdsceph-fs-5b997cbf7b-5tjwh(mds.0): MDS in read-only mode
MDS_SLOW_REQUEST 2 MDSs report slow requests
    mdsuser-fs-5668c75f9f-hflps(mds.0): 3 slow requests are blocked > 30 sec
    mdsuser-fs-5668c75f9f-jf59x(mds.1): 980 slow requests are blocked > 30 sec
MDS_TRIM 6 MDSs behind on trimming
    mdsuser-fs-5668c75f9f-hflps(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
    mdsuser-fs-5668c75f9f-jf59x(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
    mdsuser-fs-5668c75f9f-h8p2t(mds.0): Behind on trimming (342/128) max_segments: 128, num_segments: 342
    mdsuser-fs-5668c75f9f-7gs67(mds.1): Behind on trimming (461/128) max_segments: 128, num_segments: 461
    mdsceph-fs-5b997cbf7b-5tjwh(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
    mdsceph-fs-5b997cbf7b-hmrxr(mds.0): Behind on trimming (386/128) max_segments: 128, num_segments: 386
PG_AVAILABILITY Reduced data availability: 2 pgs stale
    pg 4.4f is stuck stale for 171783.855465, current state stale+active+undersized+degraded, last acting [24]
    pg 4.59 is stuck stale for 171751.961506, current state stale+active+undersized+degraded, last acting [13]
PG_DEGRADED Degraded data redundancy: 2593/186805106 objects degraded (0.001%), 2 pgs degraded, 2 pgs undersized
    pg 4.4f is stuck undersized for 171797.245359, current state stale+active+undersized+degraded, last acting [24]
    pg 4.59 is stuck undersized for 171797.257707, current state stale+active+undersized+degraded, last acting [13]
REQUEST_SLOW 3 slow requests are blocked > 32 sec. Implicated osds
    3 ops are blocked > 2097.15 sec
REQUEST_STUCK 717 stuck requests are blocked > 4096 sec. Implicated osds 25,31,38
    286 ops are blocked > 268435 sec
    211 ops are blocked > 134218 sec
    5 ops are blocked > 67108.9 sec
    2 ops are blocked > 33554.4 sec
    134 ops are blocked > 16777.2 sec
    79 ops are blocked > 8388.61 sec
    osds 25,31,38 have stuck requests > 268435 sec

Cheers,
Sangwhan

>
> Wido
>
> > 1 active+clean+scrubbing+deep
> >
> >   io:
> >     client: 168
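The REQUEST_STUCK entry names osd.25, osd.31 and osd.38, so one avenue (a hedged sketch, not from the original mail) is to ask those OSDs what their oldest ops are waiting on via the admin socket. Since this is a Rook deployment, "ceph daemon" has to run where the OSD's admin socket lives, i.e. inside (or via exec into) the corresponding OSD pod:

$ ceph osd find 25                        # locate the host/pod running osd.25 (likewise for 31 and 38)
$ ceph daemon osd.25 dump_blocked_ops     # the ops blocked longest, with the event they are stuck on
$ ceph daemon osd.25 dump_ops_in_flight   # everything currently in flight on that OSD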
Re: [ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
On 7/25/19 6:49 AM, Sangwhan Moon wrote:
> Hello,
>
> I've inherited a Ceph cluster from someone who has left zero documentation or
> any handover. A couple days ago it decided to show the entire company what it
> is capable of..
>
> The health report looks like this:
>
> [root@host mnt]# ceph -s
>   cluster:
>     id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
>     health: HEALTH_ERR
>             1 MDSs report damaged metadata
>             1 MDSs are read only
>             2 MDSs report slow requests
>             6 MDSs behind on trimming
>             Reduced data availability: 2 pgs stale
>             Degraded data redundancy: 2593/186803520 objects degraded
>             (0.001%), 2 pgs degraded, 2 pgs undersized
>             1 slow requests are blocked > 32 sec. Implicated osds
>             716 stuck requests are blocked > 4096 sec. Implicated osds
>             25,31,38

I would start here:

>
>   services:
>     mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
>     mgr: a(active)
>     mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
>     osd: 39 osds: 39 up, 38 in
>
>   data:
>     pools:   5 pools, 706 pgs
>     objects: 91212k objects, 4415 GB
>     usage:   10415 GB used, 13024 GB / 23439 GB avail
>     pgs:     2593/186803520 objects degraded (0.001%)
>              703 active+clean
>              2   stale+active+undersized+degraded

This is a problem! Can you check:

$ ceph pg dump_stuck

The PGs will start with a number like 8.1a where '8' is the pool ID.

Then check:

$ ceph df

To which pools do those PGs belong?

Then check:

$ ceph pg query

And the bottom somewhere should show why these PGs are not active. You
might even want to try a restart of these OSDs involved with those two PGs.

Wido

>              1   active+clean+scrubbing+deep
>
>   io:
>     client: 168 kB/s rd, 6336 B/s wr, 10 op/s rd, 1 op/s wr
>
> The offending broken MDS entry (damaged metadata) seems to be this:
>
> mds.ceph-fs-5b997cbf7b-5tjwh: [
>     {
>         "damage_type": "dir_frag",
>         "id": 1190692215,
>         "ino": 2199023258131,
>         "frag": "*",
>         "path": "/f/01/59"
>     }
> ]
>
> Is there any idea how I can diagnose and find out what is wrong? For the
> other issues I'm not even sure what/where I need to look into.
>
> Cheers,
> Sangwhan
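Spelled out, the three checks suggested above look roughly like the following sketch; 8.1a is only Wido's example PG ID, and "ceph pg query" needs the PG ID as an argument:

$ ceph pg dump_stuck stale    # dump_stuck also accepts inactive, unclean, undersized, degraded
$ ceph df                     # the POOLS section lists each pool with its numeric ID; PG 8.1a would belong to the pool with ID 8
$ ceph pg 8.1a query          # the recovery_state section near the bottom explains why the PG is not active+clean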
[ceph-users] HEALTH_ERR with a kitchen sink of problems: MDS damaged, readonly, and so forth
Hello,

I've inherited a Ceph cluster from someone who has left zero documentation or
any handover. A couple days ago it decided to show the entire company what it
is capable of..

The health report looks like this:

[root@host mnt]# ceph -s
  cluster:
    id:     809718aa-3eac-4664-b8fa-38c46cdbfdab
    health: HEALTH_ERR
            1 MDSs report damaged metadata
            1 MDSs are read only
            2 MDSs report slow requests
            6 MDSs behind on trimming
            Reduced data availability: 2 pgs stale
            Degraded data redundancy: 2593/186803520 objects degraded (0.001%), 2 pgs degraded, 2 pgs undersized
            1 slow requests are blocked > 32 sec. Implicated osds
            716 stuck requests are blocked > 4096 sec. Implicated osds 25,31,38

  services:
    mon: 3 daemons, quorum f,rook-ceph-mon2,rook-ceph-mon0
    mgr: a(active)
    mds: ceph-fs-2/2/2 up odd-fs-2/2/2 up {[ceph-fs:0]=ceph-fs-5b997cbf7b-5tjwh=up:active,[ceph-fs:1]=ceph-fs-5b997cbf7b-nstqz=up:active,[user-fs:0]=odd-fs-5668c75f9f-hflps=up:active,[user-fs:1]=odd-fs-5668c75f9f-jf59x=up:active}, 4 up:standby-replay
    osd: 39 osds: 39 up, 38 in

  data:
    pools:   5 pools, 706 pgs
    objects: 91212k objects, 4415 GB
    usage:   10415 GB used, 13024 GB / 23439 GB avail
    pgs:     2593/186803520 objects degraded (0.001%)
             703 active+clean
             2   stale+active+undersized+degraded
             1   active+clean+scrubbing+deep

  io:
    client: 168 kB/s rd, 6336 B/s wr, 10 op/s rd, 1 op/s wr

The offending broken MDS entry (damaged metadata) seems to be this:

mds.ceph-fs-5b997cbf7b-5tjwh: [
    {
        "damage_type": "dir_frag",
        "id": 1190692215,
        "ino": 2199023258131,
        "frag": "*",
        "path": "/f/01/59"
    }
]

Is there any idea how I can diagnose and find out what is wrong? For the
other issues I'm not even sure what/where I need to look into.

Cheers,
Sangwhan
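One hedged pointer for the dir_frag damage above, none of it from the original mail: on a Luminous-or-later MDS the damage table can be listed and, once the underlying RADOS objects are healthy again, cleared. The daemon name and damage ID below are the ones from the report; the scrub_path invocation is an admin-socket command whose exact form varies between releases, and in a Rook deployment it has to run inside the MDS pod:

$ ceph tell mds.ceph-fs-5b997cbf7b-5tjwh damage ls                               # list damage entries like the dir_frag one above
$ ceph daemon mds.ceph-fs-5b997cbf7b-5tjwh scrub_path /f/01/59 recursive repair  # re-scrub and attempt repair of the damaged path
$ ceph tell mds.ceph-fs-5b997cbf7b-5tjwh damage rm 1190692215                    # only once the damage has actually been repaired

As suggested earlier in the thread, the read-only MDS and the metadata damage are probably downstream of the two stale PGs, so getting those back to active+clean comes first.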