Reducing the recovery parameter values did not change much.
A lot of OSDs are still marked down.
I don't know what to do from this point.
[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1
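For reference, a minimal sketch of how the [osd] settings above could be turned into a runtime `ceph tell osd.* injectargs` call instead of editing ceph.conf and restarting the daemons. The helper name `injectargs_command` is made up for illustration, the script only builds and prints the command without running it, and whether each option actually takes effect without an OSD restart is an assumption that should be checked on the cluster.

```python
import configparser
import shlex

# The [osd] section exactly as posted above.
CONF = """
[osd]
osd recovery op priority = 63
osd client op priority = 1
osd recovery max active = 1
osd max scrubs = 1
"""


def injectargs_command(conf_text, target="osd.*"):
    """Build (but do not run) a `ceph tell <target> injectargs` command.

    Hypothetical helper: option names are converted from the
    space-separated ceph.conf form to the dashed command-line form.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    flags = []
    for key, value in parser["osd"].items():
        flags.append("--" + key.replace(" ", "-") + " " + value)
    return ["ceph", "tell", target, "injectargs", " ".join(flags)]


cmd = injectargs_command(CONF)
# Print a shell-safe version of the command for review before running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command can be reviewed and pasted into a shell on a node with an admin keyring.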
ceph -s
cluster:
id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
health: HEALTH_ERR
42 osds down
1 host (6 osds) down
61/8948582 objects unfound (0.001%)
Reduced data availability: 3837 pgs inactive, 1822 pgs down, 1900 pgs peering, 6 pgs stale
Possible data damage: 18 pgs recovery_unfound
Degraded data redundancy: 457246/17897164 objects degraded (2.555%), 213 pgs degraded, 209 pgs undersized
2554 slow requests are blocked > 32 sec
3273 slow ops, oldest one blocked for 1453 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.
services:
mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3, SRV-SEKUARK4
osd: 168 osds: 118 up, 160 in
data:
pools: 1 pools, 4096 pgs
objects: 8.95 M objects, 17 TiB
usage: 33 TiB used, 553 TiB / 586 TiB avail
pgs: 93.677% pgs not active
457246/17897164 objects degraded (2.555%)
61/8948582 objects unfound (0.001%)
1676 down
1372 peering
528 stale+peering
164 active+undersized+degraded
145 stale+down
73 activating
40 active+clean
29 stale+activating
17 active+recovery_unfound+undersized+degraded
16 stale+active+clean
16 stale+active+undersized+degraded
9 activating+undersized+degraded
3 active+recovery_wait+degraded
2 activating+undersized
2 activating+degraded
1 creating+down
1 stale+active+recovery_unfound+undersized+degraded
1 stale+active+clean+scrubbing+deep
1 stale+active+recovery_wait+degraded
ceph -w: https://paste.ubuntu.com/p/WZ2YqzS86S/
ceph health detail: https://paste.ubuntu.com/p/8w7Jpms8fj/
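As a sanity check on the numbers above, here is a small sketch that tallies the PG state counts from the `ceph -s` output and recomputes the "pgs not active" figure. It assumes a PG counts as active only when its state string contains the `active` component (so `activating` does not count); the function name `tally` is just for illustration.

```python
# PG state counts copied from the `ceph -s` output above.
PG_STATES = """
1676 down
1372 peering
528 stale+peering
164 active+undersized+degraded
145 stale+down
73 activating
40 active+clean
29 stale+activating
17 active+recovery_unfound+undersized+degraded
16 stale+active+clean
16 stale+active+undersized+degraded
9 activating+undersized+degraded
3 active+recovery_wait+degraded
2 activating+undersized
2 activating+degraded
1 creating+down
1 stale+active+recovery_unfound+undersized+degraded
1 stale+active+clean+scrubbing+deep
1 stale+active+recovery_wait+degraded
"""


def tally(text):
    """Return (total PGs, PGs whose state lacks the 'active' component)."""
    total = 0
    active = 0
    for line in text.strip().splitlines():
        count_str, state = line.split()
        count = int(count_str)
        total += count
        # Split on '+' so that 'activating' is not mistaken for 'active'.
        if "active" in state.split("+"):
            active += count
    return total, total - active


total, inactive = tally(PG_STATES)
print(f"{inactive}/{total} PGs not active ({100 * inactive / total:.3f}%)")
# prints: 3837/4096 PGs not active (93.677%)
```

The result matches the "3837 pgs inactive" and "93.677% pgs not active" lines reported by the cluster, so the state list above is complete for all 4096 PGs.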
On Tue, 25 Sep 2018 at 14:32, by morphin <[email protected]> wrote:
>
> The config didn't work, because increasing the values led to more OSD
> drops.
>
> bhfs -s
> cluster:
> id: 89569e73-eb89-41a4-9fc9-d2a5ec5f4106
> health: HEALTH_ERR
> norebalance,norecover flag(s) set
> 1 osds down
> 17/8839434 objects unfound (0.000%)
> Reduced data availability: 3578 pgs inactive, 861 pgs down, 1928 pgs peering, 11 pgs stale
> Degraded data redundancy: 44853/17678868 objects degraded (0.254%), 221 pgs degraded, 20 pgs undersized
> 610 slow requests are blocked > 32 sec
> 3996 stuck requests are blocked > 4096 sec
> 6076 slow ops, oldest one blocked for 4129 sec, daemons [osd.0,osd.1,osd.10,osd.100,osd.101,osd.102,osd.103,osd.104,osd.105,osd.106]... have slow ops.
>
> services:
> mon: 3 daemons, quorum SRV-SEKUARK3,SRV-SBKUARK2,SRV-SBKUARK3
> mgr: SRV-SBKUARK2(active), standbys: SRV-SEKUARK2, SRV-SEKUARK3
> osd: 168 osds: 128 up, 129 in; 2 remapped pgs
> flags norebalance,norecover
>
> data:
> pools: 1 pools, 4096 pgs
> objects: 8.84 M objects, 17 TiB
> usage: 26 TiB used, 450 TiB / 477 TiB avail
> pgs: 0.024% pgs unknown
> 89.160% pgs not active
> 44853/17678868 objects degraded (0.254%)
> 17/8839434 objects unfound (0.000%)
> 1612 peering
> 720 down
> 583 activating
> 319 stale+peering
> 255 active+clean
> 157 stale+activating
> 108 stale+down
> 95 activating+degraded
> 84 stale+active+clean
> 50 active+recovery_wait+degraded
> 29 creating+down
> 23 stale+activating+degraded
> 18 stale+active+recovery_wait+degraded
> 14 active+undersized+degraded
> 12 active+recovering+degraded
> 4 stale+creating+down
> 3 stale+active+recovering+degraded
> 3 stale+active+undersized+degraded
> 2 stale
> 1 active+recovery_wait+undersized+degraded
> 1 active+clean+scrubbing+deep
> 1 unknown
> 1 active+undersized+degraded+remapped+backfilling
> 1 active+recovering+undersized+degraded
>
> I guess the OSD down and drop issue increases the recovery time, so I
> decided to try decreasing the recovery parameters to put less load on
> the cluster.
> I have NVMe and SAS disks. The servers are powerful enough. The network is 4x10Gb.
> I don't think my cluster is in bad shape, because I have datacenter
> redundancy (14 servers + 14 servers). The 7 crashed servers are all in
> datacenter A, and it took only a few minutes for them to come back online.
> Also, 2 of them are monitors, so cluster I/O should have been suspended
> and there should be little data difference.
>
> On the other hand, I don't understand the burden of this recovery. I have
> been through many recoveries, but none of them stopped my cluster from
> working. This recovery burden is so high that it hasn't stopped for hours.
> I wish I could just decrease the recovery speed and continue to serve my VMs.
> Is the recovery load somehow different in Mimic?
> Luminous was pretty fine indeed.
> On Tue, 25 Sep 2018 at 13:57, by morphin <[email protected]> wrote:
> >
> > Thank you for the answer.
> >
> > What do you think of this conf to speed up the recovery?
> >
> > [osd]
> > osd recovery op priority = 63
> > osd client op priority = 1
> > osd recovery max active = 16
> > osd max scrubs = 16
> > On Tue, 25 Sep 2018 at 13:37, <[email protected]> wrote:
> > >
> > > Just let it recover.
> > >
> > > data:
> > > pools: 1 pools, 4096 pgs
> > > objects: 8.95 M objects, 17 TiB
> > > usage: 34 TiB used, 577 TiB / 611 TiB avail
> > > pgs: 94.873% pgs not active
> > > 48475/17901254 objects degraded (0.271%)
> > > 1/8950627 objects unfound (0.000%)
> > > 2631 peering
> > > 637 activating
> > > 562 down
> > > 159 active+clean
> > > 44 activating+degraded
> > > 30 active+recovery_wait+degraded
> > > 12 activating+undersized+degraded
> > > 10 active+recovering+degraded
> > > 10 active+undersized+degraded
> > > 1 active+clean+scrubbing+deep
> > >
> > > You've got PGs being deep scrubbed, which puts considerable I/O load on the OSDs.
> > >
> > >
> > > September 25, 2018 1:23 PM, "by morphin" <[email protected]> wrote:
> > >
> > >
> > > > What should I do now?
> > > >
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com