The "best" health I was able to get was:
HEALTH_ERR norecover flag(s) set; 1733/37482459 objects misplaced (0.005%); 5
scrub errors; Possible data damage: 2 pgs inconsistent; Degraded data
redundancy: 7461/37482459 objects degraded (0.020%), 24 pgs degraded, 2 pgs
undersized
OSDMAP_FLAGS norecover flag(s) set
OBJECT_MISPLACED 1733/37482459 objects misplaced (0.005%)
OSD_SCRUB_ERRORS 5 scrub errors
PG_DAMAGED Possible data damage: 2 pgs inconsistent
pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
PG_DEGRADED Degraded data redundancy: 7461/37482459 objects degraded (0.020%),
24 pgs degraded, 2 pgs undersized
pg 2.3e is active+recovery_wait+degraded, acting [27,31,5]
pg 2.9d is active+recovery_wait+degraded, acting [27,22,37]
pg 2.a3 is active+recovery_wait+degraded, acting [27,30,35]
pg 2.136 is active+recovery_wait+degraded, acting [27,18,22]
pg 2.150 is active+recovery_wait+degraded, acting [27,19,35]
pg 2.15e is active+recovery_wait+degraded, acting [27,11,36]
pg 2.1d9 is stuck undersized for 14023.243179, current state
active+undersized+degraded+remapped+backfill_wait, last acting [25,30]
pg 2.20f is active+recovery_wait+degraded, acting [27,30,2]
pg 2.2a1 is active+recovery_wait+degraded, acting [27,18,35]
pg 2.2b7 is active+recovery_wait+degraded, acting [27,18,36]
pg 2.386 is active+recovery_wait+degraded, acting [27,42,17]
pg 2.391 is active+recovery_wait+degraded, acting [27,15,36]
pg 2.448 is stuck undersized for 51520.798900, current state
active+recovery_wait+undersized+degraded+remapped, last acting [27,38]
pg 2.456 is active+recovery_wait+degraded, acting [27,5,43]
pg 2.45a is active+recovery_wait+degraded, acting [27,43,36]
pg 2.45f is active+recovery_wait+degraded, acting [27,16,36]
pg 2.46c is active+recovery_wait+degraded, acting [27,30,38]
pg 2.4bf is active+recovery_wait+degraded, acting [27,39,18]
pg 2.522 is active+recovery_wait+degraded, acting [27,17,3]
pg 2.535 is active+recovery_wait+degraded, acting [27,29,36]
pg 2.55a is active+recovery_wait+degraded, acting [27,29,18]
pg 5.23f is active+recovery_wait+degraded, acting [27,39,18]
pg 5.356 is active+recovery_wait+degraded, acting [27,36,15]
pg 5.4a6 is active+recovery_wait+degraded, acting [29,40,30]
After that, the flapping started again:
2019-10-30 12:55:46.772593 mon.r730xd1 [INF] osd.38 failed
(root=default,datacenter=IPR,room=11B,rack=baie2,host=r740xd1) (connection
refused reported by osd.22)
2019-10-30 12:55:46.850239 mon.r730xd1 [INF] osd.27 failed
(root=default,datacenter=IPR,room=11B,rack=baie2,host=r730xd3) (connection
refused reported by osd.19)
2019-10-30 12:55:56.714029 mon.r730xd1 [WRN] Health check update: 2 osds down
(OSD_DOWN)
Setting the "norecover" flag allows these two OSDs to come back up and
limits the flapping and the storm of backfills.
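For reference, this is just the standard OSD map flag handling; a minimal sketch, assuming admin keyring access on a mon node:

```shell
# Pause object recovery cluster-wide so the flapping OSDs can stay up
ceph osd set norecover

# Watch the OSDs come back up and the flapping stop
ceph -s

# Once things are stable, let recovery resume (and the crash likely recur)
ceph osd unset norecover
```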
In the logs of both osd.27 and osd.38, I still find these entries just
before the FAILED assert:
-2> 2019-10-30 12:52:31.999571 7fb5c125b700 5 -- 129.20.177.3:6802/870834
>> 129.20.177.3:6808/810999 conn(0x564c31d31000 :6802
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=97 cs=7 l=0). rx osd.25 seq
2600 0x564c3a02e1c0 MOSDPGPush(2.1d9 194334/194298
[PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version:
127481'7241006, data_included: [], data_size: 0, omap_header_size: 0,
omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006,
size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
-1> 2019-10-30 12:52:31.999633 7fb5c125b700 1 -- 129.20.177.3:6802/870834
<== osd.25 129.20.177.3:6808/810999 2600 ==== MOSDPGPush(2.1d9 194334/194298
[PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version:
127481'7241006, data_included: [], data_size: 0, omap_header_size: 0,
omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006,
size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false, error:false))])
v3 ==== 909+0+0 (4137937376 0 0) 0x564c3a02e1c0 con 0x564c31d31000
0> 2019-10-30 12:52:32.008605 7fb59b03d700 -1
/build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&,
ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fb59b03d700 time
2019-10-30 12:52:32.000033 /build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354:
FAILED assert(recovery_info.oi.legacy_snaps.size())
-2> 2019-10-30 12:52:31.998835 7fce6189b700 5 -- 129.20.177.4:6810/2125155
>> 129.20.177.3:6808/810999 conn(0x5648d74a0800 :-1
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=76 cs=1 l=0). rx osd.25 seq
24804 0x5648d708eac0 MOSDPGPush(2.1d9 194334/194298
[PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version:
127481'7241006, data_included: [], data_size: 0, omap_header_size: 0,
omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006,
size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false, error:false))]) v3
-1> 2019-10-30 12:52:31.998899 7fce6189b700 1 -- 129.20.177.4:6810/2125155
<== osd.25 129.20.177.3:6808/810999 24804 ==== MOSDPGPush(2.1d9 194334/194298
[PushOp(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926, version:
127481'7241006, data_included: [], data_size: 0, omap_header_size: 0,
omap_entries_size: 0, attrset_size: 1, recovery_info:
ObjectRecoveryInfo(2:9b97b818:::rbd_data.0c16b76b8b4567.000000000001426e:5926@127481'7241006,
size: 4194304, copy_subset: [], clone_subset: {}, snapset: 0=[]:[]),
after_progress: ObjectRecoveryProgress(!first, data_recovered_to:0,
data_complete:true, omap_recovered_to:, omap_complete:true, error:false),
before_progress: ObjectRecoveryProgress(first, data_recovered_to:0,
data_complete:false, omap_recovered_to:, omap_complete:false, error:false))])
v3 ==== 925+0+0 (542499734 0 0) 0x5648d708eac0 con 0x5648d74a0800
0> 2019-10-30 12:52:32.003339 7fce4803e700 -1
/build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: In function 'virtual void
PrimaryLogPG::on_local_recover(const hobject_t&, const ObjectRecoveryInfo&,
ObjectContextRef, bool, ObjectStore::Transaction*)' thread 7fce4803e700 time
2019-10-30 12:52:31.999086
/build/ceph-12.2.12/src/osd/PrimaryLogPG.cc: 354: FAILED
assert(recovery_info.oi.legacy_snaps.size())
More information about this object (0c16b76b8b4567.000000000001426e)
from PG 2.1d9:
ceph osd map rbd rbd_data.0c16b76b8b4567.000000000001426e
osdmap e194356 pool 'rbd' (2) object 'rbd_data.0c16b76b8b4567.000000000001426e'
-> pg 2.181de9d9 (2.1d9) -> up ([27,30,38], p27) acting ([30,25], p30)
I also checked the logs of all the OSDs that had already failed and found
the same entries about this object:
* osd.4, last time: 2019-10-10 16:15:20
* osd.32, last time: 2019-10-14 01:54:56
* osd.33, last time: 2019-10-11 06:24:01
* osd.34, last time: 2019-10-18 06:24:26
* osd.20, last time: 2019-10-27 18:12:31
* osd.28, last time: 2019-10-28 12:57:47
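(That check was essentially a grep over each failed OSD's log for the same assert; a sketch, assuming the default Debian log paths and that rotated logs are still uncompressed:)

```shell
# For each previously crashed OSD, show the most recent legacy_snaps assert
for id in 4 32 33 34 20 28; do
  echo "== osd.$id =="
  grep -h 'legacy_snaps' /var/log/ceph/ceph-osd.$id.log* 2>/dev/null | tail -n 1
done
```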
Whether the data comes from osd.25 or from osd.30, I get the same
error. It seems this PG/object tries to recover to a healthy state but
shuts down my OSDs one by one…
Thus spake Jérémy Gardais ([email protected]) on Wednesday 30
October 2019 at 11:09:36:
> Thus spake Brad Hubbard ([email protected]) on Wednesday 30 October 2019 at
> 12:50:50:
> > You should probably try and work out what caused the issue and take
> > steps to minimise the likelihood of a recurrence. This is not expected
> > behaviour in a correctly configured and stable environment.
This PG 2.1d9 is "only" marked as
"active+undersized+degraded+remapped+backfill_wait", not inconsistent…
Everything I got from PG 2.1d9 (query, list_missing,
ceph-objectstore-tool list,…) is available here:
https://cloud.ipr.univ-rennes1.fr/index.php/s/BYtuAURnC7YOAQG?path=%2Fpg.2.1d9
But nothing looks suspicious to me…
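For completeness, those dumps were produced with commands along these lines; the ceph-objectstore-tool pass assumes the OSD daemon is stopped first and the default data path:

```shell
# Online PG inspection via the mons
ceph pg 2.1d9 query
ceph pg 2.1d9 list_missing

# Offline object listing on one acting OSD (the daemon must be stopped)
systemctl stop ceph-osd@25
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-25 --pgid 2.1d9 --op list
systemctl start ceph-osd@25
```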
I also separated out the logs from the last error on osd.27 and its
restart ("only" ~22k lines ^^):
https://cloud.ipr.univ-rennes1.fr/index.php/s/BYtuAURnC7YOAQG/download?path=%2F&files=ceph-osd.27.log.last.error.txt
Does anybody understand these logs, or do I have to live with this damned
object? ^^
--
Gardais Jérémy
Institut de Physique de Rennes
Université Rennes 1
Phone: 02-23-23-68-60
Mail & good practices: http://fr.wikipedia.org/wiki/Nétiquette