On Tue, Oct 29, 2019 at 9:09 PM Jérémy Gardais
<[email protected]> wrote:
>
> Thus spake Brad Hubbard ([email protected]) on Tuesday 29 October 2019 at
> 08:20:31:
> > Yes, try and get the pgs healthy, then you can just re-provision the down
> > OSDs.
> >
> > Run a scrub on each of these pgs and then use the commands on the
> > following page to find out more information for each case.
> >
> > https://docs.ceph.com/docs/luminous/rados/troubleshooting/troubleshooting-pg/
> >
> > Focus on the commands 'list-missing', 'list-inconsistent-obj', and
> > 'list-inconsistent-snapset'.
> >
> > Let us know if you get stuck.
> >
> > P.S. There are several threads about these sorts of issues in this
> > mailing list that should turn up when doing a web search.
>
> I found this thread:
> https://www.mail-archive.com/[email protected]/msg53116.html
That looks like the same issue.
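For reference, the loop I had in mind looks roughly like this (a sketch
only, using pg 2.2ba as the example; deep-scrub is more thorough than a
plain scrub):

  # kick off a scrub and wait for it to finish
  ceph pg deep-scrub 2.2ba

  # then ask for the details of what the scrub found
  rados list-inconsistent-obj 2.2ba --format=json-pretty
  rados list-inconsistent-snapset 2.2ba --format=json-pretty

  # and for objects the cluster knows about but cannot find
  ceph pg 2.2ba list_missing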
>
> And I started gathering additional information to solve PG 2.2ba:
> 1. rados list-inconsistent-snapset 2.2ba --format=json-pretty
> {
> "epoch": 192223,
> "inconsistents": [
> {
> "name": "rbd_data.b4537a2ae8944a.000000000000425f",
> "nspace": "",
> "locator": "",
> "snap": 22772,
> "errors": [
> "headless"
> ]
> },
> {
> "name": "rbd_data.b4537a2ae8944a.000000000000425f",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "snapset": {
> "snap_context": {
> "seq": 22806,
> "snaps": [
> 22805,
> 22804,
> 22674,
> 22619,
> 20536,
> 17248,
> 14270
> ]
> },
> "head_exists": 1,
> "clones": [
> {
> "snap": 17248,
> "size": 4194304,
> "overlap": "[0~2269184,2277376~1916928]",
> "snaps": [
> 17248
> ]
> },
> {
> "snap": 20536,
> "size": 4194304,
> "overlap": "[0~2269184,2277376~1916928]",
> "snaps": [
> 20536
> ]
> },
> {
> "snap": 22625,
> "size": 4194304,
> "overlap": "[0~2269184,2277376~1916928]",
> "snaps": [
> 22619
> ]
> },
> {
> "snap": 22674,
> "size": 4194304,
> "overlap": "[266240~4096]",
> "snaps": [
> 22674
> ]
> },
> {
> "snap": 22805,
> "size": 4194304,
> "overlap":
> "[0~942080,958464~901120,1875968~16384,1908736~360448,2285568~1908736]",
> "snaps": [
> 22805,
> 22804
> ]
> }
> ]
> },
> "errors": [
> "extra_clones"
> ],
> "extra clones": [
> 22772
> ]
> }
> ]
> }
>
> 2.a ceph-objectstore-tool from osd.29 and osd.42 :
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba \
>     --op list rbd_data.b4537a2ae8944a.000000000000425f
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
>
> 2.b ceph-objectstore-tool from osd.30 :
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30/ --pgid 2.2ba \
>     --op list rbd_data.b4537a2ae8944a.000000000000425f
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":17248,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":20536,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22625,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22674,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22805,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
> ["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":-2,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]
>
> I needed to shut down the OSD service (30, 29, then 42) to be able to
> get any result. Otherwise I only got these errors:
> Mount failed with '(11) Resource temporarily unavailable'
> Or
> OSD has the store locked
Yes, the object store tool requires the OSD to be shut down.
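The usual pattern is something like the following, assuming a systemd
deployment where the OSDs run as ceph-osd@<id> units (adjust to whatever
your distribution uses):

  ceph osd set noout            # don't rebalance while the OSD is down
  systemctl stop ceph-osd@29    # releases the store lock
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-29/ --pgid 2.2ba \
      --op list rbd_data.b4537a2ae8944a.000000000000425f
  systemctl start ceph-osd@29
  ceph osd unset noout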
>
>
>
> Without my doing anything else, 2 OSDs started flapping (osd.38 and
> osd.27), with 1 PG switching between inactive, down, and up…:
Maybe you should set nodown and noout while you do these maneuvers?
That will minimise peering and recovery (data movement).
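That is, before you start poking at the OSDs:

  ceph osd set nodown
  ceph osd set noout

and once everything is back up and stable:

  ceph osd unset nodown
  ceph osd unset noout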
>
> HEALTH_ERR 2 osds down; 12128/37456062 objects misplaced (0.032%); 4 scrub
> errors; Reduced data availability: 1 pg inactive, 1 pg down; Possible data
> damage: 2 pgs inconsistent; Degraded data redundancy: 2264342/37456062
> objects degraded (6.045%), 859 pgs degraded
> OSD_DOWN 2 osds down
> osd.27 (root=default,datacenter=IPR,room=11B,rack=baie2,host=r730xd3) is
> down
> osd.38 (root=default,datacenter=IPR,room=11B,rack=baie2,host=r740xd1) is
> down
> OBJECT_MISPLACED 12128/37456062 objects misplaced (0.032%)
> OSD_SCRUB_ERRORS 4 scrub errors
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg down
> pg 2.448 is down, acting [0]
> PG_DAMAGED Possible data damage: 2 pgs inconsistent
> pg 2.2ba is active+clean+inconsistent, acting [42,29,30]
> pg 2.2bb is active+clean+inconsistent, acting [25,42,18]
> pg 2.371 is active+undersized+degraded+remapped+inconsistent+backfill_wait, acting [42,9]
> …
>
>
> If I understood the previous thread correctly, I should remove
> snapid 22772 from osd.29 and osd.42:
> ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-29/ \
>     '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' \
>     remove
> ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-42/ \
>     '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' \
>     remove
That looks right.
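Two practical notes: the object spec has to be single-quoted so the shell
hands it to the tool in one piece, and, as above, each OSD has to be
stopped while you operate on it. Sketching it for osd.29 (same again for
osd.42 with ceph-42):

  systemctl stop ceph-osd@29
  ceph-objectstore-tool --pgid 2.2ba --data-path /var/lib/ceph/osd/ceph-29/ \
      '["2.2ba",{"oid":"rbd_data.b4537a2ae8944a.000000000000425f","key":"","snapid":22772,"hash":719609530,"max":0,"pool":2,"namespace":"","max":0}]' \
      remove
  systemctl start ceph-osd@29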
>
> Do I still need to shut down the service first, or am I missing an important thing?
Yes.
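Once the clone is removed on both OSDs and they are back up, re-run a deep
scrub on the pg and check that the inconsistency is gone:

  ceph pg deep-scrub 2.2ba
  rados list-inconsistent-snapset 2.2ba --format=json-pretty   # should report no inconsistents
  ceph health detail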
>
> Sorry for the noob noise, I'm not really comfortable with the current
> state of my cluster -_-
You should probably try and work out what caused the issue and take
steps to minimise the likelihood of a recurrence. This is not expected
behaviour in a correctly configured and stable environment.
>
> --
> Gardais Jérémy
> Institut de Physique de Rennes
> Université Rennes 1
> Telephone: 02-23-23-68-60
> Mail & good practice: http://fr.wikipedia.org/wiki/Nétiquette
> -------------------------------
--
Cheers,
Brad
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com