On Wed, Jan 31, 2018 at 9:01 PM Philip Poten <philip.po...@gmail.com> wrote:
> 2018-01-31 19:20 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>
>> On Wed, Jan 31, 2018 at 1:40 AM Philip Poten <philip.po...@gmail.com> wrote:
>>>
>>> Hello,
>>>
>>> I have this error message:
>>>
>>> 2018-01-25 00:59:27.357916 7fd646ae1700 -1 osd.3 pg_epoch: 9393 pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] local-lis/les=9392/9393 n=10003 ec=1478/1478 lis/c 9392/6304 les/c/f 9393/6307/807 9391/9392/9392) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2) crt=8799'82397 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
>>
>> The line prior to what you pasted should have the object name in it.
>
> Ok, since I was unable to find this information with the usual methods, I'll follow up on this:
>
> -2> 2018-02-01 04:27:27.414658 7f0c7a521700 -1 osd.3 pg_epoch: 10329 pg[9.139s0( v 10329'83100 (5494'79049,10329'83100] local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f 10322/6307/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=10321 pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099 mlcod 10329'83099 active+undersized+degraded+remapped+backfilling] recover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head
> -1> 2018-02-01 04:27:27.414774 7f0c7a521700 -1 osd.3 pg_epoch: 10329 pg[9.139s0( v 10329'83100 (5494'79049,10329'83100] local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f 10322/6307/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=10321 pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099 mlcod 10329'83099 active+undersized+degraded+remapped+backfilling] recover_replicas: object added to missing set for backfill, but is not in recovering, error!
> 0> 2018-02-01 04:27:27.421623 7f0c7a521700 -1 *** Caught signal (Aborted) **
> in thread 7f0c7a521700 thread_name:tp_osd_tp
>
> ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable)
>
> So apparently, "recover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head" is the interesting part. But what exactly does it mean?
>
> First, I looked into the contents of the pool and what the keys looked like (rados -p cephfs-data ls). From that I figured that the object key isn't the whole thing, but only the something.something part. Then I tried retrieving them manually:
>
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get "1000021235e.000008dc" foo
> error getting cephfs-data/1000021235e.000008dc: (5) Input/output error
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get "100000f6bd9.000001a3" foo
> root@lxt-prod-ceph-mon02:~#
>
> This suggested that the 08dc key is indeed the culprit, and that 01a3 is probably just the last object that was backfilled (?). It also told me that the offending object was, luckily, part of the cephfs-data pool and not cephfs-metadata. Phew.
>
> So I tried removing the offending key:
>
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data rm "1000021235e.000008dc"
> root@lxt-prod-ceph-mon02:~#
>
> and restarted backfills. And wouldn't you believe it, it restarted backfilling without the OSD crashing!
>
> I haven't found a way to determine which cephfs file the object belongs to yet, so if you can guide me on this, please let me know. But I'm sure it will make itself known sooner or later when someone attempts to read it *cough*.

The object is named after the inode number in hex (1000021235e), followed by the index of the object within the file (also in hex, starting from 0: 000008dc). If you look at the zeroth object in that file, it will have a backtrace xattr which contains an encoded version of the file path.
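That naming scheme can be sketched in a few lines. This is only a sketch, and it assumes the default CephFS file layout of 4 MiB objects; a custom layout changes the offset-to-index mapping:

```python
# Sketch of CephFS data-object naming: "<inode hex>.<object index, 8 hex digits>".
# Assumes the default 4 MiB object size; custom file layouts differ.
OBJECT_SIZE = 4 * 1024 * 1024

def object_name(inode_hex: str, index: int) -> str:
    """Name of the index-th RADOS object backing the file with this inode."""
    return f"{inode_hex}.{index:08x}"

def object_for_offset(inode_hex: str, offset: int) -> str:
    """Which object holds the byte at `offset` in the file."""
    return object_name(inode_hex, offset // OBJECT_SIZE)

# The crashing object from the log: inode 0x1000021235e, object index 0x8dc,
# i.e. the chunk roughly 9 GiB into the file (0x8dc * 4 MiB).
print(object_name("1000021235e", 0x8dc))  # -> 1000021235e.000008dc
print(object_name("1000021235e", 0))      # -> 1000021235e.00000000 (holds the backtrace xattr)
```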
You can use the ceph-dencoder tool to look at the real data if you have to, but just dumping it as ascii should get you there.
-Greg

> Thanks for your very helpful hint Greg!
>
> Philip
>
> PS: It was one damaged object that prevented me from moving the last degraded pg completely off a broken harddisk, and that prevented the whole cluster from being maintainable... that really shouldn't happen, in my opinion.
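On the "dump it as ascii" route: the backtrace lives in the `parent` xattr of the file's zeroth object (fetched with e.g. `rados -p cephfs-data getxattr 1000021235e.00000000 parent`). It is a binary-encoded struct, but the ancestor dentry names inside it are plain bytes, so a crude strings-style extraction is usually enough to reconstruct the path. A sketch of that extraction; the sample blob below is made up to illustrate the idea and is not real inode_backtrace_t bytes:

```python
import re

def printable_runs(blob: bytes, min_len: int = 3) -> list:
    """Crude `strings`-style dump: pull runs of printable ASCII out of a
    binary blob. For a backtrace xattr these runs are typically the dentry
    names, from the file itself up toward the filesystem root."""
    return [m.group().decode() for m in re.finditer(rb"[ -~]{%d,}" % min_len, blob)]

# Made-up blob standing in for an encoded backtrace (dentry names amid
# binary length/version fields) -- NOT the real on-disk encoding:
blob = b"\x05\x02\x1e\x00bigfile.bin\x00\x00\x03somedir\x00\x07data"
print(printable_runs(blob))  # -> ['bigfile.bin', 'somedir', 'data']
```

For the structured version, ceph-dencoder can decode the saved xattr, along the lines of `ceph-dencoder type inode_backtrace_t import <file> decode dump_json`.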
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com