On Wed, Jan 31, 2018 at 9:01 PM Philip Poten <philip.po...@gmail.com> wrote:

> 2018-01-31 19:20 GMT+01:00 Gregory Farnum <gfar...@redhat.com>:
>
>> On Wed, Jan 31, 2018 at 1:40 AM Philip Poten <philip.po...@gmail.com>
>> wrote:
>>
> Hello,
>>>
>>> I have this error message:
>>>
>>> 2018-01-25 00:59:27.357916 7fd646ae1700 -1 osd.3 pg_epoch: 9393
>>> pg[9.139s0( v 8799'82397 (5494'79049,8799'82397] local-lis/les=9392/9393
>>> n=10003 ec=1478/1478 lis/c 9392/6304 les/c/f 9393/6307/807 9391/9392/9392)
>>> [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=9392 pi=[6304,9392)/3 bft=9(3),12(2)
>>> crt=8799'82397 lcod 0'0 mlcod 0'0
>>> active+undersized+degraded+remapped+backfilling] recover_replicas: object
>>> added to missing set for backfill, but is not in recovering, error!
>>>
>>
>> The line prior to what you pasted should have the object name in it.
>>
>
> Ok, so since I was unable to find this information with the usual methods,
> I'll follow up on this:
>
>     -2> 2018-02-01 04:27:27.414658 7f0c7a521700 -1 osd.3 pg_epoch: 10329
> pg[9.139s0( v 10329'83100 (5494'79049,10329'83100]
> local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f
> 10322/630
> 7/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0 lpr=10321
> pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099 mlcod
> 10329'83099 active+undersized+degraded+remapped+backfilling] re
> cover_replicas: object 9:9ccec3b7:::1000021235e.000008dc:head
> last_backfill 9:9ccec1a8:::100000f6bd9.000001a3:head
>     -1> 2018-02-01 04:27:27.414774 7f0c7a521700 -1 osd.3 pg_epoch: 10329
> pg[9.139s0( v 10329'83100 (5494'79049,10329'83100]
> local-lis/les=10321/10322 n=9979 ec=1478/1478 lis/c 10321/6304 les/c/f
> 10322/6307/807 10318/10321/10318) [3,6,12,9]/[3,6,2147483647,4] r=0
> lpr=10321 pi=[6304,10321)/3 bft=9(3),12(2) crt=10329'83099 lcod 10329'83099
> mlcod 10329'83099 active+undersized+degraded+remapped+backfilling]
> recover_replicas: object added to missing set for backfill, but is not in
> recovering, error!
>      0> 2018-02-01 04:27:27.421623 7f0c7a521700 -1 *** Caught signal
> (Aborted) **
>  in thread 7f0c7a521700 thread_name:tp_osd_tp
>  ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous
> (stable)
>
> So apparently, "recover_replicas: object
> 9:9ccec3b7:::1000021235e.000008dc:head last_backfill
> 9:9ccec1a8:::100000f6bd9.000001a3:head" is the interesting part. But what
> exactly does it mean?
>
> First, I looked at the contents of the pool and what the keys looked like
> (rados -p cephfs-data ls). From that I figured the object key isn't the
> whole thing, but only the something.something part. Then I tried retrieving
> the two objects manually:
>
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get
> "1000021235e.000008dc" foo
> error getting cephfs-data/1000021235e.000008dc: (5) Input/output error
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data get
> "100000f6bd9.000001a3" foo
> root@lxt-prod-ceph-mon02:~#
>
> This suggested that the 08dc object is indeed the culprit, and that 01a3 is
> probably just the last object that was backfilled (?). It also told me that
> the offending object was luckily part of the cephfs-data pool, not the
> cephfs-metadata pool. Phew.
>
> So I tried removing the offending key:
>
> root@lxt-prod-ceph-mon02:~# rados -p cephfs-data rm
> "1000021235e.000008dc"
> root@lxt-prod-ceph-mon02:~#
>
> and restarted backfills. And wouldn't you believe it, it restarted
> backfilling without the OSD crashing!
>
> I haven't found a way to determine which CephFS file the object belongs to
> yet, so if you can guide me on this, please let me know. But I'm sure sooner
> or later it will make itself known anyway, when someone attempts to read it
> *cough*.
>

The object is named after the inode number in hex (1000021235e), followed by
the index of the object within the file (also in hex, starting from 0 — here
000008dc).
If you look at the zeroth object of that file, it will have a backtrace
xattr (named "parent") which contains an encoded version of the file path.
You can use the ceph-dencoder tool to look at the real data if you have to,
but just dumping it as ASCII should get you there.
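The naming scheme described above can be sketched in a few lines of Python (a
hedged illustration, not Ceph code; the inode and index values are taken from
the crash log earlier in this thread):

```python
def cephfs_object_name(inode: int, index: int) -> str:
    # CephFS names data-pool objects "<inode in hex>.<object index, 8 hex digits>"
    return "%x.%08x" % (inode, index)

# The object from the crash log: inode 0x1000021235e, object index 0x8dc
print(cephfs_object_name(0x1000021235e, 0x8dc))  # 1000021235e.000008dc

# The zeroth object of the same file, which carries the backtrace xattr
print(cephfs_object_name(0x1000021235e, 0))      # 1000021235e.00000000

# If the filesystem is still mounted, the inode in decimal can also be fed
# to "find <mountpoint> -inum <n>" to locate the file by inode number
print(int("1000021235e", 16))                    # 1099513799518
```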
-Greg



> Thanks for your very helpful hint, Greg!
>
> Philip
>
> PS: it was one damaged object that prevented me from moving the last
> degraded PG completely off a broken hard disk, and that kept the whole
> cluster from being maintainable... that really shouldn't happen, in my
> opinion.
>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com