Any update on this? I have the same problem.
# for i in $(cat pg_dump | grep 'stale+active+clean' | awk '{print $1}'); \
    do echo -n "$i: "; rados list-inconsistent-obj "$i"; echo; done
107.ff: {"epoch":10762,"inconsistents":[]}
.....
and so on for the 49 PGs that I thought had a problem.
# ceph tell mds.ceph damage ls | python -m "json.tool"
2017-07-18 16:28:08.657673 7f766e629700 0 client.1923797 ms_handle_reset on 192.168.1.10:6800/1268574779
2017-07-18 16:28:08.665693 7f76577fe700 0 client.1923798 ms_handle_reset on 192.168.1.10:6800/1268574779
[
    {
        "damage_type": "dir_frag",
        "frag": "*",
        "id": 4153356868,
        "ino": 1099511661266
    }
]
# cat dirs_ceph_inodes | grep 1099511661266
1099511661266 drwxr-xr-x 1 chris chris 1 May 4 2017 /mnt/ceph/2017/05/04/
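(dirs_ceph_inodes is a listing I'd generated earlier; something along
these lines would produce that format:)

# find /mnt/ceph -type d -exec ls -dli {} + > dirs_ceph_inodes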
# rm -rf /mnt/ceph/2017/05/04/
rm: cannot remove ‘/mnt/ceph/2017/05/04/’: Directory not empty
# ceph -v
ceph version 11.2.0 (f223e27eeb35991352ebc1f67423d4ebc252adb7)
All OSDs are on 11.2.0.
I am OK with losing the directory and its contents... I just need to get
back to happy pastures.
Thanks,
/Chris Callegari
On Tue, May 30, 2017 at 7:42 AM, James Eckersall <[email protected]>
wrote:
> Further to this, we managed to repair the inconsistent PG by comparing the
> object digests and removing the replica that didn't match (3 of 4 replicas
> had the same digest, 1 didn't), then issuing a pg repair and scrub.
> This has removed the inconsistent flag on the PG; however, we are still
> seeing the mds report damage.
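>
> Roughly, the sequence was as follows (the OSD id and object name here are
> examples from our cluster; the OSD holding the odd digest out must be
> stopped first):
>
> # systemctl stop ceph-osd@11
> # ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
>     --journal-path /var/lib/ceph/osd/ceph-11/journal \
>     --pgid 2.9 10000411194.00000000 remove
> # systemctl start ceph-osd@11
> # ceph pg repair 2.9
> # ceph pg deep-scrub 2.9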
>
> We tried removing the damage from the mds with damage rm, then ran a
> recursive stat across the problem directory, but the damage re-appeared.
> We also tried doing a scrub_path, but the command returned code -2 and the
> mds log shows that the scrub started and finished less than 1ms later.
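>
> For reference, this is roughly what we ran (the damage id, mds name and
> path are examples):
>
> # ceph tell mds.0 damage rm 5129156
> # find /mnt/cephfs/problem/dir -exec stat {} + > /dev/null
> # ceph daemon mds.mds1 scrub_path /problem/dir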
>
> Any further help is greatly appreciated.
>
> On 17 May 2017 at 10:58, James Eckersall <[email protected]>
> wrote:
>
>> An update to this. The cluster has been upgraded to Kraken, but I've
>> still got the same PG reporting inconsistent and the same error message
>> about mds metadata damage.
>> Can anyone offer any further advice, please?
>> If you need output from ceph-osdomap-tool, could you please explain how
>> to use it? I haven't been able to find any docs that explain it.
>>
>> Thanks
>> J
>>
>> On 3 May 2017 at 14:35, James Eckersall <[email protected]>
>> wrote:
>>
>>> Hi David,
>>>
>>> Thanks for the reply, it's appreciated.
>>> We're going to upgrade the cluster to Kraken and see if that fixes the
>>> metadata issue.
>>>
>>> J
>>>
>>> On 2 May 2017 at 17:00, David Zafman <[email protected]> wrote:
>>>
>>>>
>>>> James,
>>>>
>>>> You have an omap corruption. It is likely caused by a bug which
>>>> has already been identified. A fix for that problem is available but it is
>>>> still pending backport for the next Jewel point release. All 4 of your
>>>> replicas have different "omap_digest" values.
>>>>
>>>> Instead of the xattrs, the ceph-osdomap-tool --command
>>>> dump-objects-with-keys output from OSDs 3, 10, 11 and 23 would be
>>>> interesting to compare.
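>>>>
>>>> Something along these lines, run on each OSD host with the OSD stopped
>>>> (the omap path shown is the FileStore default; adjust to your layout):
>>>>
>>>> # systemctl stop ceph-osd@3
>>>> # ceph-osdomap-tool --omap-path /var/lib/ceph/osd/ceph-3/current/omap \
>>>>     --command dump-objects-with-keys > osd3-omap.txt
>>>> # systemctl start ceph-osd@3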
>>>>
>>>> ***WARNING*** Please backup your data before doing any repair attempts.
>>>>
>>>> If you can upgrade to Kraken v11.2.0, it will auto repair the omaps on
>>>> ceph-osd start up. It will likely still require a ceph pg repair to make
>>>> the 4 replicas consistent with each other. The final result may be the
>>>> reappearance of removed MDS files in the directory.
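>>>>
>>>> That is, after restarting each OSD on v11.2.0, roughly:
>>>>
>>>> # ceph pg repair 2.9
>>>> # ceph pg deep-scrub 2.9
>>>> # rados list-inconsistent-obj 2.9   # should eventually report no inconsistents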
>>>>
>>>> If you can recover the data, you could remove the directory entirely
>>>> and rebuild it. The original bug was typically triggered during omap
>>>> deletion in a large directory, which corresponds to an individual
>>>> unlink in cephfs.
>>>>
>>>> If you can build a branch from GitHub to get the newer
>>>> ceph-osdomap-tool, you could try using it to repair the omaps.
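>>>>
>>>> Roughly (the branch name below is a placeholder; exact build steps
>>>> depend on the release you build against):
>>>>
>>>> $ git clone https://github.com/ceph/ceph.git && cd ceph
>>>> $ git checkout <branch-with-fix>
>>>> $ ./install-deps.sh
>>>> $ ./do_cmake.sh && cd build && make ceph-osdomap-tool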
>>>>
>>>> David
>>>>
>>>>
>>>> On 5/2/17 5:05 AM, James Eckersall wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm having some issues with a ceph cluster. It's an 8-node cluster
>>>> running Jewel ceph-10.2.7-0.el7.x86_64 on CentOS 7.
>>>> This cluster provides RBDs and a CephFS filesystem to a number of clients.
>>>>
>>>> ceph health detail is showing the following errors:
>>>>
>>>> pg 2.9 is active+clean+inconsistent, acting [3,10,11,23]
>>>> 1 scrub errors
>>>> mds0: Metadata damage detected
>>>>
>>>>
>>>> The pg 2.9 is in the cephfs_metadata pool (id 2).
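>>>>
>>>> (For anyone following along, the pool/PG mapping and acting set can be
>>>> confirmed with:)
>>>>
>>>> # ceph pg map 2.9
>>>> # ceph osd pool ls detail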
>>>>
>>>> I've looked at the OSD logs for OSD 3, which is the primary for this PG,
>>>> but the only entry relating to this PG is the following:
>>>>
>>>> log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors
>>>>
>>>> After initiating a ceph pg repair 2.9, I see the following in the primary
>>>> OSD log:
>>>>
>>>> log_channel(cluster) log [ERR] : 2.9 repair 1 errors, 0 fixed
>>>> log_channel(cluster) log [ERR] : 2.9 deep-scrub 1 errors
>>>>
>>>>
>>>> I found the below command in a previous ceph-users post. Running this
>>>> returns the following:
>>>>
>>>> # rados list-inconsistent-obj 2.9
>>>> {"epoch":23738,"inconsistents":[{"object":{"name":"10000411194.00000000","nspace":"","locator":"","snap":"head","version":14737091},"errors":["omap_digest_mismatch"],"union_shard_errors":[],"selected_object_info":"2:9758b358:::10000411194.00000000:head(33456'14737091
>>>> mds.0.214448:248532 dirty|omap|data_digest s 0 uv 14737091 dd
>>>> ffffffff)","shards":[{"osd":3,"errors":[],"size":0,"omap_digest":"0x6748eef3","data_digest":"0xffffffff"},{"osd":10,"errors":[],"size":0,"omap_digest":"0xa791d5a4","data_digest":"0xffffffff"},{"osd":11,"errors":[],"size":0,"omap_digest":"0x53f46ab0","data_digest":"0xffffffff"},{"osd":23,"errors":[],"size":0,"omap_digest":"0x97b80594","data_digest":"0xffffffff"}]}]}
>>>>
>>>>
>>>> So from this, the problem object in PG 2.9 appears to be
>>>> 10000411194.00000000.
>>>>
>>>> This is what I see on the filesystem on the 4 OSDs this PG resides on:
>>>>
>>>> -rw-r--r--. 1 ceph ceph 0 Apr 27 12:31
>>>> /var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> -rw-r--r--. 1 ceph ceph 0 Apr 15 22:05
>>>> /var/lib/ceph/osd/ceph-10/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> -rw-r--r--. 1 ceph ceph 0 Apr 15 22:07
>>>> /var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> -rw-r--r--. 1 ceph ceph 0 Apr 16 03:58
>>>> /var/lib/ceph/osd/ceph-23/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>>
>>>> The extended attributes are as follows, although I have no idea what
>>>> any of them mean.
>>>>
>>>> # file:
>>>> var/lib/ceph/osd/ceph-11/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> user.ceph._=0sDwj5AAAABAM1AAAAAAAAABQAAAAxMDAwMDQxMTE5NC4wMDAwMDAwMP7/////////6RrNGgAAAAAAAgAAAAAAAAAGAxwAAAACAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAABUn4QAAAAAAu4IAAK4m4QAAAAAAu4IAAAICFQAAAAIAAAAAAAAAAOSZDAAAAAAAsEUDAAAAAAAAAAAAjUoIWUgWsQQCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVJ+EAAAAAAAAAAAAAAAAAABwAAACNSghZESm8BP///w==
>>>> user.ceph._@1=0s//////8=
>>>> user.ceph._layout=0sAgIYAAAAAAAAAAAAAAAAAAAA//////////8AAAAA
>>>> user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAAAAAgIjAAAAjxFBAAABAAAPAAAAdHViZWFtYXRldXIubmV0qdgAAAAAAAACAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAAAAAAAAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgAAAAAAAAACAh4AAAAQDUEAAAEAAAoAAAB3cC1jb250ZW50NAMAAAAAAAACAhgAAAANDUEAAAEAAAQAAABodG1sIAEAAAAAAAACAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAAAAAAAAAICMwAAADkAAAAAAQ==
>>>> user.ceph._parent@1
>>>> =0sAAAfAAAANDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCAAAAAAAAgIcAAAAAQAAAAAAAAAIAAAAcHJvamVjdHPBAgcAAAAAAAIAAAAAAAAAAAAAAA==
>>>> user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
>>>> user.cephos.seq=0sAQEQAAAAgcAqFAAAAAAAAAAAAgAAAAA=
>>>> user.cephos.spill_out=0sMAA=
>>>>
>>>> # file:
>>>> var/lib/ceph/osd/ceph-3/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> user.ceph._=0sDwj5AAAABAM1AAAAAAAAABQAAAAxMDAwMDQxMTE5NC4wMDAwMDAwMP7/////////6RrNGgAAAAAAAgAAAAAAAAAGAxwAAAACAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAABUn4QAAAAAAu4IAAK4m4QAAAAAAu4IAAAICFQAAAAIAAAAAAAAAAOSZDAAAAAAAsEUDAAAAAAAAAAAAjUoIWUgWsQQCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVJ+EAAAAAAAAAAAAAAAAAABwAAACNSghZESm8BP///w==
>>>> user.ceph._@1=0s//////8=
>>>> user.ceph._layout=0sAgIYAAAAAAAAAAAAAAAAAAAA//////////8AAAAA
>>>> user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAAAAAgIjAAAAjxFBAAABAAAPAAAAdHViZWFtYXRldXIubmV0qdgAAAAAAAACAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAAAAAAAAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgAAAAAAAAACAh4AAAAQDUEAAAEAAAoAAAB3cC1jb250ZW50NAMAAAAAAAACAhgAAAANDUEAAAEAAAQAAABodG1sIAEAAAAAAAACAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAAAAAAAAAICMwAAADkAAAAAAQ==
>>>> user.ceph._parent@1
>>>> =0sAAAfAAAANDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCAAAAAAAAgIcAAAAAQAAAAAAAAAIAAAAcHJvamVjdHPBAgcAAAAAAAIAAAAAAAAAAAAAAA==
>>>> user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
>>>> user.cephos.seq=0sAQEQAAAAZaQ9GwAAAAAAAAAAAgAAAAA=
>>>> user.cephos.spill_out=0sMAA=
>>>>
>>>> # file:
>>>> var/lib/ceph/osd/ceph-10/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> user.ceph._=0sDwj5AAAABAM1AAAAAAAAABQAAAAxMDAwMDQxMTE5NC4wMDAwMDAwMP7/////////6RrNGgAAAAAAAgAAAAAAAAAGAxwAAAACAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAABUn4QAAAAAAu4IAAK4m4QAAAAAAu4IAAAICFQAAAAIAAAAAAAAAAOSZDAAAAAAAsEUDAAAAAAAAAAAAjUoIWUgWsQQCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVJ+EAAAAAAAAAAAAAAAAAABwAAACNSghZESm8BP///w==
>>>> user.ceph._@1=0s//////8=
>>>> user.ceph._layout=0sAgIYAAAAAAAAAAAAAAAAAAAA//////////8AAAAA
>>>> user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAAAAAgIjAAAAjxFBAAABAAAPAAAAdHViZWFtYXRldXIubmV0qdgAAAAAAAACAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAAAAAAAAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgAAAAAAAAACAh4AAAAQDUEAAAEAAAoAAAB3cC1jb250ZW50NAMAAAAAAAACAhgAAAANDUEAAAEAAAQAAABodG1sIAEAAAAAAAACAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAAAAAAAAAICMwAAADkAAAAAAQ==
>>>> user.ceph._parent@1
>>>> =0sAAAfAAAANDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCAAAAAAAAgIcAAAAAQAAAAAAAAAIAAAAcHJvamVjdHPBAgcAAAAAAAIAAAAAAAAAAAAAAA==
>>>> user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
>>>> user.cephos.seq=0sAQEQAAAA1T1dEQAAAAAAAAAAAgAAAAA=
>>>> user.cephos.spill_out=0sMAA=
>>>>
>>>> # file:
>>>> var/lib/ceph/osd/ceph-23/current/2.9_head/DIR_9/DIR_E/DIR_A/DIR_1/10000411194.00000000__head_1ACD1AE9__2
>>>> user.ceph._=0sDwj5AAAABAM1AAAAAAAAABQAAAAxMDAwMDQxMTE5NC4wMDAwMDAwMP7/////////6RrNGgAAAAAAAgAAAAAAAAAGAxwAAAACAAAAAAAAAP////8AAAAAAAAAAP//////////AAAAABUn4QAAAAAAu4IAAK4m4QAAAAAAu4IAAAICFQAAAAIAAAAAAAAAAOSZDAAAAAAAsEUDAAAAAAAAAAAAjUoIWUgWsQQCAhUAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAVJ+EAAAAAAAAAAAAAAAAAABwAAACNSghZESm8BP///w==
>>>> user.ceph._@1=0s//////8=
>>>> user.ceph._layout=0sAgIYAAAAAAAAAAAAAAAAAAAA//////////8AAAAA
>>>> user.ceph._parent=0sBQRPAQAAlBFBAAABAAAIAAAAAgIjAAAAjxFBAAABAAAPAAAAdHViZWFtYXRldXIubmV0qdgAAAAAAAACAh0AAAB/EUEAAAEAAAkAAAB3cC1yb2NrZXREAAAAAAAAAAICGQAAABYNQQAAAQAABQAAAGNhY2hlUgAAAAAAAAACAh4AAAAQDUEAAAEAAAoAAAB3cC1jb250ZW50NAMAAAAAAAACAhgAAAANDUEAAAEAAAQAAABodG1sIAEAAAAAAAACAikAAADagTMAAAEAABUAAABuZ2lueC1waHA3LWNsdmdmLWRhdGGJAAAAAAAAAAICMwAAADkAAAAAAQ==
>>>> user.ceph._parent@1
>>>> =0sAAAfAAAANDg4LTU3YjI2NTdmMmZhMTMtbWktcHJveWVjdG8tMXSQCAAAAAAAAgIcAAAAAQAAAAAAAAAIAAAAcHJvamVjdHPBAgcAAAAAAAIAAAAAAAAAAAAAAA==
>>>> user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
>>>> user.cephos.seq=0sAQEQAAAADiM7AAAAAAAAAAAAAgAAAAA=
>>>> user.cephos.spill_out=0sMAA=
>>>>
>>>>
>>>> For the metadata damage issue, I can get the list of damaged inodes
>>>> with the command below.
>>>>
>>>> $ ceph tell mds.0 damage ls | python -m "json.tool"
>>>> [
>>>>     {
>>>>         "damage_type": "dir_frag",
>>>>         "frag": "*",
>>>>         "id": 5129156,
>>>>         "ino": 1099556021325
>>>>     },
>>>>     {
>>>>         "damage_type": "dir_frag",
>>>>         "frag": "*",
>>>>         "id": 8983971,
>>>>         "ino": 1099548098243
>>>>     },
>>>>     {
>>>>         "damage_type": "dir_frag",
>>>>         "frag": "*",
>>>>         "id": 33278608,
>>>>         "ino": 1099548257921
>>>>     },
>>>>     {
>>>>         "damage_type": "dir_frag",
>>>>         "frag": "*",
>>>>         "id": 33455691,
>>>>         "ino": 1099548271575
>>>>     },
>>>>     {
>>>>         "damage_type": "dir_frag",
>>>>         "frag": "*",
>>>>         "id": 38203788,
>>>>         "ino": 1099548134708
>>>>     },
>>>>     ...
>>>>
>>>> All of the inodes (approximately 800 of them) are for various
>>>> directories within a WordPress cache directory.
>>>> I ran rm -rf on each of the directories, as I do not need the content.
>>>> The files inside were removed, but the directories themselves cannot be
>>>> removed: rmdir reports they are not empty, even though ls shows no
>>>> files.
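>>>>
>>>> (For reference, this is roughly how I mapped the damaged inos back to
>>>> paths; the mount point is an example, and find walks the whole tree
>>>> once per ino, so it is slow:)
>>>>
>>>> # ceph tell mds.0 damage ls | \
>>>>     python -c 'import json,sys; sys.stdout.write("\n".join(str(d["ino"]) for d in json.load(sys.stdin))+"\n")' | \
>>>>     while read ino; do find /mnt/cephfs -inum "$ino"; done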
>>>>
>>>> I'm not sure if these two issues are related to each other. They were
>>>> noticed within a day of each other. I think the metadata damage error
>>>> appeared before the scrub error.
>>>>
>>>> I'm at a bit of a loss with how to proceed and I don't want to make things
>>>> worse.
>>>>
>>>> I'd really appreciate any help that anyone can give to try and resolve
>>>> these problems.
>>>>
>>>> Thanks
>>>>
>>>> J
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com