Hi,
On 14/10/2015 06:45, Gregory Farnum wrote:
>> Ok, however during my tests I had been careful to replace the correct
>> file with a bad file of *exactly* the same size (the content of the
>> file was just a short string and I replaced it with a string of
>> exactly the same size). I had been careful to undo the mtime update
>> too (I had restored the mtime the file had before the change). Despite
>> this, the "repair" command worked well. Tested twice: 1. with the change
>> on the primary OSD and 2. on the secondary OSD. And I was surprised
>> because I thought test 1 (on the primary OSD) would fail.
>
> Hm. I'm a little confused by that, actually. Exactly what was the path
> to the files you changed, and do you have before-and-after comparisons
> on the content and metadata?
I don't remember exactly the steps I followed, so I have just retried
today. Here is the procedure. I have a healthy cluster with 3 nodes (Ubuntu
Trusty) running Ceph Hammer (version 0.94.3). I have mounted CephFS on
/mnt on one of the nodes.
~# cat /mnt/file.txt # yes it's a little file. ;)
123456
~# ls -i /mnt/file.txt
1099511627776 /mnt/file.txt
~# printf "%x\n" 1099511627776
10000000000
~# rados -p data ls - | grep 10000000000
10000000000.00000000
So this is the name of the object that backs my "file.txt".
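
The two steps above (inode number to hex, then the object name) can be
folded into one small shell function. This is only a sketch: the name
cephfs_object_name is made up for illustration, and the ".%08x" suffix is
assumed, from the "rados ls" output above, to be the index of the object
within the file (0 for the first and, here, only object).

```shell
#!/bin/sh
# Hypothetical helper: build the RADOS object name for a CephFS file
# from its inode number (decimal) and an object index (default 0).
# Assumption: the name is "<inode in hex>.<index as 8 hex digits>",
# which matches the object seen in the "rados ls" output above.
cephfs_object_name() {
    printf '%x.%08x\n' "$1" "${2:-0}"
}

# Inode number reported by "ls -i /mnt/file.txt"
cephfs_object_name 1099511627776
```

With the inode number from "ls -i" this prints 10000000000.00000000, the
same name that "rados ls" returned.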
~# ceph osd map data 10000000000.00000000
osdmap e76 pool 'data' (3) object '10000000000.00000000' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1)
So my object is stored on the primary OSD, OSD-1, and on the secondary
OSD, OSD-2. I open a terminal on the node that hosts the primary OSD and
then:
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456
~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
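
The path above can be rebuilt from the pieces printed by "ceph osd map"
(pool id 3, PG 3.30, hash f0b56f30). The function below is a sketch only:
the name filestore_object_path is hypothetical, and the layout is simply
copied from the path seen on this OSD, so it should not be read as a
general rule for how the OSD names its files.

```shell
#!/bin/sh
# Hypothetical helper: rebuild the on-disk file name of an object replica.
# Arguments: OSD id, PG id, object name, object hash (hex), pool id.
# Assumption: the flat layout visible in the path above, i.e.
#   .../ceph-<osd>/current/<pg>_head/<object>__head_<HASH>__<pool>
# with the hash written in upper case.
filestore_object_path() {
    hash=$(printf '%s' "$4" | tr 'a-f' 'A-F')
    printf '/var/lib/ceph/osd/ceph-%s/current/%s_head/%s__head_%s__%s\n' \
        "$1" "$2" "$3" "$hash" "$5"
}

# Values taken from the "ceph osd map" output: osd.1, pg 3.30, pool 3
filestore_object_path 1 3.30 10000000000.00000000 f0b56f30 3
```

Calling it with 2 instead of 1 as the first argument gives the
corresponding path under ceph-2, assuming the secondary OSD uses the same
layout.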
Now I change the content with the script below, called "change_content.sh",
which restores the original mtime after the change:
-----------------------------
#!/bin/sh
f="$1"
f_tmp="${f}.tmp"
content="$2"
cp --preserve=all "$f" "$f_tmp"
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # to restore the mtime after the change
rm "$f_tmp"
-----------------------------
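
To check that such a change really leaves the size and the mtime untouched,
the same steps can be replayed on a scratch file and compared with "stat".
This is a sketch assuming GNU coreutils (stat -c, touch -d, cp --preserve);
the helper name size_and_mtime is made up for illustration. It works on a
temporary file, not on an OSD.

```shell
#!/bin/sh
# Hypothetical helper: print "<size in bytes> <mtime in epoch seconds>".
# Assumption: GNU stat, which supports -c with %s and %Y.
size_and_mtime() {
    stat -c '%s %Y' "$1"
}

# Scratch file standing in for the object file (not a real OSD path).
f=$(mktemp)
echo "123456" >"$f"                  # 6 characters + newline = 7 bytes
touch -d '2015-10-15 03:46:00' "$f"  # give it an mtime in the past
before=$(size_and_mtime "$f")

# Same steps as change_content.sh
cp --preserve=all "$f" "$f.tmp"
echo "ABCDEF" >"$f"
touch -r "$f.tmp" "$f"               # restore the mtime after the change
rm "$f.tmp"

after=$(size_and_mtime "$f")
if [ "$before" = "$after" ]; then
    echo "size and mtime unchanged"
fi
rm -f "$f"
```

Note that "touch -r" restores the mtime but not the ctime, which still
records the modification.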
So, let's go: I replace the content with new content of exactly
the same size ("ABCDEF" in this example):
~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3 ABCDEF
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
ABCDEF
~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
Now the secondary OSD holds the good version of the object and
the primary holds a bad version. I launch a "ceph pg repair":
~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair
# I'm on the node of the primary OSD and the file below has been repaired correctly.
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456
As you can see, the repair command has worked well.
Maybe my little test is too trivial?
>> Greg, if I understand you well, I shouldn't have too much confidence in
>> the "ceph pg repair" command, is it correct?
>>
>> But, if yes, what is the good way to repair a PG?
>
> Usually what we recommend is for those with 3 copies to find the
> differing copy, delete it, and run a repair — then you know it'll
> repair from a good version. But yeah, it's not as reliable as we'd
> like it to be on its own.
I would like to be sure I understand correctly. The process could be (in
the case where size == 3):
1. On each of the 3 OSDs that hold my object, I run:
    md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
2. Normally, I will get the same result on 2 OSDs, and on the other
OSD, let's call it OSD-X, the result will be different. So, on OSD-X,
I run:
    rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*
3. And now I can run the "ceph pg repair" command without risk:
    ceph pg repair $pg_id
Is this the correct process?
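
Step 2 above (spotting the copy that differs) can be sketched as a small
filter over the md5sum results. The function name find_differing_osd and
its input format ("<osd id> <checksum>", one line per OSD) are assumptions
made for this example. It only covers the size == 3 case, and it prints
nothing when the three checksums all agree or all differ, since there is
then no 2-against-1 majority to rely on.

```shell
#!/bin/sh
# Hypothetical helper: read three lines "<osd id> <checksum>" on stdin and
# print the id of the OSD whose checksum differs from the other two.
# Exits with status 1 (and prints nothing) unless exactly two copies agree.
find_differing_osd() {
    awk '
        { osd[NR] = $1; sum[NR] = $2; count[$2]++ }
        END {
            distinct = 0
            for (s in count) distinct++
            # Need 3 copies and exactly 2 distinct checksums (a 2-vs-1 split)
            if (NR != 3 || distinct != 2) exit 1
            for (i = 1; i <= NR; i++)
                if (count[sum[i]] == 1) print osd[i]
        }
    '
}

# Made-up checksums for illustration: OSD 1 holds the differing copy.
printf '%s\n' '0 aaa' '1 bbb' '2 aaa' | find_differing_osd
```

The checksums themselves would still have to be collected by hand with
md5sum on each node, as in step 1.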
--
François Lafont