Re: [ceph-users] CephFS file to rados object mapping

Francois Lafont Wed, 14 Oct 2015 19:21:04 -0700

Hi,

On 14/10/2015 06:45, Gregory Farnum wrote:


>> OK, however, during my tests I was careful to replace the correct
>> file with a bad file of *exactly* the same size (the file's content
>> was just a short string, and I replaced it with a string of exactly
>> the same length). I was careful to undo the mtime update too (I
>> restored the file's original mtime after the change). Despite this,
>> the "repair" command worked well. I tested twice: 1. with the change
>> on the primary OSD and 2. with the change on the secondary OSD. I was
>> surprised, because I thought test 1 (on the primary OSD) would fail.
> 
> Hm. I'm a little confused by that, actually. Exactly what was the path
> to the files you changed, and do you have before-and-after comparisons
> on the content and metadata?

I didn't remember exactly the process I had followed, so I retried it
today. Here is my process. I have a healthy cluster with 3 nodes (Ubuntu
Trusty) running Ceph Hammer (version 0.94.3). I have mounted CephFS on
/mnt on one of the nodes.

~# cat /mnt/file.txt # yes it's a little file. ;)
123456

~# ls -i /mnt/file.txt 
1099511627776 /mnt/file.txt

~# printf "%x\n" 1099511627776
10000000000

~# rados -p data ls - | grep 10000000000
10000000000.00000000

Now I have the name of the object mapped to my "file.txt".
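The inode-to-object-name step above can be scripted. A minimal sketch, assuming the default CephFS file layout (4 MiB objects, stripe count 1), in which data objects are named <inode in hex>.<object index as 8 hex digits>; the function name object_name_for is hypothetical:

```shell
# Hypothetical helper: print the RADOS object name holding byte OFFSET of a
# CephFS file whose inode number is INODE. Assumes the default layout:
# 4 MiB objects, stripe count 1 (so object index = offset / object size).
object_name_for() {
    inode="$1"
    offset="${2:-0}"          # byte offset in the file, default: start of file
    object_size=4194304       # assumption: default 4 MiB object size
    # <inode in hex>.<object index as 8 hex digits>
    printf '%x.%08x\n' "$inode" $((offset / object_size))
}

# Usage on the file from this thread (inode number as printed by `ls -i`):
# object_name_for "$(stat -c %i /mnt/file.txt)"
```

For a file with a non-default layout (different object size or striping), the index computation above would not hold.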

~# ceph osd map data 10000000000.00000000
osdmap e76 pool 'data' (3) object '10000000000.00000000' -> pg 3.f0b56f30 (3.30) -> up ([1,2], p1) acting ([1,2], p1)
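The on-disk path used below can be derived from this `ceph osd map` line. A sketch under stated assumptions: the FileStore layout seen in this thread (current/<pgid>_head/<name>__head_<HASH>__<pool>), the hash being the uppercased part of the raw PG id after the dot, and a hypothetical helper name, filestore_path. On PGs holding many objects FileStore may nest files in DIR_* subdirectories, so searching under the PG directory with `find` is safer there.

```shell
# Hypothetical helper: turn one line of `ceph osd map` output into the
# FileStore path of the object on a given OSD (layout assumed from this thread).
filestore_path() {
    map_line="$1"   # e.g. the output of: ceph osd map data <object>
    osd_id="$2"
    # "object 'NAME'" -> NAME
    name=$(printf '%s\n' "$map_line" | sed -n "s/.*object '\([^']*\)'.*/\1/p")
    # "-> pg 3.f0b56f30 (3.30)" -> raw PG "3.f0b56f30" and PG id "3.30"
    raw_pg=$(printf '%s\n' "$map_line" | sed -n 's/.*-> pg \([0-9]*\.[0-9a-f]*\) .*/\1/p')
    pgid=$(printf '%s\n' "$map_line" | sed -n 's/.*-> pg [^ ]* (\([^)]*\)).*/\1/p')
    pool=${raw_pg%%.*}                                          # "3"
    hash=$(printf '%s\n' "${raw_pg#*.}" | tr 'a-f' 'A-F')       # "F0B56F30"
    printf '/var/lib/ceph/osd/ceph-%s/current/%s_head/%s__head_%s__%s\n' \
        "$osd_id" "$pgid" "$name" "$hash" "$pool"
}
```

Usage: `filestore_path "$(ceph osd map data 10000000000.00000000)" 1`.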

So my object is stored on the primary OSD, OSD-1, and on the secondary
OSD, OSD-2. I open a terminal on the node hosting the primary OSD-1 and
then:

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

Now I change the content with the following script, "change_content.sh",
which preserves the file's mtime across the change:

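-----------------------------
#!/bin/sh
# Usage: ./change_content.sh <file> <new content>
# Overwrites <file> in place, then restores its original atime/mtime.
# (ctime is still updated; it cannot be set from userspace.)
set -e

f="$1"
f_tmp="${f}.tmp"
content="$2"
# Keep a copy with the original timestamps to use as a reference.
cp --preserve=all "$f" "$f_tmp"
# Truncate and rewrite the same inode, so xattrs are kept.
# Note: echo appends a trailing newline.
echo "$content" >"$f"
touch -r "$f_tmp" "$f" # restore atime/mtime from the reference copy
rm "$f_tmp"
-----------------------------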
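To double-check that the script really leaves the size and mtime unchanged, its steps can be replayed on a scratch file. A minimal sketch, assuming GNU coreutils (`stat -c`, `touch -d`, `cp --preserve`); the scratch file and the fixed date are arbitrary stand-ins, not the real object:

```shell
# Replays change_content.sh's steps on a scratch file and compares
# size and mtime before and after. Requires GNU coreutils.
set -e
dir=$(mktemp -d)
f="$dir/object"
printf '123456\n' >"$f"
touch -d '2015-10-15 03:46:00' "$f"   # arbitrary fixed mtime for the demo
before=$(stat -c '%s %Y' "$f")        # "<size in bytes> <mtime as epoch seconds>"

# The same four steps as change_content.sh:
cp --preserve=all "$f" "$f.tmp"
echo "ABCDEF" >"$f"
touch -r "$f.tmp" "$f"
rm "$f.tmp"

after=$(stat -c '%s %Y' "$f")
content=$(cat "$f")
rm -r "$dir"
echo "before: $before"
echo "after:  $after"
```

Both lines should show the same size (7 bytes, newline included) and the same mtime.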

Let's go: I replace the content with new content of exactly the same
size ("ABCDEF" in this example):

~# ./change_content.sh /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3 ABCDEF

~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
ABCDEF

~# ll /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
-rw-r--r-- 1 root root 7 Oct 15 03:46 /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3

At this point the secondary OSD holds the good version of the object and
the primary holds a bad one. Next, I run "ceph pg repair":

~# ceph pg repair 3.30
instructing pg 3.30 on osd.1 to repair

# I'm on the primary OSD's node, and the file below has been repaired correctly.
~# cat /var/lib/ceph/osd/ceph-1/current/3.30_head/10000000000.00000000__head_F0B56F30__3
123456

As you can see, the repair command worked well.
Maybe my little test is too trivial?

>> Greg, if I understand you correctly, I shouldn't put too much confidence
>> in the "ceph pg repair" command, is that correct?
>>
>> But if so, what is the right way to repair a PG?
> 
> Usually what we recommend is for those with 3 copies to find the
> differing copy, delete it, and run a repair — then you know it'll
> repair from a good version. But yeah, it's not as reliable as we'd
> like it to be on its own.

I'd like to make sure I understand correctly. Would the process be as
follows (in the case where size == 3)?

1. On each of the 3 OSDs where my object is stored, run:

    md5sum /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

2. Normally, 2 of the OSDs will give the same result and the third one,
let's call it OSD-X, will give a different one. So, on OSD-X, I run:

    rm /var/lib/ceph/osd/ceph-$id/current/${pg_id}_head/${object_name}*

3. Now I can run the "ceph pg repair" command without risk:

    ceph pg repair $pg_id
 
Is that the correct process?
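Step 2 (spotting the copy that disagrees) can be sketched as follows, assuming the md5sum results have already been collected from each OSD node and rewritten as "osd_id checksum" lines; find_odd_replica is a hypothetical helper, and it assumes exactly one bad copy out of three:

```shell
# Hypothetical helper for step 2: read "osd_id checksum" pairs on stdin
# (one per line) and print every OSD whose checksum appears only once.
find_odd_replica() {
    awk '
        { sum[$1] = $2; count[$2]++ }   # remember each OSD checksum, tally checksums
        END {
            for (osd in sum)
                if (count[sum[osd]] == 1) print osd
        }
    '
}

# Usage (placeholder checksums, not real md5 values):
# printf '%s\n' '1 aaaa' '2 bbbb' '5 bbbb' | find_odd_replica
```

If all three checksums agree it prints nothing; if all three differ it prints every OSD, and there is no majority to trust.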

-- 
François Lafont
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
