[ceph-users] Corruption by missing blocks
I've run into an issue where, after copying a file to my cephfs cluster, the md5sums no longer match. I believe I've tracked it down to some parts of the file which are missing:

$ obj_name=$(cephfs title1.mkv show_location -l 0 | grep object_name | sed -e 's/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/')
$ echo "Object name: $obj_name"
Object name: 1001120

$ file_size=$(stat title1.mkv | grep Size | awk '{ print $2 }')
$ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
File size: 20074 MiB (21049178117 Bytes)

$ blocks=$((file_size/4194304+1))
$ printf "Blocks: %d\n" $blocks
Blocks: 5019

$ for b in `seq 0 $(($blocks-1))`; do rados -p data stat ${obj_name}.`printf '%8.8x\n' $b` | grep error; done
error stat-ing data/1001120.1076: No such file or directory
error stat-ing data/1001120.11c7: No such file or directory
error stat-ing data/1001120.129c: No such file or directory
error stat-ing data/1001120.12f4: No such file or directory
error stat-ing data/1001120.1307: No such file or directory

Any ideas where to look to investigate why these blocks were never written?

Here's the current state of the cluster:

$ ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
   osdmap e22059: 24 osds: 24 up, 24 in
   pgmap v1783615: 1920 pgs: 1917 active+clean, 3 active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB / 13592 GB avail
   mdsmap e437: 1/1/1 up {0=a=up:active}

Here's my current crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host b1 {
        id -2           # do not change unnecessarily
        # weight 2.980
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 0.500
        item osd.1 weight 0.500
        item osd.2 weight 0.500
        item osd.3 weight 0.500
        item osd.4 weight 0.500
        item osd.20 weight 0.480
}
host b2 {
        id -4           # do not change unnecessarily
        # weight 4.680
        alg straw
        hash 0  # rjenkins1
        item osd.5 weight 0.500
        item osd.6 weight 0.500
        item osd.7 weight 2.200
        item osd.8 weight 0.500
        item osd.9 weight 0.500
        item osd.21 weight 0.480
}
host b3 {
        id -5           # do not change unnecessarily
        # weight 3.480
        alg straw
        hash 0  # rjenkins1
        item osd.10 weight 0.500
        item osd.11 weight 0.500
        item osd.12 weight 1.000
        item osd.13 weight 0.500
        item osd.14 weight 0.500
        item osd.22 weight 0.480
}
host b4 {
        id -6           # do not change unnecessarily
        # weight 3.480
        alg straw
        hash 0  # rjenkins1
        item osd.15 weight 0.500
        item osd.16 weight 1.000
        item osd.17 weight 0.500
        item osd.18 weight 0.500
        item osd.19 weight 0.500
        item osd.23 weight 0.480
}
pool default {
        id -1           # do not change unnecessarily
        # weight 14.620
        alg straw
        hash 0  # rjenkins1
        item b1 weight 2.980
        item b2 weight 4.680
        item b3 weight 3.480
        item b4 weight 3.480
}

# rules
rule data {
        ruleset 0
        type replicated
        min_size 2
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 2
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule rbd {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

# end crush map

Thanks,
Bryan
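In case it helps anyone reproduce the check, here's a cleaned-up sketch of the same verification loop that relies on the exit status of "rados stat" instead of grepping its output. It assumes the default 4 MiB object size and the standard <inode-hex>.<index-hex> object naming; the values are the ones from the session above:

    #!/bin/bash
    # Sketch: verify that every 4 MiB stripe object backing a cephfs file
    # actually exists in the data pool.
    pool=data
    obj_name=1001120         # hex inode prefix from 'cephfs ... show_location'
    file_size=21049178117    # file size in bytes, from stat
    blocks=$(( file_size / 4194304 + 1 ))
    for b in $(seq 0 $(( blocks - 1 ))); do
        obj=$(printf '%s.%8.8x' "$obj_name" "$b")
        # rados stat exits non-zero when the object doesn't exist
        if ! rados -p "$pool" stat "$obj" >/dev/null 2>&1; then
            echo "missing: $pool/$obj"
        fi
    done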
Re: [ceph-users] Corruption by missing blocks
On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell bstillw...@photobucket.com wrote:
> I've run into an issue where, after copying a file to my cephfs cluster,
> the md5sums no longer match. I believe I've tracked it down to some
> parts of the file which are missing:
> [...]
> Any ideas where to look to investigate why these blocks were never written?

What client are you using to write this? Is it fairly reproducible (so you could collect logs of it happening)?

Usually the only times I've seen anything like this were when either the file data was supposed to go into a pool which the client didn't have write permissions on, or when the RADOS cluster was in bad shape and so the data never got flushed to disk. Has your cluster been healthy since you started writing the file out?

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
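One quick way to rule out the permissions theory is to dump the client's capabilities and make sure the osd caps actually cover writes to the data pool. A minimal sketch (assuming the usual cephx setup; substitute whichever client key your mount actually uses):

    $ ceph auth list
    # In the output, find the entry for your client key and check the
    # "caps: [osd]" line; it needs to grant writes (e.g. "allow *" or an
    # "allow rw" grant that covers the data pool).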
Re: [ceph-users] Corruption by missing blocks
I've tried a few different ones:

1. cp to a cephfs-mounted filesystem on Ubuntu 12.10 (quantal)
2. rsync over ssh to a cephfs-mounted filesystem on Ubuntu 12.04.2 (precise)
3. scp to a cephfs-mounted filesystem on Ubuntu 12.04.2 (precise)

It's fairly reproducible, so I can collect logs for you. Which ones would you be interested in? The cluster has been in a couple of states during testing (during expansion/rebalancing and during an all active+clean state).

BTW, all the nodes are running the 0.56.4-1precise packages.

Bryan

On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum g...@inktank.com wrote:
> What client are you using to write this? Is it fairly reproducible (so
> you could collect logs of it happening)?
>
> Usually the only times I've seen anything like this were when either the
> file data was supposed to go into a pool which the client didn't have
> write permissions on, or when the RADOS cluster was in bad shape and so
> the data never got flushed to disk. Has your cluster been healthy since
> you started writing the file out?
> [...]
Re: [ceph-users] Corruption by missing blocks
Sorry, I meant kernel client or ceph-fuse?

Client logs would be enough to start with, I suppose — "debug client = 20" and "debug ms = 1" if using ceph-fuse; if using the kernel client things get trickier; I'd have to look at what logging is available without the debugfs stuff being enabled. :/

-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell bstillw...@photobucket.com wrote:
> I've tried a few different ones:
>
> 1. cp to a cephfs-mounted filesystem on Ubuntu 12.10 (quantal)
> 2. rsync over ssh to a cephfs-mounted filesystem on Ubuntu 12.04.2 (precise)
> 3. scp to a cephfs-mounted filesystem on Ubuntu 12.04.2 (precise)
>
> It's fairly reproducible, so I can collect logs for you. Which ones
> would you be interested in?
> [...]
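For reference, the ceph-fuse logging Greg describes can be enabled by adding those two options to a [client] section of ceph.conf on the machine doing the copy. A minimal sketch (the log file path is only an example):

    # Append a [client] section enabling verbose client logging;
    # adjust the log path to taste.
    cat >> /etc/ceph/ceph.conf <<'EOF'
    [client]
        debug client = 20
        debug ms = 1
        log file = /var/log/ceph/client.$name.$pid.log
    EOF

Then remount with ceph-fuse and reproduce the copy so the log captures the failing writes.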
Re: [ceph-users] Corruption by missing blocks
On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:
> On Tue, 23 Apr 2013, Bryan Stillwell wrote:
>> I'm testing this now, but while going through the logs I saw something
>> that might have something to do with this:
>>
>> Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap epoch 22146 off 102 (88021e0dc802 of 88021e0dc79c-88021e0dc802)
>
> Oh, that's not right... What kernel version is this? Which ceph version?

$ uname -a
Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

$ ceph -v
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)

Bryan
Re: [ceph-users] Corruption by missing blocks
On Tue, 23 Apr 2013, Bryan Stillwell wrote:
> On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:
>> Oh, that's not right... What kernel version is this? Which ceph version?
>
> $ uname -a
> Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

Oh, that's a sufficiently old kernel that we don't support. 3.4 or later is considered stable. You should be able to get recent mainline kernels from an ubuntu ppa...

sage

> $ ceph -v
> ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)
>
> Bryan
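For anyone following along, the mainline builds live under http://kernel.ubuntu.com/~kernel-ppa/mainline/. A rough sketch of the upgrade; the .deb name below is a placeholder, so pick the actual files for your release from the PPA index:

    # Download the image .deb for the kernel you picked (<version> is a
    # placeholder), then install and reboot.
    wget http://kernel.ubuntu.com/~kernel-ppa/mainline/<version>/linux-image-<version>_amd64.deb
    sudo dpkg -i linux-image-*.deb
    sudo reboot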
Re: [ceph-users] Corruption by missing blocks
On Tue, Apr 23, 2013 at 5:45 PM, Sage Weil s...@inktank.com wrote:
> Oh, that's a sufficiently old kernel that we don't support. 3.4 or later
> is considered stable. You should be able to get recent mainline kernels
> from an ubuntu ppa...

It looks like Canonical released a 3.5.0 kernel as a security update to precise that I'll give a try.

Bryan
Re: [ceph-users] Corruption by missing blocks
On Tue, Apr 23, 2013 at 5:54 PM, Gregory Farnum g...@inktank.com wrote:
> On Tue, Apr 23, 2013 at 4:45 PM, Sage Weil s...@inktank.com wrote:
>> Oh, that's a sufficiently old kernel that we don't support. 3.4 or later
>> is considered stable. You should be able to get recent mainline kernels
>> from an ubuntu ppa...
>
> By which he means that could have caused the trouble and there are some
> osdmap decoding problems which are fixed in later kernels. :) I'd
> forgotten about these problems, although fortunately they're not
> consistent. But especially for CephFS you'll want to stick with userspace
> rather than kernelspace for a while if you aren't in the habit of staying
> very up-to-date.

Thanks, that's good to know. :)

The first copy test using fuse finished and the MD5s match up! I'm going to do some more testing overnight, but this seems to be the cause.

Thanks for the help!

Bryan
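For the archives: switching a client from the kernel mount to userspace is straightforward. A sketch assuming the monitor address from earlier in the thread and an admin keyring already in place on the client (/mnt/cephfs is a placeholder mount point):

    # Unmount the kernel client, then mount the same tree with ceph-fuse.
    sudo umount /mnt/cephfs
    sudo ceph-fuse -m 172.24.88.50:6789 /mnt/cephfs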