[ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've run into an issue where after copying a file to my cephfs cluster
the md5sums no longer match.  I believe I've tracked it down to some
parts of the file which are missing:

$ obj_name=$(cephfs title1.mkv show_location -l 0 | grep object_name |
  sed -e "s/.*:\W*\([0-9a-f]*\)\.[0-9a-f]*/\1/")
$ echo Object name: $obj_name
Object name: 1001120

$ file_size=$(stat title1.mkv | grep Size | awk '{ print $2 }')
$ printf "File size: %d MiB (%d Bytes)\n" $(($file_size/1048576)) $file_size
File size: 20074 MiB (21049178117 Bytes)

$ blocks=$((file_size/4194304+1))
$ printf "Blocks: %d\n" $blocks
Blocks: 5019
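
(As a sanity check on the arithmetic, assuming the default 4 MiB CephFS
object size: the file covers 5018 full objects plus a ~2.1 MiB tail, so
objects 00000000 through 0000139a should all exist.)

$ echo $((21049178117 / 4194304)) full objects, $((21049178117 % 4194304)) byte tail
5018 full objects, 2160645 byte tail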

$ for b in `seq 0 $(($blocks-1))`; do rados -p data stat \
    ${obj_name}.`printf '%8.8x\n' $b` | grep error; done
 error stat-ing data/1001120.1076: No such file or directory
 error stat-ing data/1001120.11c7: No such file or directory
 error stat-ing data/1001120.129c: No such file or directory
 error stat-ing data/1001120.12f4: No such file or directory
 error stat-ing data/1001120.1307: No such file or directory
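
(For reference, each missing object can be mapped to its placement group
and OSD set with ceph osd map -- a sketch, using the zero-padded form of
the name that the loop above generates:)

$ ceph osd map data 1001120.00001076   # prints the pg and up/acting OSDs for that object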


Any ideas where to look to investigate what caused these blocks to not
be written?

Here's the current state of the cluster:

ceph -s
   health HEALTH_OK
   monmap e1: 1 mons at {a=172.24.88.50:6789/0}, election epoch 1, quorum 0 a
   osdmap e22059: 24 osds: 24 up, 24 in
pgmap v1783615: 1920 pgs: 1917 active+clean, 3
active+clean+scrubbing+deep; 4667 GB data, 9381 GB used, 4210 GB /
13592 GB avail
   mdsmap e437: 1/1/1 up {0=a=up:active}

Here's my current crushmap:

# begin crush map

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23

# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 pool

# buckets
host b1 {
id -2   # do not change unnecessarily
# weight 2.980
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.500
item osd.1 weight 0.500
item osd.2 weight 0.500
item osd.3 weight 0.500
item osd.4 weight 0.500
item osd.20 weight 0.480
}
host b2 {
id -4   # do not change unnecessarily
# weight 4.680
alg straw
hash 0  # rjenkins1
item osd.5 weight 0.500
item osd.6 weight 0.500
item osd.7 weight 2.200
item osd.8 weight 0.500
item osd.9 weight 0.500
item osd.21 weight 0.480
}
host b3 {
id -5   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.10 weight 0.500
item osd.11 weight 0.500
item osd.12 weight 1.000
item osd.13 weight 0.500
item osd.14 weight 0.500
item osd.22 weight 0.480
}
host b4 {
id -6   # do not change unnecessarily
# weight 3.480
alg straw
hash 0  # rjenkins1
item osd.15 weight 0.500
item osd.16 weight 1.000
item osd.17 weight 0.500
item osd.18 weight 0.500
item osd.19 weight 0.500
item osd.23 weight 0.480
}
pool default {
id -1   # do not change unnecessarily
# weight 14.620
alg straw
hash 0  # rjenkins1
item b1 weight 2.980
item b2 weight 4.680
item b3 weight 3.480
item b4 weight 3.480
}

# rules
rule data {
ruleset 0
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule metadata {
ruleset 1
type replicated
min_size 2
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
rule rbd {
ruleset 2
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map


Thanks,
Bryan


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Gregory Farnum
On Tue, Apr 23, 2013 at 11:38 AM, Bryan Stillwell
bstillw...@photobucket.com wrote:
 I've run into an issue where after copying a file to my cephfs cluster
 the md5sums no longer match.  I believe I've tracked it down to some
 parts of the file which are missing.

 Any ideas where to look to investigate what caused these blocks to not
 be written?

What client are you using to write this? Is it fairly reproducible (so
you could collect logs of it happening)?

Usually the only times I've seen anything like this were when either
the file data was supposed to go into a pool which the client didn't
have write permissions on, or when the RADOS cluster was in bad shape
and so the data never got flushed to disk. Has your cluster been
healthy since you started writing the file out?
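
If it helps rule out the permissions angle, a quick check (assuming
cephx is enabled; substitute whatever client name you actually mount
with) is to compare that client's OSD caps against the pool list:

$ ceph auth list        # look at the caps: [osd] line for your client
$ ceph osd lspools      # lists pool ids/names (CephFS file data goes to 'data' by default)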
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
I've tried a few different ones:

1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)

It's fairly reproducible, so I can collect logs for you.  Which ones
would you be interested in?

The cluster has been in a couple of states during testing (both during
expansion/rebalancing and while everything was active+clean).

BTW, all the nodes are running with the 0.56.4-1precise packages.
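
In case it matters, the mount output shows which client each box is
using (the kernel client appears as type ceph; ceph-fuse shows up as a
FUSE mount):

$ mount | grep -i ceph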

Bryan

On Tue, Apr 23, 2013 at 12:56 PM, Gregory Farnum g...@inktank.com wrote:
 What client are you using to write this? Is it fairly reproducible (so
 you could collect logs of it happening)?

 Usually the only times I've seen anything like this were when either
 the file data was supposed to go into a pool which the client didn't
 have write permissions on, or when the RADOS cluster was in bad shape
 and so the data never got flushed to disk. Has your cluster been
 healthy since you started writing the file out?
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com



Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Gregory Farnum
Sorry, I meant kernel client or ceph-fuse? Client logs would be enough
to start with, I suppose: "debug client = 20" and "debug ms = 1" if
using ceph-fuse. If using the kernel client things get trickier; I'd
have to look at what logging is available without the debugfs stuff
being enabled. :/
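
For ceph-fuse, that would look something like this in ceph.conf on the
client box (a sketch; the log file line is just an example of where to
send the output, not a required setting):

[client]
    debug client = 20
    debug ms = 1
    log file = /var/log/ceph/client.$name.$pid.log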
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 23, 2013 at 3:00 PM, Bryan Stillwell
bstillw...@photobucket.com wrote:
 I've tried a few different ones:

 1. cp to cephfs mounted filesystem on Ubuntu 12.10 (quantal)
 2. rsync over ssh to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)
 3. scp to cephfs mounted filesystem on Ubuntu 12.04.2 (precise)

 It's fairly reproducible, so I can collect logs for you.  Which ones
 would you be interested in?

 The cluster has been in a couple of states during testing (both during
 expansion/rebalancing and while everything was active+clean).

 BTW, all the nodes are running with the 0.56.4-1precise packages.

 Bryan


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:24 PM, Sage Weil s...@inktank.com wrote:

 On Tue, 23 Apr 2013, Bryan Stillwell wrote:
  I'm testing this now, but while going through the logs I saw something
  that might have something to do with this:
 
  Apr 23 16:35:28 a1 kernel: [692455.496594] libceph: corrupt inc osdmap
  epoch 22146 off 102 (88021e0dc802 of
  88021e0dc79c-88021e0dc802)

 Oh, that's not right...  What kernel version is this?  Which ceph version?

$ uname -a
Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
x86_64 x86_64 x86_64 GNU/Linux
$ ceph -v
ceph version 0.56.4 (63b0f854d1cef490624de5d6cf9039735c7de5ca)

Bryan


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Sage Weil
On Tue, 23 Apr 2013, Bryan Stillwell wrote:
 $ uname -a
 Linux a1 3.2.0-39-generic #62-Ubuntu SMP Thu Feb 28 00:28:53 UTC 2013
 x86_64 x86_64 x86_64 GNU/Linux

Oh, that's an old enough kernel that we don't support it.  3.4 or later
is considered stable.  You should be able to get recent mainline kernels
from an Ubuntu PPA...
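
On precise the easiest routes are probably the LTS backport kernel or a
mainline build from http://kernel.ubuntu.com/~kernel-ppa/mainline/ -- as
a sketch, assuming the backport meta-package is available:

$ sudo apt-get update
$ sudo apt-get install linux-generic-lts-quantal   # 3.5-series kernel for 12.04
$ sudo reboot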

sage



Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:45 PM, Sage Weil s...@inktank.com wrote:
 Oh, that's an old enough kernel that we don't support it.  3.4 or later
 is considered stable.  You should be able to get recent mainline kernels
 from an Ubuntu PPA...

It looks like Canonical released a 3.5.0 kernel for precise as a
security update, so I'll give that a try.

Bryan


Re: [ceph-users] Corruption by missing blocks

2013-04-23 Thread Bryan Stillwell
On Tue, Apr 23, 2013 at 5:54 PM, Gregory Farnum g...@inktank.com wrote:
 By which he means that could have caused the trouble and there are
 some osdmap decoding problems which are fixed in later kernels. :)
 I'd forgotten about these problems, although fortunately they're not
 consistent. But especially for CephFS you'll want to stick with
 userspace rather than kernelspace for a while if you aren't in the
 habit of staying very up-to-date.

Thanks, that's good to know.  :)

The first copy test using fuse finished and the MD5s match up!  I'm
going to do some more testing overnight, but this seems to be the
cause.
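
For anyone else following along, the fuse test is along these lines (a
sketch, assuming the monitor address from the ceph -s output above and
/mnt/cephfs as the mount point):

$ sudo mkdir -p /mnt/cephfs
$ sudo ceph-fuse -m 172.24.88.50:6789 /mnt/cephfs
$ cp title1.mkv /mnt/cephfs/ && md5sum title1.mkv /mnt/cephfs/title1.mkv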

Thanks for the help!

Bryan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com