Re: [ceph-users] scrub error: found clone without head
Hi, sorry for the late answer: trying to fix this, I tried to delete the image (rbd rm XXX); the rbd rm completed without errors, but rbd ls still displays the image. What should I do?

Here are the files for PG 3.6b:

# find /var/lib/ceph/osd/ceph-28/current/3.6b_head/ -name 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 19 mai 22:52 /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 19 mai 23:00 /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 19 mai 22:59 /var/lib/ceph/osd/ceph-28/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3

# find /var/lib/ceph/osd/ceph-23/current/3.6b_head/ -name 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 25 mars 19:18 /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 25 mars 19:33 /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 25 mars 19:34 /var/lib/ceph/osd/ceph-23/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3

# find /var/lib/ceph/osd/ceph-5/current/3.6b_head/ -name 'rb.0.15c26.238e1f29*' -print0 | xargs -r -0 ls -l
-rw-r--r-- 1 root root 4194304 25 mars 19:18 /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_6/DIR_1/DIR_C/rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3
-rw-r--r-- 1 root root 4194304 25 mars 19:33 /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_0/DIR_C/rb.0.15c26.238e1f29.3671__12d7_261CC0EB__3
-rw-r--r-- 1 root root 4194304 25 mars 19:34 /var/lib/ceph/osd/ceph-5/current/3.6b_head/DIR_B/DIR_E/DIR_A/DIR_E/rb.0.15c26.238e1f29.86a2__12d7_B10DEAEB__3

As you can see, the OSDs don't contain any other data in those PGs for this RBD image. Should I remove the files through rados? In fact, I remember that some of those files were truncated (size 0), so I manually copied the data from osd-5. That was probably a mistake.

Thanks,
Olivier

On Thursday, May 23, 2013 at 15:53 -0700, Samuel Just wrote:
Can you send the filenames in the pg directories for those 4 pgs?
-Sam
[earlier quoted messages trimmed; see the full messages below in this thread]
# tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone without head'
2013-05-22 15:43:09.308352 7f707dd64700 0 log [INF] : 9.105 scrub ok
2013-05-22 15:44:21.054893 7f707dd64700 0 log [INF] : 9.451 scrub ok
2013-05-22 15:44:52.898784 7f707cd62700 0 log [INF] : 9.784 scrub ok
2013-05-22 15:47:43.148515 7f707cd62700 0 log [INF] : 9.3c3 scrub ok
2013-05-22 15:47:45.717085 7f707dd64700 0 log [INF] : 9.3d0 scrub ok
2013-05-22 15:52:14.573815 7f707dd64700 0 log [ERR] : scrub 3.6b ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without head
2013-05-22 15:55:07.230114 7f707d563700 0 log [ERR] : scrub 3.6b 261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without head
2013-05-22 15:56:56.456242 7f707d563700 0 log [ERR] : scrub 3.6b b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without head
2013-05-22 15:57:51.667085 7f707dd64700 0 log
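For anyone trying to map these on-disk names back to RADOS objects: the sketch below (my own helper, not a Ceph tool) splits a FileStore object filename like rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3 into its components, assuming the name__snapid_hash__pool layout visible in the listings above. Note that the snap id component is hexadecimal.

```shell
# Sketch (not a Ceph tool): split a FileStore object filename of the
# assumed form <name>__<snapid-hex>_<hash-hex>__<pool> into its parts.
f='rb.0.15c26.238e1f29.9221__12d7_ADE3C16B__3'
name=${f%%__*}           # rb.0.15c26.238e1f29.9221
rest=${f#*__}            # 12d7_ADE3C16B__3
pool=${rest##*__}        # 3
snap_hex=${rest%%_*}     # 12d7
printf 'object=%s snap_id=%d pool=%s\n' "$name" "0x$snap_hex" "$pool"
# prints: object=rb.0.15c26.238e1f29.9221 snap_id=4823 pool=3
```

So the clones flagged by scrub all belong to snap id 0x12d7 (4823 decimal) in pool 3, which matches the "12d7//3" suffix in the scrub error lines.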
Re: [ceph-users] scrub error: found clone without head
Note that I still have scrub errors, but rados doesn't see those objects:

root! brontes:~# rados -p hdd3copies ls | grep '^rb.0.15c26.238e1f29'
root! brontes:~#

On Friday, May 31, 2013 at 15:36 +0200, Olivier Bonvalet wrote:
[quoted thread trimmed; it repeats the previous message verbatim]
Re: [ceph-users] scrub error: found clone without head
Not yet. I keep it for now.

On Wednesday, May 22, 2013 at 15:50 -0700, Samuel Just wrote:
rb.0.15c26.238e1f29 -- has that rbd volume been removed?
-Sam

On Wed, May 22, 2013 at 12:18 PM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
0.61-11-g3b94f03 (0.61-1.1), but the bug occurred with bobtail.

On Wednesday, May 22, 2013 at 12:00 -0700, Samuel Just wrote:
What version are you running?
-Sam

On Wed, May 22, 2013 at 11:25 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
Is it enough?

# tail -n500 -f /var/log/ceph/osd.28.log | grep -A5 -B5 'found clone without head'
2013-05-22 15:43:09.308352 7f707dd64700 0 log [INF] : 9.105 scrub ok
2013-05-22 15:44:21.054893 7f707dd64700 0 log [INF] : 9.451 scrub ok
2013-05-22 15:44:52.898784 7f707cd62700 0 log [INF] : 9.784 scrub ok
2013-05-22 15:47:43.148515 7f707cd62700 0 log [INF] : 9.3c3 scrub ok
2013-05-22 15:47:45.717085 7f707dd64700 0 log [INF] : 9.3d0 scrub ok
2013-05-22 15:52:14.573815 7f707dd64700 0 log [ERR] : scrub 3.6b ade3c16b/rb.0.15c26.238e1f29.9221/12d7//3 found clone without head
2013-05-22 15:55:07.230114 7f707d563700 0 log [ERR] : scrub 3.6b 261cc0eb/rb.0.15c26.238e1f29.3671/12d7//3 found clone without head
2013-05-22 15:56:56.456242 7f707d563700 0 log [ERR] : scrub 3.6b b10deaeb/rb.0.15c26.238e1f29.86a2/12d7//3 found clone without head
2013-05-22 15:57:51.667085 7f707dd64700 0 log [ERR] : 3.6b scrub 3 errors
2013-05-22 15:57:55.241224 7f707dd64700 0 log [INF] : 9.450 scrub ok
2013-05-22 15:57:59.800383 7f707cd62700 0 log [INF] : 9.465 scrub ok
2013-05-22 15:59:55.024065 7f707661a700 0 -- 192.168.42.3:6803/12142 192.168.42.5:6828/31490 pipe(0x2a689000 sd=108 :6803 s=2 pgs=200652 cs=73 l=0).fault with nothing to send, going to standby
2013-05-22 16:01:45.542579 7f7022770700 0 -- 192.168.42.3:6803/12142 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=0 pgs=0 cs=0 l=0).accept connect_seq 74 vs existing 73 state standby
--
2013-05-22 16:29:49.544310 7f707dd64700 0 log [INF] : 9.4eb scrub ok
2013-05-22 16:29:53.190233 7f707dd64700 0 log [INF] : 9.4f4 scrub ok
2013-05-22 16:29:59.478736 7f707dd64700 0 log [INF] : 8.6bb scrub ok
2013-05-22 16:35:12.240246 7f7022770700 0 -- 192.168.42.3:6803/12142 192.168.42.5:6828/31490 pipe(0x2a689280 sd=99 :6803 s=2 pgs=200667 cs=75 l=0).fault with nothing to send, going to standby
2013-05-22 16:35:19.519019 7f707d563700 0 log [INF] : 8.700 scrub ok
2013-05-22 16:39:15.422532 7f707dd64700 0 log [ERR] : scrub 3.1 b1869301/rb.0.15c26.238e1f29.0836/12d7//3 found clone without head
2013-05-22 16:40:04.995256 7f707cd62700 0 log [ERR] : scrub 3.1 bccad701/rb.0.15c26.238e1f29.9a00/12d7//3 found clone without head
2013-05-22 16:41:07.008717 7f707d563700 0 log [ERR] : scrub 3.1 8a9bec01/rb.0.15c26.238e1f29.9820/12d7//3 found clone without head
2013-05-22 16:41:42.460280 7f707c561700 0 log [ERR] : 3.1 scrub 3 errors
2013-05-22 16:46:12.385678 7f7077735700 0 -- 192.168.42.3:6803/12142 192.168.42.5:6828/31490 pipe(0x2a689c80 sd=137 :6803 s=0 pgs=0 cs=0 l=0).accept connect_seq 76 vs existing 75 state standby
2013-05-22 16:58:36.079010 7f707661a700 0 -- 192.168.42.3:6803/12142 192.168.42.3:6801/11745 pipe(0x2a689a00 sd=44 :6803 s=0 pgs=0 cs=0 l=0).accept connect_seq 40 vs existing 39 state standby
2013-05-22 16:58:36.798038 7f707d563700 0 log [INF] : 9.50c scrub ok
2013-05-22 16:58:40.104159 7f707c561700 0 log [INF] : 9.526 scrub ok

Note: I have 8 scrub errors like that, on 4 impacted PGs, and all impacted objects belong to the same RBD image (rb.0.15c26.238e1f29).

On Wednesday, May 22, 2013 at 11:01 -0700, Samuel Just wrote:
[earlier quoted messages trimmed; see the full messages below in this thread]
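A quick way to get the per-PG error summary mentioned above ("8 scrub errors on 4 PGs") straight from an OSD log is to count the 'found clone without head' lines per PG. This is only a sketch over the log format shown above; the PG id is assumed to be the 9th whitespace-separated field, which may differ between Ceph versions.

```shell
# Count 'found clone without head' scrub errors per PG in an OSD log.
# Assumes lines shaped like:
#   <date> <time> <tid> 0 log [ERR] : scrub <pgid> <hoid> found clone without head
# so the PG id is field 9.
count_clone_errors() {
  grep 'found clone without head' "$1" | awk '{print $9}' | sort | uniq -c | sort -rn
}
# Usage (path hypothetical): count_clone_errors /var/log/ceph/osd.28.log
```

On the log excerpt above this would report 3 errors for PG 3.6b and 3 for PG 3.1.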
Re: [ceph-users] scrub error: found clone without head
Do all of the affected PGs share osd.28 as the primary? I think the only recovery is probably to manually remove the orphaned clones.
-Sam

On Thu, May 23, 2013 at 5:00 AM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
[quoted thread trimmed; see the full messages elsewhere in this thread]
Re: [ceph-users] scrub error: found clone without head
No:

pg 3.7c is active+clean+inconsistent, acting [24,13,39]
pg 3.6b is active+clean+inconsistent, acting [28,23,5]
pg 3.d is active+clean+inconsistent, acting [29,4,11]
pg 3.1 is active+clean+inconsistent, acting [28,19,5]

But I suppose that all of these PGs *were* using osd.25 as their primary (on the same host), which is the (now disabled) buggy OSD.

Question: 12d7 in the object path is the snapshot id, right? If so, I haven't got any snapshot with this id for the rb.0.15c26.238e1f29 image. So, which files should I remove?

Thanks for your help.

On Thursday, May 23, 2013 at 15:17 -0700, Samuel Just wrote:
[quoted thread trimmed; see the full messages elsewhere in this thread]
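One possible explanation for "no snapshot with this id": the 12d7 component in the on-disk object names is hexadecimal, while `rbd snap ls` prints snap ids in decimal, so the two won't match by eye. A quick conversion (the pool name is taken from earlier in the thread; the image name is a placeholder):

```shell
# The snap id embedded in the filenames is hex; convert it before
# comparing with the decimal SNAPID column of 'rbd snap ls'.
printf '0x12d7 = %d\n' 0x12d7
# prints: 0x12d7 = 4823
# Then compare against the image's snapshots (image name hypothetical):
#   rbd -p hdd3copies snap ls <image-name>
```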
Re: [ceph-users] scrub error: found clone without head
Can you send the filenames in the pg directories for those 4 pgs?
-Sam

On Thu, May 23, 2013 at 3:27 PM, Olivier Bonvalet ceph.l...@daevel.fr wrote:
[quoted thread trimmed; see the full messages elsewhere in this thread]
Re: [ceph-users] scrub error: found clone without head
On Monday, May 20, 2013 at 00:06 +0200, Olivier Bonvalet wrote:
[quoted thread trimmed; see the full messages below]

Hi,

since pg repair doesn't handle that kind of error, is there a way to fix it manually? (It's a production cluster.)

Thanks in advance,
Olivier

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] scrub error: found clone without head
Can you post your ceph.log with the period including all of these errors?
-Sam

On Wed, May 22, 2013 at 5:39 AM, Dzianis Kahanovich maha...@bspu.unibel.by wrote:

Olivier Bonvalet writes:
[nested quotes trimmed; see the full messages elsewhere in this thread]

Trying to fix this manually, I caused assertions in the trimming process (a dead OSD), and many other troubles. So if you want to keep the cluster running, wait for the developers' answer, IMHO. About the manual repair attempt: see issue #4937. There are also similar results in the thread "Inconsistent PG's, repair ineffective".

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
Re: [ceph-users] scrub error: found clone without head
Great, thanks. I will follow this issue and add information if needed.

On Monday, May 20, 2013 at 17:22 +0300, Dzianis Kahanovich wrote:

http://tracker.ceph.com/issues/4937

For me it progressed all the way to a ceph reinstall with data restored from backup (I helped ceph die, but that was IMHO self-provocation to force a reinstall). Now (at least until my summer outdoors) I keep v0.62 (3 nodes) with every pool at size=3 min_size=2 (it was size=2 min_size=1). But first try to do nothing, and try installing the latest version. And add your vote to issue #4937 to push the developers.

[quoted thread trimmed; see the message below]

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/
Re: [ceph-users] scrub error: found clone without head
On Tuesday, May 7, 2013 at 15:51 +0300, Dzianis Kahanovich wrote:

I have 4 scrub errors (3 PGs, "found clone without head") on one OSD, and they are not repairing. How can I repair this without re-creating the OSD? Right now it would be easy to wipe and re-create the OSD, but in theory, with multiple affected OSDs, that could cause data loss.

--
WBR, Dzianis Kahanovich AKA Denis Kaganovich, http://mahatma.bspu.unibel.by/

Hi,

I have the same problem: 8 objects (4 PGs) with the error "found clone without head". How can I fix that?

Thanks,
Olivier