On Monday, April 15, 2013, at 10:57 -0700, Gregory Farnum wrote:
> On Mon, Apr 15, 2013 at 10:19 AM, Olivier Bonvalet <[email protected]>
> wrote:
> > On Monday, April 15, 2013, at 10:16 -0700, Gregory Farnum wrote:
> >> Are you saying you saw this problem more than once, and so you
> >> completely wiped the OSD in question, then brought it back into the
> >> cluster, and now it's seeing this error again?
> >
> > Yes, it's exactly that.
> >
> >
> >> Are any other OSDs experiencing this issue?
> >
> > No, only this one has the problem.
>
> Did you run scrubs while this node was out of the cluster? If you
> wiped the data and this is recurring then this is apparently an issue
> with the cluster state, not just one node, and any other primary for
> the broken PG(s) should crash as well. Can you verify by taking this
> one down and then doing a full scrub?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
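Just to be sure I read the suggestion correctly, I understand "taking this
one down and doing a full scrub" as roughly the following (a rough sketch
with my OSD ids; the exact command syntax may differ on this 0.56.4 build):

    # stop the suspect OSD and mark it out so its data rebalances elsewhere
    service ceph stop osd.31
    ceph osd out 31

    # once recovery settles, trigger deep scrubs on the surviving copies,
    # either per OSD or per PG
    ceph osd deep-scrub <osd-id>
    ceph pg deep-scrub <pgid>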
So, I marked this OSD as "out" to rebalance the data and be able to re-run a
scrub. You are probably right, since I now have 3 other OSDs on the same
host which are down.
I still don't have any PG reported in error (the cluster is only in
HEALTH_WARN status), but something is going wrong.
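The state I am looking at is just the usual status output, something like:

    ceph health detail     # lists the down OSDs and the degraded/recovering PGs
    ceph osd tree          # shows that the 3 down OSDs sit on the same host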
In syslog I have:
Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742915 7fe651131700 -1
filestore(/var/lib/ceph/osd/ceph-31) could not find
d452999/rb.0.2d76b.238e1f29.00000000162d/f99//4 in index: (2) No such file or
directory
Apr 16 02:07:08 alim ceph-osd: 2013-04-16 02:07:06.742999 7fe651131700 -1
filestore(/var/lib/ceph/osd/ceph-31) could not find
85242299/rb.0.1367.2ae8944a.000000001f9c/f98//4 in index: (2) No such file or
directory
Apr 16 03:41:11 alim ceph-osd: 2013-04-16 03:41:11.758150 7fe64f12d700 -1
osd.31 48020 heartbeat_check: no reply from osd.5 since 2013-04-16
03:40:50.349130 (cutoff 2013-04-16 03:40:51.758149)
Apr 16 04:27:40 alim ceph-osd: 2013-04-16 04:27:40.529492 7fe65e14b700 -1
osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16
04:27:20.203868 (cutoff 2013-04-16 04:27:20.529489)
Apr 16 04:27:41 alim ceph-osd: 2013-04-16 04:27:41.529609 7fe65e14b700 -1
osd.31 48416 heartbeat_check: no reply from osd.26 since 2013-04-16
04:27:20.203868 (cutoff 2013-04-16 04:27:21.529605)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.440257 7fe64f12d700 -1
osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16
05:01:22.529918 (cutoff 2013-04-16 05:01:23.440257)
Apr 16 05:01:43 alim ceph-osd: 2013-04-16 05:01:43.523985 7fe65e14b700 -1
osd.31 48602 heartbeat_check: no reply from osd.4 since 2013-04-16
05:01:22.529918 (cutoff 2013-04-16 05:01:23.523984)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:27.770327 7fe65e14b700 -1
osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16
05:55:07.392502 (cutoff 2013-04-16 05:55:07.770323)
Apr 16 05:55:28 alim ceph-osd: 2013-04-16 05:55:28.497600 7fe64f12d700 -1
osd.31 48847 heartbeat_check: no reply from osd.26 since 2013-04-16
05:55:07.392502 (cutoff 2013-04-16 05:55:08.497598)
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977
osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 4: (PG::scrub()+0x145) [0x6c4e55]
 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 8: (()+0x68ca) [0x7fe6626558ca]
 9: (clone()+0x6d) [0x7fe661184b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.051839 7fe65012f700 -1 osd/ReplicatedPG.cc: In function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7fe65012f700 time 2013-04-16 06:04:12.843977
osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 4: (PG::scrub()+0x145) [0x6c4e55]
 5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 8: (()+0x68ca) [0x7fe6626558ca]
 9: (clone()+0x6d) [0x7fe661184b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **
 in thread 7fe65012f700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7fe66265dff0]
 3: (gsignal()+0x35) [0x7fe6610e71b5]
 4: (abort()+0x180) [0x7fe6610e9fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]
 6: (()+0xcb166) [0x7fe66197a166]
 7: (()+0xcb193) [0x7fe66197a193]
 8: (()+0xcb28e) [0x7fe66197a28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7fe6626558ca]
 18: (clone()+0x6d) [0x7fe661184b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Apr 16 06:04:13 alim ceph-osd: 0> 2013-04-16 06:04:13.277072 7fe65012f700 -1 *** Caught signal (Aborted) **
 in thread 7fe65012f700

 ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
 1: /usr/bin/ceph-osd() [0x7a6289]
 2: (()+0xeff0) [0x7fe66265dff0]
 3: (gsignal()+0x35) [0x7fe6610e71b5]
 4: (abort()+0x180) [0x7fe6610e9fc0]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7fe66197bdc5]
 6: (()+0xcb166) [0x7fe66197a166]
 7: (()+0xcb193) [0x7fe66197a193]
 8: (()+0xcb28e) [0x7fe66197a28e]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x7c9) [0x8f9549]
 10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
 11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
 12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
 13: (PG::scrub()+0x145) [0x6c4e55]
 14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
 15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
 16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
 17: (()+0x68ca) [0x7fe6626558ca]
 18: (clone()+0x6d) [0x7fe661184b6d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
And the last lines from osd.24.log are:
-10> 2013-04-16 08:08:54.991371 7f5bb4569700 2 osd.24 pg_epoch: 49397
pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382
n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375
mlcod 49397'11387808 active+clean+scrubbing+deep
snaptrimq=[1f86~5,1f8c~8,2bd8~e]] scrub osd.6 has 10 items
-9> 2013-04-16 08:08:54.991876 7f5bb4569700 2 osd.24 pg_epoch: 49397
pg[3.7c( v 49397'11387809 (49355'11386808,49397'11387809] local-les=49382
n=18795 ec=56 les/c 49382/49397 49375/49375/49375) [24,13,6] r=0 lpr=49375
mlcod 49397'11387808 active+clean+scrubbing+deep
snaptrimq=[1f86~5,1f8c~8,2bd8~e]] 3.7c osd.24 inconsistent snapcolls on
7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.13 inconsistent snapcolls on
7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.13: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size
4194304 != known size 0, digest 1360833101 != known digest 0
3.7c osd.6 inconsistent snapcolls on
7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 found expected 12d7
3.7c osd.6: soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304
!= known size 0, digest 1360833101 != known digest 0
-8> 2013-04-16 08:08:54.991906 7f5bb4569700 0 log [ERR] : 3.7c osd.24
inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3
found expected 12d7
-7> 2013-04-16 08:08:54.991913 7f5bb4569700 0 log [ERR] : 3.7c osd.13
inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3
found expected 12d7
-6> 2013-04-16 08:08:54.991915 7f5bb4569700 0 log [ERR] : 3.7c osd.13:
soid 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known
size 0, digest 1360833101 != known digest 0
-5> 2013-04-16 08:08:54.991917 7f5bb4569700 0 log [ERR] : 3.7c osd.6
inconsistent snapcolls on 7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3
found expected 12d7
-4> 2013-04-16 08:08:54.991919 7f5bb4569700 0 log [ERR] : 3.7c osd.6: soid
7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 size 4194304 != known size 0,
digest 1360833101 != known digest 0
-3> 2013-04-16 08:08:54.991986 7f5bb4569700 0 log [ERR] : deep-scrub 3.7c
7b52237c/rb.0.15c26.238e1f29.000000000a76/12d7//3 on disk size (0) does not
match object info size (4194304)
-2> 2013-04-16 08:08:54.993813 7f5bbbd78700 5 --OSD::tracker-- reqid:
client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993812, event:
op_applied, request: osd_op(client.1811920.1:164200641
rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc
2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
-1> 2013-04-16 08:08:54.993901 7f5bbbd78700 5 --OSD::tracker-- reqid:
client.1811920.1:164200641, seq: 4575, time: 2013-04-16 08:08:54.993901, event:
done, request: osd_op(client.1811920.1:164200641
rb.0.2d766.238e1f29.000000002c80 [write 442368~4096] 3.31aa16ec snapc
2bc6=[2bc6,2b84,2b33,2ada,2a0e,2940,285f,2775,2633,25e8,24d1,2452,2363,22c6,21a9,20db,1f98])
0> 2013-04-16 08:08:54.994227 7f5bb4569700 -1 osd/ReplicatedPG.cc: In
function 'virtual void ReplicatedPG::_scrub(ScrubMap&)' thread 7f5bb4569700
time 2013-04-16 08:08:54.991990
osd/ReplicatedPG.cc: 7188: FAILED assert(head != hobject_t())
ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
1: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
2: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
3: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
4: (PG::scrub()+0x145) [0x6c4e55]
5: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
6: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
7: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
8: (()+0x68ca) [0x7f5bc72908ca]
9: (clone()+0x6d) [0x7f5bc5dbfb6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---
2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted) **
in thread 7f5bb4569700
ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
1: /usr/bin/ceph-osd() [0x7a6289]
2: (()+0xeff0) [0x7f5bc7298ff0]
3: (gsignal()+0x35) [0x7f5bc5d221b5]
4: (abort()+0x180) [0x7f5bc5d24fc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
6: (()+0xcb166) [0x7f5bc65b5166]
7: (()+0xcb193) [0x7f5bc65b5193]
8: (()+0xcb28e) [0x7f5bc65b528e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x8f9549]
10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
13: (PG::scrub()+0x145) [0x6c4e55]
14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
17: (()+0x68ca) [0x7f5bc72908ca]
18: (clone()+0x6d) [0x7f5bc5dbfb6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- begin dump of recent events ---
0> 2013-04-16 08:08:55.163535 7f5bb4569700 -1 *** Caught signal (Aborted)
**
in thread 7f5bb4569700
ceph version 0.56.4-4-gd89ab0e (d89ab0ea6fa8d0961cad82f6a81eccbd3bbd3f55)
1: /usr/bin/ceph-osd() [0x7a6289]
2: (()+0xeff0) [0x7f5bc7298ff0]
3: (gsignal()+0x35) [0x7f5bc5d221b5]
4: (abort()+0x180) [0x7f5bc5d24fc0]
5: (__gnu_cxx::__verbose_terminate_handler()+0x115) [0x7f5bc65b6dc5]
6: (()+0xcb166) [0x7f5bc65b5166]
7: (()+0xcb193) [0x7f5bc65b5193]
8: (()+0xcb28e) [0x7f5bc65b528e]
9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x7c9) [0x8f9549]
10: (ReplicatedPG::_scrub(ScrubMap&)+0x1a78) [0x57a038]
11: (PG::scrub_compare_maps()+0xeb8) [0x696c18]
12: (PG::chunky_scrub()+0x2d9) [0x6c37f9]
13: (PG::scrub()+0x145) [0x6c4e55]
14: (OSD::ScrubWQ::_process(PG*)+0xc) [0x64048c]
15: (ThreadPool::worker(ThreadPool::WorkThread*)+0x879) [0x815179]
16: (ThreadPool::WorkThread::entry()+0x10) [0x817980]
17: (()+0x68ca) [0x7f5bc72908ca]
18: (clone()+0x6d) [0x7f5bc5dbfb6d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
0/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/ 5 hadoop
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
-1/-1 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/osd.24.log
--- end dump of recent events ---
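If it helps, I can also look directly on disk at the object the deep-scrub
complains about; with the filestore layout I would expect something like
this to find the copies (the paths are a guess for this setup):

    # run on the hosts carrying osd.24, osd.13 and osd.6 (the acting set of pg 3.7c)
    find /var/lib/ceph/osd/ceph-24/current/3.7c_head \
         -name 'rb.0.15c26.238e1f29.000000000a76*' -ls

so that the on-disk sizes can be compared with the "size 0 vs 4194304"
reported by the scrub.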