Ok, have further narrowed this down, when running the OSD in the foreground
with "usr/bin/ceph-osd -i 35 --pid-file /var/run/ceph/osd.35.pid -c
/etc/ceph/ceph.conf --cluster ceph -f"
I see the following, it looks like I'm now chasing down
'ceph::buffer::malformed_input' and 'PGLog::read_log'
[root@cephosd4 ~]# /usr/bin/ceph-osd -i 35 --pid-file /var/run/ceph/osd.35.pid
-c /etc/ceph/ceph.conf --cluster ceph -f
starting osd.35 at :/0 osd_data /var/lib/ceph/osd/ceph-35
/var/lib/ceph/osd/ceph-35/journal
terminate called after throwing an instance of 'ceph::buffer::malformed_input'
what(): buffer::malformed_input: bad checksum on pg_log_entry_t
*** Caught signal (Aborted) **
in thread 7f2eeaef7880
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: /usr/bin/ceph-osd() [0xac5c32]
2: (()+0xf130) [0x7f2ee9885130]
3: (gsignal()+0x37) [0x7f2ee829f5d7]
4: (abort()+0x148) [0x7f2ee82a0cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5]
6: (()+0x5e926) [0x7f2ee8ba1926]
7: (()+0x5e953) [0x7f2ee8ba1953]
8: (()+0x5eb73) [0x7f2ee8ba1b73]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230)
[0x792510]
10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t
const&, std::map<eversion_t, hobject_t, std::less<eversion_t>,
std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*)+0xa2f) [0x76b9bf]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf]
12: (OSD::load_pgs()+0xa7a) [0x6b6a8a]
13: (OSD::init()+0x729) [0x6b9319]
14: (main()+0x27f3) [0x643ed3]
15: (__libc_start_main()+0xf5) [0x7f2ee828baf5]
16: /usr/bin/ceph-osd() [0x65d139]
2016-04-05 12:00:29.990938 7f2eeaef7880 -1 *** Caught signal (Aborted) **
in thread 7f2eeaef7880
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: /usr/bin/ceph-osd() [0xac5c32]
2: (()+0xf130) [0x7f2ee9885130]
3: (gsignal()+0x37) [0x7f2ee829f5d7]
4: (abort()+0x148) [0x7f2ee82a0cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5]
6: (()+0x5e926) [0x7f2ee8ba1926]
7: (()+0x5e953) [0x7f2ee8ba1953]
8: (()+0x5eb73) [0x7f2ee8ba1b73]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230)
[0x792510]
10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t
const&, std::map<eversion_t, hobject_t, std::less<eversion_t>,
std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*)+0xa2f) [0x76b9bf]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf]
12: (OSD::load_pgs()+0xa7a) [0x6b6a8a]
13: (OSD::init()+0x729) [0x6b9319]
14: (main()+0x27f3) [0x643ed3]
15: (__libc_start_main()+0xf5) [0x7f2ee828baf5]
16: /usr/bin/ceph-osd() [0x65d139]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
0> 2016-04-05 12:00:29.990938 7f2eeaef7880 -1 *** Caught signal (Aborted)
**
in thread 7f2eeaef7880
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: /usr/bin/ceph-osd() [0xac5c32]
2: (()+0xf130) [0x7f2ee9885130]
3: (gsignal()+0x37) [0x7f2ee829f5d7]
4: (abort()+0x148) [0x7f2ee82a0cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f2ee8ba39b5]
6: (()+0x5e926) [0x7f2ee8ba1926]
7: (()+0x5e953) [0x7f2ee8ba1953]
8: (()+0x5eb73) [0x7f2ee8ba1b73]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230)
[0x792510]
10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t
const&, std::map<eversion_t, hobject_t, std::less<eversion_t>,
std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*)+0xa2f) [0x76b9bf]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf]
12: (OSD::load_pgs()+0xa7a) [0x6b6a8a]
13: (OSD::init()+0x729) [0x6b9319]
14: (main()+0x27f3) [0x643ed3]
15: (__libc_start_main()+0xf5) [0x7f2ee828baf5]
16: /usr/bin/ceph-osd() [0x65d139]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
Aborted
[root@cephosd4 ~]#
From: ceph-users [mailto:[email protected]] On Behalf Of Cory
Hawkless
Sent: Tuesday, 5 April 2016 1:19 AM
To: [email protected]
Subject: Re: [ceph-users] OSD not coming up
Ok, we seem to have traced the issue down to this
" -4> 2016-04-05 00:59:37.790223 7fbde47a0880 20 read_log 1270'2332
(1270'911) delete 50686c09/rb.0.6f57.2506ccfd.00000000bb2a/head//34 by
client.95981.0:47944 2015-10-13 16:47:23.620857
-3> 2016-04-05 00:59:37.790236 7fbde47a0880 20 read_log 1270'2333
(1270'917) delete 7c7a1409/rb.0.6f57.2506ccfd.00000000bcc6/head//34 by
client.95981.0:48356 2015-10-13 16:47:24.523262
-2> 2016-04-05 00:59:37.790256 7fbde47a0880 20 read_log 1270'2334
(1270'983) delete 483dbe09/rb.0.6f57.2506ccfd.00000000c9e9/head//34 by
client.95981.0:51720 2015-10-13 16:47:31.584863
-1> 2016-04-05 00:59:37.790269 7fbde47a0880 20 read_log 1270'2335
(1270'985) delete 85411809/rb.0.6f57.2506ccfd.00000000cd66/head//34 by
client.95981.0:52614 2015-10-13 16:47:33.278702
0> 2016-04-05 00:59:37.793431 7fbde47a0880 -1 *** Caught signal (Aborted)
**
in thread 7fbde47a0880
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: /usr/bin/ceph-osd() [0xac5c32]
2: (()+0xf130) [0x7fbde312e130]
3: (gsignal()+0x37) [0x7fbde1b485d7]
4: (abort()+0x148) [0x7fbde1b49cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) ["
Here is a link to the full log of the OSD starting with debugging enabled
https://www.dropbox.com/s/i1x2vqcgp7qkr96/ceph-osd.35-3.log?dl=0
From: ceph-users [mailto:[email protected]] On Behalf Of Cory
Hawkless
Sent: Monday, 4 April 2016 6:27 PM
To: [email protected]
Subject: [ceph-users] OSD not coming up
I've moved a whole set of OSD's from one server to another after a filure.
All other OSD's have come up no problems and Ceph is busy rebalancing itself
after the change, but one of my OSD's is being difficult. Below is the output
from the log file when I use 'service ceph restart osd.35' to try and bring it
online.
I can see any usefull errors here to start chasing down, can you?
-32> 2016-04-04 16:07:13.425771 7f28d457f880 2 journal read_entry
4048896000 : seq 283263182 137 bytes
-31> 2016-04-04 16:07:13.425780 7f28d457f880 2 journal read_entry
4048900096 : seq 283263183 137 bytes
-30> 2016-04-04 16:07:13.425791 7f28d457f880 2 journal No further valid
entries found, journal is most likely valid
-29> 2016-04-04 16:07:13.425802 7f28d457f880 2 journal No further valid
entries found, journal is most likely valid
-28> 2016-04-04 16:07:13.425809 7f28d457f880 3 journal journal_replay: end
of journal, done.
-27> 2016-04-04 16:07:13.431977 7f28d457f880 1 journal _open
/var/lib/ceph/osd/ceph-35/journal fd 20: 5367660544 bytes, block size 4096
bytes, directio = 1, aio = 1
-26> 2016-04-04 16:07:13.432317 7f28d457f880 2 osd.35 0 boot
-25> 2016-04-04 16:07:13.462245 7f28d457f880 0 <cls>
cls/hello/cls_hello.cc:271: loading cls_hello
-24> 2016-04-04 16:07:13.463479 7f28d457f880 1 <cls>
cls/log/cls_log.cc:312: Loaded log class!
-23> 2016-04-04 16:07:13.468017 7f28d457f880 1 <cls>
cls/refcount/cls_refcount.cc:231: Loaded refcount class!
-22> 2016-04-04 16:07:13.468228 7f28d457f880 1 <cls>
cls/replica_log/cls_replica_log.cc:141: Loaded replica log class!
-21> 2016-04-04 16:07:13.472035 7f28d457f880 1 <cls>
cls/rgw/cls_rgw.cc:3047: Loaded rgw class!
-20> 2016-04-04 16:07:13.472480 7f28d457f880 1 <cls>
cls/statelog/cls_statelog.cc:306: Loaded log class!
-19> 2016-04-04 16:07:13.472890 7f28d457f880 1 <cls>
cls/user/cls_user.cc:367: Loaded user class!
-18> 2016-04-04 16:07:13.473231 7f28d457f880 1 <cls>
cls/version/cls_version.cc:227: Loaded version class!
-17> 2016-04-04 16:07:13.565117 7f28d457f880 0 osd.35 3612 crush map has
features 1107558400, adjusting msgr requires for clients
-16> 2016-04-04 16:07:13.565139 7f28d457f880 0 osd.35 3612 crush map has
features 1107558400 was 8705, adjusting msgr requires for mons
-15> 2016-04-04 16:07:13.565149 7f28d457f880 0 osd.35 3612 crush map has
features 1107558400, adjusting msgr requires for osds
-14> 2016-04-04 16:07:13.565171 7f28d457f880 0 osd.35 3612 load_pgs
-13> 2016-04-04 16:07:36.965045 7f28d457f880 5 osd.35 pg_epoch: 3604
pg[34.2(unlocked)] enter Initial
-12> 2016-04-04 16:07:36.965642 7f28d457f880 5 osd.35 pg_epoch: 3604
pg[34.2( empty local-les=0 n=0 ec=709 les/c 3578/3582 3588/3588/3542) [38,35]
r=1 lpr=0 pi=3542-3587/1 crt=0'0 inactive NOTIFY] exit Initial 0.000598 0
0.000000
-11> 2016-04-04 16:07:36.965691 7f28d457f880 5 osd.35 pg_epoch: 3604
pg[34.2( empty local-les=0 n=0 ec=709 les/c 3578/3582 3588/3588/3542) [38,35]
r=1 lpr=0 pi=3542-3587/1 crt=0'0 inactive NOTIFY] enter Reset
-10> 2016-04-04 16:07:36.987602 7f28d457f880 5 osd.35 pg_epoch: 3587
pg[34.3(unlocked)] enter Initial
-9> 2016-04-04 16:07:37.011658 7f28d457f880 5 osd.35 pg_epoch: 3587
pg[34.3( v 3334'2172 (0'0,3334'2172] local-les=3562 n=0 ec=709 les/c 3562/3587
3516/3516/3459) [35,34] r=0 lpr=0 crt=3334'2170 lcod 0'0 mlcod 0'0 inactive]
exit Initial 0.024056 0 0.000000
-8> 2016-04-04 16:07:37.011696 7f28d457f880 5 osd.35 pg_epoch: 3587
pg[34.3( v 3334'2172 (0'0,3334'2172] local-les=3562 n=0 ec=709 les/c 3562/3587
3516/3516/3459) [35,34] r=0 lpr=0 crt=3334'2170 lcod 0'0 mlcod 0'0 inactive]
enter Reset
-7> 2016-04-04 16:07:37.067070 7f28d457f880 5 osd.35 pg_epoch: 3481
pg[34.7(unlocked)] enter Initial
-6> 2016-04-04 16:07:37.097469 7f28d457f880 5 osd.35 pg_epoch: 3481
pg[34.7( v 3334'2635 (0'0,3334'2635] local-les=3465 n=0 ec=709 les/c 3465/3479
3459/3459/3369) [34,35] r=1 lpr=0 pi=3448-3458/1 crt=3334'2635 lcod 0'0
inactive NOTIFY] exit Initial 0.030398 0 0.000000
-5> 2016-04-04 16:07:37.097531 7f28d457f880 5 osd.35 pg_epoch: 3481
pg[34.7( v 3334'2635 (0'0,3334'2635] local-les=3465 n=0 ec=709 les/c 3465/3479
3459/3459/3369) [34,35] r=1 lpr=0 pi=3448-3458/1 crt=3334'2635 lcod 0'0
inactive NOTIFY] enter Reset
-4> 2016-04-04 16:07:37.146741 7f28d457f880 5 osd.35 pg_epoch: 3582
pg[34.8(unlocked)] enter Initial
-3> 2016-04-04 16:07:37.172346 7f28d457f880 5 osd.35 pg_epoch: 3582
pg[34.8( v 3334'2042 (0'0,3334'2042] lb 0//0//-1 local-les=3494 n=0 ec=709
les/c 3494/3502 3542/3542/3368) [33,38] r=-1 lpr=0 pi=3368-3541/4 crt=3334'2042
lcod 0'0 inactive NOTIFY] exit Initial 0.025604 0 0.000000
-2> 2016-04-04 16:07:37.172387 7f28d457f880 5 osd.35 pg_epoch: 3582
pg[34.8( v 3334'2042 (0'0,3334'2042] lb 0//0//-1 local-les=3494 n=0 ec=709
les/c 3494/3502 3542/3542/3368) [33,38] r=-1 lpr=0 pi=3368-3541/4 crt=3334'2042
lcod 0'0 inactive NOTIFY] enter Reset
-1> 2016-04-04 16:07:37.189150 7f28d457f880 5 osd.35 pg_epoch: 3585
pg[34.9(unlocked)] enter Initial
0> 2016-04-04 16:07:37.268571 7f28d457f880 -1 *** Caught signal (Aborted)
**
in thread 7f28d457f880
ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
1: /usr/bin/ceph-osd() [0xac5c32]
2: (()+0xf130) [0x7f28d2f0d130]
3: (gsignal()+0x37) [0x7f28d19275d7]
4: (abort()+0x148) [0x7f28d1928cc8]
5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f28d222b9b5]
6: (()+0x5e926) [0x7f28d2229926]
7: (()+0x5e953) [0x7f28d2229953]
8: (()+0x5eb73) [0x7f28d2229b73]
9: (pg_log_entry_t::decode_with_checksum(ceph::buffer::list::iterator&)+0x230)
[0x792510]
10: (PGLog::read_log(ObjectStore*, coll_t, coll_t, ghobject_t, pg_info_t
const&, std::map<eversion_t, hobject_t, std::less<eversion_t>,
std::allocator<std::pair<eversion_t const, hobject_t> > >&, PGLog::IndexedLog&,
pg_missing_t&, std::basic_ostringstream<char, std::char_traits<char>,
std::allocator<char> >&, std::set<std::string, std::less<std::string>,
std::allocator<std::string> >*)+0xa2f) [0x76b9bf]
11: (PG::read_state(ObjectStore*, ceph::buffer::list&)+0x34f) [0x7edebf]
12: (OSD::load_pgs()+0xa7a) [0x6b6a8a]
13: (OSD::init()+0x729) [0x6b9319]
14: (main()+0x27f3) [0x643ed3]
15: (__libc_start_main()+0xf5) [0x7f28d1913af5]
16: /usr/bin/ceph-osd() [0x65d139]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 keyvaluestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.35.log
--- end dump of recent events ---
[root@cephosd4 ~]#
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com