Thanks Wang, it looks like that's the case, so Ceph is not to blame :)

On 25 October 2016 at 09:59, Haomai Wang <hao...@xsky.com> wrote:
> Could you check dmesg? I think there is a disk EIO error.
>
> On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
>
>> Hi,
>>
>> One of several OSDs on the same machine has crashed several times within
>> a few days. It's always the same one; the other OSDs are all fine. Below
>> is the dumped message. Since it's too long to include in full, I only
>> pasted the head and tail of the recent events. If the full log is needed,
>> please see
>> https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.
>>
>> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
>> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
>>
>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
>> 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>> 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>> 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>> 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>> 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
>> 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>> 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>> 10: (()+0x7dc5) [0x7f309cd26dc5]
>> 11: (clone()+0x6d) [0x7f309b80821d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- begin dump of recent events ---
>> -10000> 2016-10-24 18:51:34.341035 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
>> -9999> 2016-10-24 18:51:34.341046 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
>> -9998> 2016-10-24 18:51:34.341058 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
>> -9997> 2016-10-24 18:51:34.341069 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
>> -9996> 2016-10-24 18:51:34.341080 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
>> -9995> 2016-10-24 18:51:34.341090 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
>> -9994> 2016-10-24 18:51:34.341101 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
>> -9993> 2016-10-24 18:51:34.341113 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
>> -9992> 2016-10-24 18:51:34.341128 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
>> -9991> 2016-10-24 18:51:34.341139 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
>> -9990> 2016-10-24 18:51:34.341130 7f3088a48700 1 -- 10.3.149.62:0/25857 <== osd.1 10.3.149.55:6835/2010188 187557 ==== osd_ping(ping_reply e3014 stamp 2016-10-24 18:51:34.340550) v2 ==== 47+0+0 (1550182756 0 0) 0x1a83bc00 con 0x7874580
>> -9989> 2016-10-24 18:51:34.341151 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
>> -9988> 2016-10-24 18:51:34.341162 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
>> -9987> 2016-10-24 18:51:34.341174 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
>> -9986> 2016-10-24 18:51:34.341186 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
>> -9985> 2016-10-24 18:51:34.341208 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
>> -9984> 2016-10-24 18:51:34.341231 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0xa67da00 con 0x7874c60
>> -9983> 2016-10-24 18:51:34.341249 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6809/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x22111000 con 0x17887860
>> -9982> 2016-10-24 18:51:34.341262 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6811/2023604 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fe62200 con 0x17887de0
>> -9981> 2016-10-24 18:51:34.341281 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.58:6802/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x1fc32c00 con 0x24246100
>> -9980> 2016-10-24 18:51:34.341297 7f307b22d700 1 -- 10.3.149.62:0/25857 --> 10.3.149.63:6801/2023892 -- osd_ping(ping e3014 stamp 2016-10-24 18:51:34.340550) v2 -- ?+0 0x20544c00 con 0x24246d60
>> .
>> .
>> .
>> -20> 2016-10-24 18:52:05.273121 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x27c1a600 con 0x1744aaa0
>> -19> 2016-10-24 18:52:05.273129 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.1 10.3.149.60:0/10188 187279 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.212809) v2 ==== 47+0+0 (387409057 0 0) 0x1ff4f600 con 0x175b1860
>> -18> 2016-10-24 18:52:05.273157 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x10d73a00 con 0x175b1860
>> -17> 2016-10-24 18:52:05.641202 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0x9d0a200 con 0x175172e0
>> -16> 2016-10-24 18:52:05.641209 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.29 10.3.149.59:0/35501 187818 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0 (3027252596 0 0) 0xa27ba00 con 0x264422c0
>> -15> 2016-10-24 18:52:05.641246 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1b8a6200 con 0x175172e0
>> -14> 2016-10-24 18:52:05.641290 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1ff4f600 con 0x264422c0
>> -13> 2016-10-24 18:52:05.689610 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x1be24600 con 0x15268b00
>> -12> 2016-10-24 18:52:05.689664 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0x9d0a200 con 0x15268b00
>> -11> 2016-10-24 18:52:05.689661 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.13 10.3.149.56:0/5402 187624 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0 (1310408758 0 0) 0x19705600 con 0x175b1de0
>> -10> 2016-10-24 18:52:05.689729 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0xa27ba00 con 0x175b1de0
>> -9> 2016-10-24 18:52:05.861925 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x12169400 con 0x17514000
>> -8> 2016-10-24 18:52:05.861957 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x1be24600 con 0x17514000
>> -7> 2016-10-24 18:52:05.861963 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.4 10.3.149.60:0/12742 187653 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0 (350590821 0 0) 0x269fba00 con 0x26442840
>> -6> 2016-10-24 18:52:05.862015 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x19705600 con 0x26442840
>> -5> 2016-10-24 18:52:05.882605 7f3094bb6700 5 osd.19 3014 tick
>> -4> 2016-10-24 18:52:05.988572 7f3086243700 1 -- 10.3.149.57:6811/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0xae91200 con 0x177bb760
>> -3> 2016-10-24 18:52:05.988582 7f3087a46700 1 -- 10.3.149.62:6810/25857 <== osd.25 10.3.149.58:0/24382 187898 ==== osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0 (3778423740 0 0) 0x1a396000 con 0x1526bc80
>> -2> 2016-10-24 18:52:05.988608 7f3086243700 1 -- 10.3.149.57:6811/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x12169400 con 0x177bb760
>> -1> 2016-10-24 18:52:05.988652 7f3087a46700 1 -- 10.3.149.62:6810/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x269fba00 con 0x1526bc80
>> 0> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24 18:52:06.213123
>> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio || got != -5)
>>
>> ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0xbc9195]
>> 2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>> 3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>> 4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t, std::allocator<hobject_t> > const&, bool, unsigned int, ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>> 5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool, unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>> 6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2) [0x7df722]
>> 7: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>> 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>> 9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>> 10: (()+0x7dc5) [0x7f309cd26dc5]
>> 11: (clone()+0x6d) [0x7f309b80821d]
>> NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
>>
>> --- logging levels ---
>> 0/ 5 none
>> 0/ 1 lockdep
>> 0/ 1 context
>> 1/ 1 crush
>> 1/ 5 mds
>> 1/ 5 mds_balancer
>> 1/ 5 mds_locker
>> 1/ 5 mds_log
>> 1/ 5 mds_log_expire
>> 1/ 5 mds_migrator
>> 0/ 1 buffer
>> 0/ 1 timer
>> 0/ 1 filer
>> 0/ 1 striper
>> 0/ 1 objecter
>> 0/ 5 rados
>> 0/ 5 rbd
>> 0/ 5 rbd_replay
>> 0/ 5 journaler
>> 0/ 5 objectcacher
>> 0/ 5 client
>> 0/ 5 osd
>> 0/ 5 optracker
>> 0/ 5 objclass
>> 1/ 3 filestore
>> 1/ 3 keyvaluestore
>> 1/ 3 journal
>> 0/ 5 ms
>> 1/ 5 mon
>> 0/10 monc
>> 1/ 5 paxos
>> 0/ 5 tp
>> 1/ 5 auth
>> 1/ 5 crypto
>> 1/ 1 finisher
>> 1/ 5 heartbeatmap
>> 1/ 5 perfcounter
>> 1/ 5 rgw
>> 1/10 civetweb
>> 1/ 5 javaclient
>> 1/ 5 asok
>> 1/ 1 throttle
>> 0/ 0 refs
>> 1/ 5 xio
>> -2/-2 (syslog threshold)
>> -1/-1 (stderr threshold)
>> max_recent 10000
>> max_new 1000
>> log_file /var/log/ceph/ceph-osd.19.log
>> --- end dump of recent events ---
>>
>> Since the ceph-osd objdump is too large to put in a mail, I will not
>> attach it, but if it is needed I'll find a way to share it. What might be
>> the cause? Can anyone help me with this? Thanks.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
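For anyone who finds this thread later: the `-5` in the failed assert is `-EIO`, which is why checking dmesg was the suggestion. Below is a rough Python sketch of that check, not anything taken from Ceph itself. The helper names (`find_io_errors`, `is_eio`) and the regular expression are assumptions made up for illustration; real kernel messages vary by driver and kernel version, so treat the pattern as a starting point only.

```python
import errno
import re

# Heuristic substrings that may appear in kernel messages for a failing
# disk. This pattern is an assumption for illustration, not an exhaustive
# or authoritative list.
IO_ERROR_PATTERN = re.compile(
    r"I/O error|medium error|unrecovered read error",
    re.IGNORECASE,
)


def find_io_errors(kernel_log_text):
    """Return the lines of kernel_log_text that look like disk I/O errors."""
    return [
        line
        for line in kernel_log_text.splitlines()
        if IO_ERROR_PATTERN.search(line)
    ]


def is_eio(read_result):
    """True when a read-style return value equals -EIO (the `got != -5` case)."""
    return read_result == -errno.EIO


if __name__ == "__main__":
    # Made-up sample text, NOT real output from the machine in this thread.
    # On a real host the text would come from `dmesg` or a saved kernel log.
    sample = "\n".join([
        "placeholder line: link is up",
        "placeholder line: I/O error, dev sdX, sector 0",
        "placeholder line: unrelated message",
    ])
    for line in find_io_errors(sample):
        print(line)
    print(is_eio(-5))
```

If lines like these show up for the device backing the crashing OSD, the disk rather than Ceph is the likely culprit, which matches the conclusion reached above.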