Thanks Wang, that looks like the case; Ceph is not to blame :)
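
For anyone who finds this thread later, here is a rough sketch of the two checks discussed. The first helper filters kernel log text (for example the output of dmesg) for lines that look like disk I/O errors. The second mirrors the condition from the failed assert quoted below, where -5 is -EIO. The function names and the list of patterns are my own assumptions for illustration; they are not part of Ceph, and kernel message wording varies between versions.

```python
import errno
import re

# Assumed patterns: wording commonly seen in Linux kernel logs for failing
# disks. This list is illustrative only and is not exhaustive.
IO_ERROR_PATTERNS = [
    re.compile(r"I/O error", re.IGNORECASE),
    re.compile(r"medium error", re.IGNORECASE),
    re.compile(r"unrecovered read error", re.IGNORECASE),
]


def find_io_errors(dmesg_text):
    """Return the kernel log lines that match any of the assumed patterns."""
    return [
        line
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in IO_ERROR_PATTERNS)
    ]


def assert_would_fire(got, allow_eio=False, fail_eio=True):
    """Mirror of: assert(allow_eio || !m_filestore_fail_eio || got != -5).

    Returns True when that condition is false, i.e. when the OSD would abort.
    fail_eio stands in for m_filestore_fail_eio (hypothetical parameter name).
    """
    return not (allow_eio or not fail_eio or got != -errno.EIO)
```

To use it, save the output of dmesg to a file, read it into a string, and pass it to find_io_errors(). Any hits naming the device behind the crashing OSD point at the disk rather than at Ceph. assert_would_fire(-5) shows why a read that returns EIO brings the OSD down unless EIO is explicitly allowed.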

On 25 October 2016 at 09:59, Haomai Wang <hao...@xsky.com> wrote:

> Could you check dmesg? I think there is a disk EIO error.
>
> On Tue, Oct 25, 2016 at 9:58 AM, Zhang Qiang <dotslash...@gmail.com>
> wrote:
>
>> Hi,
>>
>> One of several OSDs on the same machine has crashed several times within
>> a few days. It's always the same one; the other OSDs are all fine. Below
>> is the dumped message. Since it's too long to include in full, I pasted
>> only the head and tail of the recent events. If you need to inspect the
>> full log, please see
>> https://gist.github.com/dotSlashLu/3e8ca9491fbf07636a4583244ac23f80.
>>
>> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In function
>> 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t, size_t,
>> ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time 2016-10-24
>> 18:52:06.213123
>> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbc9195]
>>  2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
>> long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>>  3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
>> ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>>  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
>> std::allocator<hobject_t> > const&, bool, unsigned int,
>> ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>>  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
>> unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>>  6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
>> [0x7df722]
>>  7: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
>> ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>>  10: (()+0x7dc5) [0x7f309cd26dc5]
>>  11: (clone()+0x6d) [0x7f309b80821d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- begin dump of recent events ---
>> -10000> 2016-10-24 18:51:34.341035 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6821/4808 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x175a2c00 con 0x1526a940
>>  -9999> 2016-10-24 18:51:34.341046 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6817/4808 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x175a3600 con 0x15269fa0
>>  -9998> 2016-10-24 18:51:34.341058 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6823/5402 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x12aaa400 con 0x27bc9080
>>  -9997> 2016-10-24 18:51:34.341069 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6821/5402 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x1f89ec00 con 0x27bc91e0
>>  -9996> 2016-10-24 18:51:34.341080 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.56:6824/6216 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0xaa16000 con 0x175b0c00
>>  -9995> 2016-10-24 18:51:34.341090 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.61:6818/6216 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x23b87800 con 0x175ae160
>>  -9994> 2016-10-24 18:51:34.341101 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.57:6802/23367 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x258ed400 con 0x17500d60
>>  -9993> 2016-10-24 18:51:34.341113 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.62:6806/23367 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x242bb000 con 0x175019c0
>>  -9992> 2016-10-24 18:51:34.341128 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.57:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x28e41c00 con 0x1744aec0
>>  -9991> 2016-10-24 18:51:34.341139 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.62:6805/25009 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x10be5200 con 0x175bf8c0
>>  -9990> 2016-10-24 18:51:34.341130 7f3088a48700  1 -- 10.3.149.62:0/25857
>> <== osd.1 10.3.149.55:6835/2010188 187557 ==== osd_ping(ping_reply e3014
>> stamp 2016-10-24 18:51:34.340550) v2 ==== 47+0+0 (1550182756 0 0)
>> 0x1a83bc00 con 0x7874580
>>  -9989> 2016-10-24 18:51:34.341151 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.57:6814/26469 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x1f48aa00 con 0x175bfa20
>>  -9988> 2016-10-24 18:51:34.341162 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.62:6811/26469 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x24456e00 con 0x175bfb80
>>  -9987> 2016-10-24 18:51:34.341174 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.58:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x25c59e00 con 0x7874f20
>>  -9986> 2016-10-24 18:51:34.341186 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.63:6805/2023199 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x19703c00 con 0x7875760
>>  -9985> 2016-10-24 18:51:34.341208 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.58:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x19702600 con 0x26444940
>>  -9984> 2016-10-24 18:51:34.341231 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.63:6803/2023356 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0xa67da00 con 0x7874c60
>>  -9983> 2016-10-24 18:51:34.341249 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.58:6809/2023604 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x22111000 con 0x17887860
>>  -9982> 2016-10-24 18:51:34.341262 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.63:6811/2023604 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x1fe62200 con 0x17887de0
>>  -9981> 2016-10-24 18:51:34.341281 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.58:6802/2023892 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x1fc32c00 con 0x24246100
>>  -9980> 2016-10-24 18:51:34.341297 7f307b22d700  1 -- 10.3.149.62:0/25857
>> --> 10.3.149.63:6801/2023892 -- osd_ping(ping e3014 stamp 2016-10-24
>> 18:51:34.340550) v2 -- ?+0 0x20544c00 con 0x24246d60
>> .
>> .
>> .
>>    -20> 2016-10-24 18:52:05.273121 7f3086243700  1 --
>> 10.3.149.57:6811/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x27c1a600 con 0x1744aaa0
>>    -19> 2016-10-24 18:52:05.273129 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 <== osd.1 10.3.149.60:0/10188 187279 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.212809) v2 ==== 47+0+0
>> (387409057 0 0) 0x1ff4f600 con 0x175b1860
>>    -18> 2016-10-24 18:52:05.273157 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 --> 10.3.149.60:0/10188 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.212809) v2 -- ?+0 0x10d73a00 con 0x175b1860
>>    -17> 2016-10-24 18:52:05.641202 7f3086243700  1 --
>> 10.3.149.57:6811/25857 <== osd.29 10.3.149.59:0/35501 187818 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0
>> (3027252596 0 0) 0x9d0a200 con 0x175172e0
>>    -16> 2016-10-24 18:52:05.641209 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 <== osd.29 10.3.149.59:0/35501 187818 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.640915) v2 ==== 47+0+0
>> (3027252596 0 0) 0xa27ba00 con 0x264422c0
>>    -15> 2016-10-24 18:52:05.641246 7f3086243700  1 --
>> 10.3.149.57:6811/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1b8a6200 con 0x175172e0
>>    -14> 2016-10-24 18:52:05.641290 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 --> 10.3.149.59:0/35501 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.640915) v2 -- ?+0 0x1ff4f600 con 0x264422c0
>>    -13> 2016-10-24 18:52:05.689610 7f3086243700  1 --
>> 10.3.149.57:6811/25857 <== osd.13 10.3.149.56:0/5402 187624 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0
>> (1310408758 0 0) 0x1be24600 con 0x15268b00
>>    -12> 2016-10-24 18:52:05.689664 7f3086243700  1 --
>> 10.3.149.57:6811/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0x9d0a200 con 0x15268b00
>>    -11> 2016-10-24 18:52:05.689661 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 <== osd.13 10.3.149.56:0/5402 187624 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.634215) v2 ==== 47+0+0
>> (1310408758 0 0) 0x19705600 con 0x175b1de0
>>    -10> 2016-10-24 18:52:05.689729 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 --> 10.3.149.56:0/5402 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.634215) v2 -- ?+0 0xa27ba00 con 0x175b1de0
>>     -9> 2016-10-24 18:52:05.861925 7f3086243700  1 --
>> 10.3.149.57:6811/25857 <== osd.4 10.3.149.60:0/12742 187653 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0
>> (350590821 0 0) 0x12169400 con 0x17514000
>>     -8> 2016-10-24 18:52:05.861957 7f3086243700  1 --
>> 10.3.149.57:6811/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x1be24600 con 0x17514000
>>     -7> 2016-10-24 18:52:05.861963 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 <== osd.4 10.3.149.60:0/12742 187653 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.801655) v2 ==== 47+0+0
>> (350590821 0 0) 0x269fba00 con 0x26442840
>>     -6> 2016-10-24 18:52:05.862015 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 --> 10.3.149.60:0/12742 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.801655) v2 -- ?+0 0x19705600 con 0x26442840
>>     -5> 2016-10-24 18:52:05.882605 7f3094bb6700  5 osd.19 3014 tick
>>     -4> 2016-10-24 18:52:05.988572 7f3086243700  1 --
>> 10.3.149.57:6811/25857 <== osd.25 10.3.149.58:0/24382 187898 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0
>> (3778423740 0 0) 0xae91200 con 0x177bb760
>>     -3> 2016-10-24 18:52:05.988582 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 <== osd.25 10.3.149.58:0/24382 187898 ====
>> osd_ping(ping e3014 stamp 2016-10-24 18:52:05.984426) v2 ==== 47+0+0
>> (3778423740 0 0) 0x1a396000 con 0x1526bc80
>>     -2> 2016-10-24 18:52:05.988608 7f3086243700  1 --
>> 10.3.149.57:6811/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x12169400 con 0x177bb760
>>     -1> 2016-10-24 18:52:05.988652 7f3087a46700  1 --
>> 10.3.149.62:6810/25857 --> 10.3.149.58:0/24382 -- osd_ping(ping_reply
>> e3014 stamp 2016-10-24 18:52:05.984426) v2 -- ?+0 0x269fba00 con 0x1526bc80
>>      0> 2016-10-24 18:52:06.216341 7f307c22f700 -1 os/FileStore.cc: In
>> function 'virtual int FileStore::read(coll_t, const ghobject_t&, uint64_t,
>> size_t, ceph::bufferlist&, uint32_t, bool)' thread 7f307c22f700 time
>> 2016-10-24 18:52:06.213123
>> os/FileStore.cc: 2854: FAILED assert(allow_eio || !m_filestore_fail_eio
>> || got != -5)
>>
>>  ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x85) [0xbc9195]
>>  2: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
>> long, ceph::buffer::list&, unsigned int, bool)+0xc94) [0x909f34]
>>  3: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
>> ScrubMap::object&, ThreadPool::TPHandle&)+0x311) [0x9fe0e1]
>>  4: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
>> std::allocator<hobject_t> > const&, bool, unsigned int,
>> ThreadPool::TPHandle&)+0x2e8) [0x8ce8c8]
>>  5: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
>> unsigned int, ThreadPool::TPHandle&)+0x213) [0x7def53]
>>  6: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4c2)
>> [0x7df722]
>>  7: (OSD::RepScrubWQ::_process(MOSDRepScrub*,
>> ThreadPool::TPHandle&)+0xbe) [0x6dcade]
>>  8: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa76) [0xbb9966]
>>  9: (ThreadPool::WorkThread::entry()+0x10) [0xbba9f0]
>>  10: (()+0x7dc5) [0x7f309cd26dc5]
>>  11: (clone()+0x6d) [0x7f309b80821d]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
>> to interpret this.
>>
>> --- logging levels ---
>>    0/ 5 none
>>    0/ 1 lockdep
>>    0/ 1 context
>>    1/ 1 crush
>>    1/ 5 mds
>>    1/ 5 mds_balancer
>>    1/ 5 mds_locker
>>    1/ 5 mds_log
>>    1/ 5 mds_log_expire
>>    1/ 5 mds_migrator
>>    0/ 1 buffer
>>    0/ 1 timer
>>    0/ 1 filer
>>    0/ 1 striper
>>    0/ 1 objecter
>>    0/ 5 rados
>>    0/ 5 rbd
>>    0/ 5 rbd_replay
>>    0/ 5 journaler
>>    0/ 5 objectcacher
>>    0/ 5 client
>>    0/ 5 osd
>>    0/ 5 optracker
>>    0/ 5 objclass
>>    1/ 3 filestore
>>    1/ 3 keyvaluestore
>>    1/ 3 journal
>>    0/ 5 ms
>>    1/ 5 mon
>>    0/10 monc
>>    1/ 5 paxos
>>    0/ 5 tp
>>    1/ 5 auth
>>    1/ 5 crypto
>>    1/ 1 finisher
>>    1/ 5 heartbeatmap
>>    1/ 5 perfcounter
>>    1/ 5 rgw
>>    1/10 civetweb
>>    1/ 5 javaclient
>>    1/ 5 asok
>>    1/ 1 throttle
>>    0/ 0 refs
>>    1/ 5 xio
>>   -2/-2 (syslog threshold)
>>   -1/-1 (stderr threshold)
>>   max_recent     10000
>>   max_new         1000
>>   log_file /var/log/ceph/ceph-osd.19.log
>> --- end dump of recent events ---
>>
>> Since the ceph-osd objdump is too large to put in a mail, I will not
>> attach it, but if it is needed I'll find a way to share it. What might be
>> the cause? Can anyone help me with this? Thanks.
>>
>> _______________________________________________
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
