Hi All,

 

I was noticing poor performance on my cluster, and when I went to investigate
I found that OSD 29 was flapping up and down. On closer inspection the disk
appears to have 2 pending sectors, and the kernel log is filled with the following:

 

end_request: critical medium error, dev sdk, sector 4483365656

end_request: critical medium error, dev sdk, sector 4483365872
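For anyone wanting to confirm this independently of the kernel log, the pending-sector count is exposed as SMART attribute 197 (Current_Pending_Sector). A minimal sketch of pulling the raw value out of `smartctl -A` output; the attribute line below is a captured sample standing in for a live `smartctl -A /dev/sdk` run, which obviously needs the real disk:

```shell
# Sample attribute line as smartctl -A would print it for this disk
# (hypothetical capture; on a real system pipe `smartctl -A /dev/sdk` instead).
sample_line="197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2"

# The raw value is the last field on the attribute line.
pending=$(echo "$sample_line" | awk '{print $NF}')
echo "pending sectors: $pending"
```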

 

From the OSD logs it looks like the OSD was crashing while trying to scrub the
PG, probably failing when the kernel passed up the read error.

 

ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)

1: /usr/bin/ceph-osd() [0xacaf4a]

2: (()+0x10340) [0x7fdc43032340]

3: (gsignal()+0x39) [0x7fdc414d1cc9]

4: (abort()+0x148) [0x7fdc414d50d8]

5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fdc41ddc6b5]

6: (()+0x5e836) [0x7fdc41dda836]

7: (()+0x5e863) [0x7fdc41dda863]

8: (()+0x5eaa2) [0x7fdc41ddaaa2]

9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0xbc2908]

10: (FileStore::read(coll_t, ghobject_t const&, unsigned long, unsigned
long, ceph::buffer::list&, unsigned int, bool)+0xc98) [0x9168e8]

11: (ReplicatedBackend::be_deep_scrub(hobject_t const&, unsigned int,
ScrubMap::object&, ThreadPool::TPHandle&)+0x2f9) [0xa05bf9]

12: (PGBackend::be_scan_list(ScrubMap&, std::vector<hobject_t,
std::allocator<hobject_t> > const&, bool, unsigned int,
ThreadPool::TPHandle&)+0x2c8) [0x8dab98]

13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t, bool,
unsigned int, ThreadPool::TPHandle&)+0x1fa) [0x7f099a]

14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x4a2)
[0x7f1132]

15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xbe)
[0x6e583e]

16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa5e) [0xbb38ae]

17: (ThreadPool::WorkThread::entry()+0x10) [0xbb4950]

18: (()+0x8182) [0x7fdc4302a182]

19: (clone()+0x6d) [0x7fdc4159547d]

 

A few questions:

1. Is this the expected behaviour, or should Ceph try to do something,
either keep the OSD down or rewrite the sector to cause a sector remap?

2. I am monitoring SMART stats, but is there any other way of picking
this up or getting Ceph to highlight it? Something like a flapping-OSD
notification would be nice.

3. I'm assuming that at this stage the disk will not be replaceable under
warranty. Am I best to mark it out, let it drain and then re-introduce it
again, which should overwrite the sector and cause a remap? Or is there a
better way?
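For question 3, the sequence I have in mind is sketched below. The `ceph osd out` / `ceph osd in` commands are the real CLI, but they need a live cluster, so this sketch just prints the intended order of operations (OSD id 29 from the report above); whether backfill actually rewrites every object and remaps the sectors is exactly what I'm asking:

```shell
# Hedged sketch of the drain-and-reintroduce sequence, not a definitive fix.
osd_id=29

for cmd in \
    "ceph osd out $osd_id                  # start draining PGs off the OSD" \
    "ceph -w                               # watch until recovery completes (HEALTH_OK)" \
    "ceph osd in $osd_id                   # backfill should rewrite objects, remapping bad sectors"
do
  echo "$cmd"
done
```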

 

Many Thanks,

Nick




_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
