Thanks!

I knew about noscrub, but I didn't realize that the flapping would cancel a scrub in progress.


So the scrub doesn't appear to be the reason it wasn't recovering. After a flap, it goes into:
2014-04-02 14:11:09.776810 mon.0 [INF] pgmap v5323181: 2592 pgs: 2591 active+clean, 1 active+recovery_wait; 15066 GB data, 30527 GB used, 29060 GB / 59588 GB avail; 1/36666878 objects degraded (0.000%); 0 B/s, 11 keys/s, 2 objects/s recovering

It stays in that state until the OSD gets kicked out again.
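If more detail would help, the commands I'd use to pull it are along these lines (the PG id below is just a placeholder for the one sitting in recovery_wait):

ceph health detail
ceph pg dump | grep recovery_wait
ceph pg <pgid> query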


The problem is that the flapping OSD is spamming its logs with:
2014-04-02 14:12:01.242425 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15

None of the other OSDs are saying that.

Is there anything I can do to repair the heartbeat map on osd.11?
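The only stopgap I've come up with is raising the op thread timeouts on that one OSD, something along these lines (the option names are the stock osd op thread timeout settings as far as I know, and the values are just guesses rather than anything I've tested):

ceph tell osd.11 injectargs '--osd-op-thread-timeout 60'
ceph tell osd.11 injectargs '--osd-op-thread-suicide-timeout 300'

or the equivalent in ceph.conf:

[osd.11]
    osd op thread timeout = 60
    osd op thread suicide timeout = 300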




In case it helps, here are the osd.11 logs after a daemon restart:
2014-04-02 14:10:58.267556 7f3467ff6780 0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-osd, pid 7791
2014-04-02 14:10:58.269782 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs
2014-04-02 14:10:58.269789 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) disabling 'filestore replica fadvise' due to known issues with fadvise(DONTNEED) on xfs
2014-04-02 14:10:58.306112 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
2014-04-02 14:10:58.306135 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-04-02 14:10:58.308070 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-04-02 14:10:58.357102 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-04-02 14:10:58.360837 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-04-02 14:10:58.360851 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.422842 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 20: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.423241 7f3467ff6780 1 journal close /var/lib/ceph/osd/ceph-11/journal
2014-04-02 14:10:58.424433 7f3467ff6780 1 filestore(/var/lib/ceph/osd/ceph-11) mount detected xfs
2014-04-02 14:10:58.442963 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is supported and appears to work
2014-04-02 14:10:58.442974 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-04-02 14:10:58.445144 7f3467ff6780 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-11) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-04-02 14:10:58.451977 7f3467ff6780 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: checkpoint is not enabled
2014-04-02 14:10:58.454481 7f3467ff6780 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-04-02 14:10:58.454495 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.465211 7f3467ff6780 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 21: 6442450944 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-04-02 14:10:58.466825 7f3467ff6780 0 <cls> cls/hello/cls_hello.cc:271: loading cls_hello
2014-04-02 14:10:58.468745 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for clients
2014-04-02 14:10:58.468756 7f3467ff6780 0 osd.11 11688 crush map has features 1073741824, adjusting msgr requires for osds
2014-04-02 14:11:07.822045 7f343de58700 0 -- 10.194.0.7:6800/7791 >> 10.194.0.7:6822/14075 pipe(0x1c96e000 sd=177 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3000).accept connect_seq 0 vs existing 0 state connecting
2014-04-02 14:11:07.822182 7f343f973700 0 -- 10.194.0.7:6800/7791 >> 10.194.0.7:6806/26942 pipe(0x1c96e280 sd=82 :6800 s=0 pgs=0 cs=0 l=0 c=0x1b7e3160).accept connect_seq 0 vs existing 0 state connecting
2014-04-02 14:11:20.333163 7f344a97d700 1 heartbeat_map is_healthy 'OSD::op_tp thread 0x7f3447977700' had timed out after 15
<snip repeats>
2014-04-02 14:13:35.310407 7f344a97d700 -1 common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f344a97d700 time 2014-04-02 14:13:35.308718
common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
1: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d*, char const*, long)+0x107) [0x89df87]
 2: (ceph::HeartbeatMap::is_healthy()+0xa7) [0x89e937]
 3: (OSD::handle_osd_ping(MOSDPing*)+0x85b) [0x60f79b]
 4: (OSD::heartbeat_dispatch(Message*)+0x4d3) [0x610a23]
 5: (DispatchQueue::entry()+0x549) [0x9f4aa9]
 6: (DispatchQueue::DispatchThread::entry()+0xd) [0x92ffdd]
 7: (()+0x7e9a) [0x7f346721ae9a]
 8: (clone()+0x6d) [0x7f3465cbe3fd]
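
If I'm reading that right, the op_tp thread stayed stuck long enough to hit the suicide timeout, so the daemon deliberately aborts. Assuming the admin socket commands are available on this version, I was planning to dump what the OSD thinks it's working on just before it dies, e.g.:

ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.11.asok dump_historic_ops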

All of the other OSDs are spamming:
2014-04-02 14:15:45.275858 7f4eb1e35700 -1 osd.7 11697 heartbeat_check: no reply from osd.11 since back 2014-04-02 14:13:56.927261 front 2014-04-02 14:13:56.927261 (cutoff 2014-04-02 14:15:25.275855)
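
While I sort this out I'm tempted to set noout so the cluster doesn't start rebalancing every time osd.11 gets kicked, roughly:

ceph osd set noout
ceph osd dump | grep flags    # confirm the flag is set
ceph osd tree                 # see how the cluster currently views osd.11

(That's only meant as a temporary measure; I'd unset it afterwards with 'ceph osd unset noout'.)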






*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com


On 4/2/14 13:38, Sage Weil wrote:
On Wed, 2 Apr 2014, Craig Lewis wrote:
Is there any way to cancel a scrub on a PG?


I have an OSD that's recovering, and there's a single PG left waiting:
2014-04-02 13:15:39.868994 mon.0 [INF] pgmap v5322756: 2592 pgs: 2589 active+clean, 1 active+recovery_wait, 2 active+clean+scrubbing+deep; 15066 GB data, 30527 GB used, 29061 GB / 59588 GB avail; 1/36666878 objects degraded (0.000%)

The PG that is in recovery_wait is on the same OSD that is being deep
scrubbed.  I don't have journals on SSD, so recovery and scrubbing are heavily
throttled.  I want to cancel the scrub so the recovery can complete.  I'll
manually restart the deep scrub when it's done.
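
The throttling is just the usual recovery/backfill knobs turned down, roughly this in ceph.conf (values here are illustrative rather than my exact settings):

[osd]
    osd max backfills = 1
    osd recovery max active = 1
    osd recovery op priority = 1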

Normally I'd just wait, but this OSD is flapping.  It keeps getting kicked out
of the cluster for being unresponsive.  I'm hoping that if I cancel the scrub,
it will allow the recovery to complete and the OSD will stop flapping.

You can run 'ceph osd set noscrub' to prevent a new scrub from starting.
Next time it flaps the scrub won't restart.  The only way to cancel an
in-progress scrub is to force a peering event, usually by manually marking
the osd down (ceph osd down N).
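
For example (using osd.11 from this thread; nodeep-scrub is the matching flag for deep scrubs, if it's available in your release):

ceph osd set noscrub
ceph osd set nodeep-scrub    # likewise stops new deep scrubs from starting
ceph osd down 11             # force a peering event, cancelling the in-progress scrub
ceph osd unset noscrub       # and unset nodeep-scrub once recovery finishes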

sage

