Hello,
I have a cluster with 30 OSDs distributed over 3 storage servers connected by a
10G cluster link and connected to the monitor over 1G. I still have a lot to
understand with Ceph. Watching the cluster messages in a "ceph -w" window,
I see a lot of OSD "flapping" while the cluster sits idle in its configured
state, and placement groups (PGs) constantly changing status. The cluster was
configured and came up with all 1920 PGs 'active+clean'.
The three status outputs below were captured over the course of about two
minutes. As you can see there is a lot of activity: I'm assuming the OSD
reporting occasionally falls outside the heartbeat timeout, and various PGs
get set to 'stale' and/or 'degraded' while still 'active'. OSDs are being
marked down in the OSD map; I see them reported as failures in the watch
window, and they very quickly respond with "wrongly marked me down". I'm
assuming I need to tune some of the many timeout values so that the OSDs and
PGs can all report within the timeout window.
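For example, is something like the following in ceph.conf the right direction?
The specific values are just guesses on my part; as far as I know everything
is still at the defaults (I haven't changed any of these yet):

[mon]
    ; require more peers to agree before the mon marks an OSD down
    ; (the config show below reports 1 and 3 respectively)
    mon osd min down reporters = 3
    mon osd min down reports = 5
[osd]
    ; give peers longer before they report an OSD as failed
    ; (grace defaults to 20s; the failures below are at ~23s)
    osd heartbeat interval = 6
    osd heartbeat grace = 35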
A quick look at the OSD's --admin-daemon 'config show' output tells me that I
might consider tuning some of these values:
[root@ceph0 ceph]# ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config
show | grep report
"mon_osd_report_timeout": "900",
"mon_osd_min_down_reporters": "1",
"mon_osd_min_down_reports": "3",
"osd_mon_report_interval_max": "120",
"osd_mon_report_interval_min": "5",
"osd_pg_stat_report_interval_max": "500",
[root@ceph0 ceph]#
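The failure messages below show a grace of roughly 22 seconds, which I'm
assuming comes from 'osd heartbeat grace' (default 20, I believe) plus the
automatic adjustment from 'mon osd adjust heartbeat grace', rather than from
the report settings above. The same config show with a different grep should
show what the OSD is actually running with:

ceph --admin-daemon /var/run/ceph/ceph-osd.20.asok config show | grep heartbeat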
Which OSD and/or mon settings should I increase/decrease to eliminate all this
state flapping while the cluster sits configured with no data?
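And if bumping the heartbeat grace is the right fix, is it reasonable to test
it at runtime with injectargs before committing it to ceph.conf? Something
like (value picked arbitrarily):

ceph tell osd.* injectargs '--osd-heartbeat-grace 35'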
Thanks,
Bruce
2014-08-23 13:16:15.564932 mon.0 [INF] osd.20 209.243.160.83:6800/20604 failed
(65 reports from 20 peers after 23.380808 >= grace 21.991016)
2014-08-23 13:16:15.565784 mon.0 [INF] osd.23 209.243.160.83:6810/29727 failed
(79 reports from 20 peers after 23.675170 >= grace 21.990903)
2014-08-23 13:16:15.566038 mon.0 [INF] osd.25 209.243.160.83:6808/31984 failed
(65 reports from 20 peers after 23.380921 >= grace 21.991016)
2014-08-23 13:16:15.566206 mon.0 [INF] osd.26 209.243.160.83:6811/518 failed
(65 reports from 20 peers after 23.381043 >= grace 21.991016)
2014-08-23 13:16:15.566372 mon.0 [INF] osd.27 209.243.160.83:6822/2511 failed
(65 reports from 20 peers after 23.381195 >= grace 21.991016)
.
.
.
2014-08-23 13:17:09.547684 osd.20 [WRN] map e27128 wrongly marked me down
2014-08-23 13:17:10.826541 osd.23 [WRN] map e27130 wrongly marked me down
2014-08-23 13:20:09.615826 mon.0 [INF] osdmap e27134: 30 osds: 26 up, 30 in
2014-08-23 13:17:10.954121 osd.26 [WRN] map e27130 wrongly marked me down
2014-08-23 13:17:19.125177 osd.25 [WRN] map e27135 wrongly marked me down
[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
health HEALTH_OK
monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2,
quorum 0 ceph-mon01
osdmap e26636: 30 osds: 30 up, 30 in
pgmap v56534: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26586 MB used, 109 TB / 109 TB avail
1920 active+clean
[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
health HEALTH_WARN 160 pgs degraded; 83 pgs stale
monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2,
quorum 0 ceph-mon01
osdmap e26641: 30 osds: 30 up, 30 in
pgmap v56545: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26558 MB used, 109 TB / 109 TB avail
83 stale+active+clean
160 active+degraded
1677 active+clean
[root@ceph-mon01 ceph]# ceph -s
cluster f919f2e4-8e3c-45d1-a2a8-29bc604f9f7d
health HEALTH_OK
monmap e1: 1 mons at {ceph-mon01=209.243.160.84:6789/0}, election epoch 2,
quorum 0 ceph-mon01
osdmap e26657: 30 osds: 30 up, 30 in
pgmap v56584: 1920 pgs, 3 pools, 0 bytes data, 0 objects
26610 MB used, 109 TB / 109 TB avail
1920 active+clean
[root@ceph-mon01 ceph]#