Hi,

We had a monitor drop out of quorum a few weeks back, and we have been unable
to bring it back into sync.

On startup it synchronises the OSD maps, and then it restarts the sync from
scratch, every time.

We turned the mon debug logging up to level 20 to see what is going on.
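
For reference, this is roughly how we raised the log level on the
synchronizing mon (a sketch using the admin socket; the same change should
also be possible via injectargs):

    # raise mon debug logging to 20 on the stuck monitor
    ceph daemon mon.ceph-mon-02 config set debug_mon 20/20

With that in place, each pass through the sync looks like this: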


2017-11-07 09:31:57.230333 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 bytes 
last_key osdmap,1405040) v2
2017-11-07 09:31:57.230336 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 951895 
bytes last_key osdmap,1405040) v2
2017-11-07 09:31:57.296945 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
sync_reset_timeout
2017-11-07 09:31:57.296975 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 
sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190967 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 
_ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:31:58.190978 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6  
caps allow *
2017-11-07 09:31:58.190999 7f2da95ff700 20 is_capable service=mon command= read 
on cap allow *
2017-11-07 09:31:58.191003 7f2da95ff700 20  allow so far , doing grant allow *
2017-11-07 09:31:58.191005 7f2da95ff700 20  allow all
2017-11-07 09:31:58.191008 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
handle_sync mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 bytes 
last_key osdmap,883766) v2
2017-11-07 09:31:58.191013 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
handle_sync_chunk mon_sync(chunk cookie 213137752259 lc 128962692 bl 1048394 
bytes last_key osdmap,883766) v2
2017-11-07 09:31:58.315140 7f2da95ff700 10 mon.ceph-mon-02@1(synchronizing) e6 
sync_reset_timeout
2017-11-07 09:31:58.315170 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 
sync_get_next_chunk cookie 213137752259 provider mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418905 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6 
_ms_dispatch existing session 0x561f3c7f2c40 for mon.2 10.132.194.132:6789/0
2017-11-07 09:32:00.418918 7f2da95ff700 20 mon.ceph-mon-02@1(synchronizing) e6  
caps allow *
2017-11-07 09:32:00.418932 7f2da95ff700 20 is_capable service=mon command= read 
on cap allow *
2017-11-07 09:32:00.418936 7f2da95ff700 20  allow so far , doing grant allow *
2017-11-07 09:32:00.418943 7f2da95ff700 20  allow all


As you can see, it gets as far as the most recent osdmap (last_key
osdmap,1405040 in the first chunk above), and then the next chunk starts over
from the first epoch it holds (last_key osdmap,883766).

There is no message in the log giving any indication of why it is restarting,
although I would not be surprised if the monitor is simply too far behind by
the time it reaches this point, i.e. the sync times out (or the provider's
store moves on) before a full pass can complete.
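
If it is a timeout-style restart, one mitigation we are considering (a sketch
only, applied on the joining mon, and untested on our side) is giving the sync
more headroom and smaller chunks:

    [mon]
    # default is 60 seconds without a chunk before the sync gives up;
    # give a very large store more time
    mon_sync_timeout = 300
    # the chunks in the log above are around the 1048576-byte default;
    # smaller chunks keep each round-trip well inside the timeout
    mon_sync_max_payload_size = 262144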

Unfortunately our cluster has been unhealthy for a while, due to effectively 
running out of space. This is why there are so many OSDMaps hanging around at 
the moment: the mons will not trim old maps while the cluster is unhealthy. 
The cluster is now tending back towards health, at which point it should 
conduct the mother of all cleanup operations fairly shortly.
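
For a sense of scale, this is roughly how we have been watching the osdmap
range the mons hold on to (assuming your version's "ceph report" output
carries the osdmap_first_committed / osdmap_last_committed fields, as ours
does):

    ceph report 2>/dev/null | grep -E 'osdmap_(first|last)_committed'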

Any help/pointers would be greatly appreciated. Our ceph.conf is below for 
reference.


[global]
fsid = caaaba57-1ec7-461a-9211-ea166d311820
mon initial members = ceph-mon-01, ceph-mon-03, ceph-mon-05

# We should REALLY use DNS here, eg mon.txc1.us.livelink.io
# librados understands RR DNS
mon host = 10.132.194.128,10.132.194.130,10.132.194.132
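# critical free-space threshold (%) for the mon data store (default 5);
# lowered because we are tight on space everywhere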
mon data avail crit = 1
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

rbd default features = 3
osd_map_message_max = 20

rados osd op timeout = 60

[mon]
mon_osd_min_up_ratio = 0.7
# This is a big cluster, we should stay low on PGs
mon_pg_warn_min_per_osd = 20
mon_pg_warn_max_per_osd = 200
mon_warn_on_legacy_crush_tunables = false


# Prevent flapping of OSDs marking each other down
mon_osd_min_down_reporters = 10
mon_osd_min_down_reports = 5





Stuart Harland
Infrastructure Engineer
s.harl...@livelinktechnology.net
Tel: +44 (0)207 183 1411


LiveLink Technology Ltd
McCormack House
56A East Street
Havant
PO9 1BS
