I just want to point out that there are many types of network issues
that don't involve the entire network: a bad NIC, a bad or loose cable, a
service on a server restarting or modifying the network stack, etc.
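
If you want to rule out host-level network hardware quickly, something
like this helps (a rough sketch for a Linux host; "eth0" is a placeholder
for whatever interface your mon/mds traffic actually uses):

  # Interface error/drop counters; non-zero RX/TX errors often point
  # at a bad NIC or cable
  ip -s link show eth0

  # Driver-level counters; many NICs report CRC/alignment errors here
  ethtool -S eth0 | grep -iE 'err|drop|crc'

  # The kernel ring buffer records link flaps from a loose cable or a
  # resetting NIC
  dmesg -T | grep -i eth0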

That said, there are other things that can prevent an MDS service, or any
service, from responding to the mons and thus being wrongly marked down.
It happens to OSDs often enough that they even have the ability to note in
their logs that they were wrongly marked down. That usually happens when
the service is so busy with an operation that it can't respond to the
mon's heartbeat fast enough, so it gets marked down. This could also be
environmental on the MDS server: if something else on the host is
consuming so many resources that the MDS service is starved of what it
needs, this could easily happen.
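
If you want to see whether your OSDs have been hitting the same thing, you
can grep for that message; the resource check below is a sketch that
assumes the default /var/log/ceph layout and that sysstat (sar) is
collecting on the MDS host:

  # OSDs log when they notice the mons failed them incorrectly
  grep -h "wrongly marked me down" /var/log/ceph/ceph-osd.*.log

  # CPU and memory on the MDS host around the incident window from your logs
  sar -u -s 23:12:00 -e 23:15:00
  sar -r -s 23:12:00 -e 23:15:00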

What level of granularity do you have in your monitoring to tell what your
system state was when this happened? Is there a time of day when it is more
likely to happen (expect to find a cron job scheduled at that time)?
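
To chase the cron angle, something along these lines (log paths differ by
distro: /var/log/syslog on Debian/Ubuntu, /var/log/cron on RHEL/CentOS):

  # What is scheduled, system-wide and for root
  crontab -l
  ls /etc/cron.d /etc/cron.hourly /etc/cron.daily

  # What cron actually ran around 23:13 on the day of the outage
  grep CRON /var/log/syslog | grep '23:1'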

On Wed, Aug 9, 2017, 8:37 AM Webert de Souza Lima <webert.b...@gmail.com>
wrote:

> Hi,
>
> I recently had an MDS outage because the MDS suicided due to "dne in the
> mds map".
> I've asked about it here before, and I know it happens because the
> monitors removed this MDS from the MDS map even though it was alive.
>
> Weird thing is, there were no network-related issues happening at the
> time; if there had been, they would have impacted many other systems.
>
> I found this in the mon logs, and I'd like to understand it better:
>  lease_timeout -- calling new election
>
> full logs:
>
> 2017-08-08 23:12:33.286908 7f2b8398d700  1 leveldb: Manual compaction at level-1 from 'pgmap_pg\x009.a' @ 1830392430 : 1 .. 'paxos\x0057687834' @ 0 : 0; will stop at (end)
>
> 2017-08-08 23:12:36.885087 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2(peon).data_health(3524) update_stats avail 81% total 19555 MB, used 2632 MB, avail 15907 MB
> 2017-08-08 23:13:29.357625 7f2b86f9a700  1 mon.bhs1-mail02-ds03@2(peon).paxos(paxos updating c 57687834..57688383) lease_timeout -- calling new election
> 2017-08-08 23:13:29.358965 7f2b86799700  0 log_channel(cluster) log [INF] : mon.bhs1-mail02-ds03 calling new monitor election
> 2017-08-08 23:13:29.359128 7f2b86799700  1 mon.bhs1-mail02-ds03@2(electing).elector(3524) init, last seen epoch 3524
> 2017-08-08 23:13:35.383530 7f2b86799700  1 mon.bhs1-mail02-ds03@2(peon).osd e12617 e12617: 19 osds: 19 up, 19 in
> 2017-08-08 23:13:35.605839 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds e18460 print_map
> e18460
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
>
> Filesystem 'cephfs' (2)
> fs_name cephfs
> epoch   18460
> flags   0
> created 2016-08-01 11:07:47.592124
> modified        2017-07-03 10:32:44.426431
> tableserver     0
> root    0
> session_timeout 60
> session_autoclose       300
> max_file_size   1099511627776
> last_failure    0
> last_failure_osd_epoch  12617
> compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=file layout v2}
> max_mds 1
> in      0
> up      {0=1574278}
> failed
> damaged
> stopped
> data_pools      8,9
> metadata_pool   7
> inline_data     disabled
> 1574278:        10.0.2.4:6800/2556733458 'd' mds.0.18460 up:replay seq 1 laggy since 2017-08-08 23:13:35.174109 (standby for rank 0)
>
>
>
> 2017-08-08 23:13:35.606303 7f2b86799700  0 log_channel(cluster) log [INF] : mon.bhs1-mail02-ds03 calling new monitor election
> 2017-08-08 23:13:35.606361 7f2b86799700  1 mon.bhs1-mail02-ds03@2(electing).elector(3526) init, last seen epoch 3526
> 2017-08-08 23:13:36.885540 7f2b86f9a700  0 mon.bhs1-mail02-ds03@2(peon).data_health(3528) update_stats avail 81% total 19555 MB, used 2636 MB, avail 15903 MB
> 2017-08-08 23:13:38.311777 7f2b86799700  0 mon.bhs1-mail02-ds03@2(peon).mds e18461 print_map
>
>
> Regards,
>
> Webert Lima
> DevOps Engineer at MAV Tecnologia
> *Belo Horizonte - Brasil*
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
