Do you have a hardware watchdog running in the system? A watchdog can
trigger a power-down when a monitored value crosses its threshold. Are
there any event logs from the chassis itself?
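
For example (a sketch, assuming wdctl from util-linux and ipmitool are
installed; a watchdog-triggered reset usually leaves a trace in the BMC's
System Event Log):

```shell
# Is a watchdog device exposed, and which driver backs it?
if [ -e /dev/watchdog ] && command -v wdctl >/dev/null 2>&1; then
    wdctl /dev/watchdog
else
    echo "no /dev/watchdog (or wdctl not installed)"
fi

# Last entries of the BMC System Event Log.
if command -v ipmitool >/dev/null 2>&1; then
    ipmitool sel elist | tail -n 20
else
    echo "ipmitool not installed"
fi
```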

Kind regards,

Caspar

2018-07-21 10:31 GMT+02:00 Nicolas Huillard <[email protected]>:

> Hi all,
>
> One of my servers silently shut down last night, with no explanation
> whatsoever in any log. According to the existing logs, the shutdown
> (without reboot) happened between 03:58:20.061452 (the last timestamp in
> /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (a new MON
> election was called, to which oxygene didn't answer).
>
> Is there any way in which Ceph could silently shutdown a server?
> Can SMART self-test influence scrubbing or compaction?
>
> The only thing I have is that smartd started a long self-test on both
> OSD spinning drives on that host:
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting
> scheduled Long Self-Test.
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in
> progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in
> progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous
> self-test completed without error
>
> ...and smartctl now reports that the self-tests didn't finish (on both
> drives):
> # 1  Extended offline    Interrupted (host reset)      00%     10636
>     -
>
> The logs on oxygene mention RocksDB compactions a few minutes before the
> shutdown (MON log), and a deep-scrub that finished earlier (OSD log):
> /var/log/ceph/ceph-osd.6.log
> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log [DBG]
> : 6.1d deep-scrub starts
> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log [DBG]
> : 6.1d deep-scrub ok
> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >>
> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801
> s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0
> l=1).handle_connect_msg: challenging authorizer
>
> /var/log/ceph/ceph-mgr.oxygene.log
> 2018-07-21 03:58:16.060137 7fbcd3777700  1 mgr send_beacon standby
> 2018-07-21 03:58:18.060733 7fbcd3777700  1 mgr send_beacon standby
> 2018-07-21 03:58:20.061452 7fbcd3777700  1 mgr send_beacon standby
>
> /var/log/ceph/ceph-mon.oxygene.log
> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log Time
> 2018/07/21-03:52:27.702302) [/build/ceph-12.2.7/src/
> rocksdb/db/db_impl_compaction_flush.cc:1392] [default] Manual compaction
> from level-0 to level-1 from 'mgrstat .. '
> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB
> 1746] Compacting 1@0 + 1@1 files to L1, score -1.00
> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default]
> Compaction start summary: Base version 1745 Base level 0, inputs:
> [149507(602KB)], [149505(13MB)]
> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947702334, "job": 1746, "event":
> "compaction_started", "files_L0": [149507], "files_L1": [149505], "score":
> -1, "input_data_size": 14916379}
> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB
> 1746] Generated table #149508: 4904 keys, 14808953 bytes
> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746,
> "event": "table_file_creation", "file_number": 149508, "file_size":
> 14808953, "table_properties": {"data
> 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB
> 1746] Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing
> levels in DB than needed. max_bytes_for_level_multiplier may not be
> guaranteed.
> 2018-07-21 03:52:27.791640 7f25b5406700  4 rocksdb: (Original Log Time
> 2018/07/21-03:52:27.791526) [/build/ceph-12.2.7/src/
> rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1
> max bytes base 26843546 files[0 1 0 0 0 0 0]
> 2018-07-21 03:52:27.791657 7f25b5406700  4 rocksdb: (Original Log Time
> 2018/07/21-03:52:27.791563) EVENT_LOG_v1 {"time_micros": 1532137947791548,
> "job": 1746, "event": "compaction_finished", "compaction_time_micros":
> 83261, "output_level"
> 2018-07-21 03:52:27.792024 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947792019, "job": 1746, "event":
> "table_file_deletion", "file_number": 149507}
> 2018-07-21 03:52:27.796596 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947796592, "job": 1746, "event":
> "table_file_deletion", "file_number": 149505}
> 2018-07-21 03:52:27.796690 7f25b6408700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:839]
> [default] Manual compaction starting
> ...
> 2018-07-21 03:53:33.404428 7f25b5406700  4 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB
> 1748] Compacted 1@0 + 1@1 files to L1 => 14274825 bytes
> 2018-07-21 03:53:33.404460 7f25b5406700  3 rocksdb:
> [/build/ceph-12.2.7/src/rocksdb/db/version_set.cc:2087] More existing
> levels in DB than needed. max_bytes_for_level_multiplier may not be
> guaranteed.
> 2018-07-21 03:53:33.408360 7f25b5406700  4 rocksdb: (Original Log Time
> 2018/07/21-03:53:33.408228) [/build/ceph-12.2.7/src/
> rocksdb/db/compaction_job.cc:621] [default] compacted to: base level 1
> max bytes base 26843546 files[0 1 0 0 0 0 0]
> 2018-07-21 03:53:33.408381 7f25b5406700  4 rocksdb: (Original Log Time
> 2018/07/21-03:53:33.408275) EVENT_LOG_v1 {"time_micros": 1532138013408255,
> "job": 1748, "event": "compaction_finished", "compaction_time_micros":
> 84964, "output_level"
> 2018-07-21 03:53:33.408647 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532138013408641, "job": 1748, "event":
> "table_file_deletion", "file_number": 149510}
> 2018-07-21 03:53:33.413854 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532138013413849, "job": 1748, "event":
> "table_file_deletion", "file_number": 149508}
> 2018-07-21 03:54:27.634782 7f25bdc17700  0 
> mon.oxygene@3(peon).data_health(66142)
> update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:55:27.635318 7f25bdc17700  0 
> mon.oxygene@3(peon).data_health(66142)
> update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:56:27.635923 7f25bdc17700  0 
> mon.oxygene@3(peon).data_health(66142)
> update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:57:27.636464 7f25bdc17700  0 
> mon.oxygene@3(peon).data_health(66142)
> update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
>
> I can see no evidence of intrusion or anything (network or physical).
> I'm not even sure it was a clean shutdown rather than a hard reset, but
> there's no evidence of any fsck replaying a journal during reboot either.
> The server restarted without problem and the cluster is now HEALTH_OK.
>
> Hardware:
> * ASRock Rack motherboards (the BMC/IPMI may have reset the server for
> no reason)
> * Seagate ST4000VN008 OSD drives
>
> --
> Nicolas Huillard
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
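
A side note for cross-checking the timeline: the "time_micros" fields in
the rocksdb EVENT_LOG_v1 lines are microseconds since the Unix epoch, so
they can be compared against the local timestamps in the log (assuming a
UTC+2 timezone here, matching the GMT+02:00 header of the quoted message;
GNU date assumed):

```shell
# time_micros of the first compaction_started event in the mon log
us=1532137947702334
date -u -d "@$((us / 1000000))" '+%F %T UTC'
# -> 2018-07-21 01:52:27 UTC
TZ=Europe/Paris date -d "@$((us / 1000000))" '+%F %T'
# -> 2018-07-21 03:52:27, matching the log line's local timestamp
```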