I forgot to mention that this server, like all the other Ceph servers
in my cluster, runs nothing but Ceph, and that each of them runs all
the Ceph daemons (mon, mgr, mds, 2×osd).
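
For reference, a quick way to see what is actually colocated on a host
(standard systemd unit names; adjust if your distribution packages Ceph
differently):

  # list every Ceph daemon systemd knows about on this host
  systemctl list-units 'ceph-*' --type=service --all
  # or check what is actually running right now
  ps -eo comm,args | grep -E '^ceph-(mon|mgr|mds|osd)'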

On Saturday, 21 July 2018 at 10:31 +0200, Nicolas Huillard wrote:
> Hi all,
> 
> One of my servers silently shut down last night, with no explanation
> whatsoever in any log. According to the existing logs, the shutdown
> (without reboot) happened between 03:58:20.061452 (the last timestamp
> in /var/log/ceph/ceph-mgr.oxygene.log) and 03:59:01.515308 (a new MON
> election was called, which oxygene didn't answer).
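> 
> For anyone who wants to double-check the timeline, this is roughly
> what I looked at (journalctl -b -1 assumes persistent journald
> storage, which may not be enabled on every setup):
> 
>   # last messages of the previous boot, i.e. right before the event
>   journalctl -b -1 -n 50
>   # shutdown/reboot records from wtmp
>   last -x shutdown reboot | head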
> 
> Is there any way in which Ceph could silently shut down a server?
> Can a SMART self-test influence scrubbing or compaction?
> 
> The only thing I can find is that smartd started a long self-test on
> both spinning OSD drives on that host:
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sda [SAT], starting scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdb [SAT], starting scheduled Long Self-Test.
> Jul 21 03:21:35 oxygene smartd[712]: Device: /dev/sdc [SAT], starting scheduled Long Self-Test.
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sda [SAT], self-test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdb [SAT], self-test in progress, 90% remaining
> Jul 21 03:51:35 oxygene smartd[712]: Device: /dev/sdc [SAT], previous self-test completed without error
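> 
> I assume the schedule comes from a smartd.conf directive along these
> lines (the exact line on my hosts is from memory, so treat it as a
> guess):
> 
>   # /etc/smartd.conf: long self-test on all SAT devices,
>   # Saturdays (day-of-week 6) during the 03:00 hour
>   DEVICESCAN -a -d sat -s L/../../6/03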
> 
> ...and smartctl now says that the self-tests didn't finish (on both
> drives):
> # 1  Extended offline    Interrupted (host reset)      00%     10636         -
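> 
> That line comes from the device's self-test log; re-checking it and
> restarting the test by hand is just:
> 
>   # show the self-test history, including the interrupted run
>   smartctl -l selftest /dev/sda
>   # start a new long (extended) self-test in the background
>   smartctl -t long /dev/sda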
> 
> The MON log on oxygene talks about a RocksDB compaction a few minutes
> before the shutdown, and a deep-scrub had finished earlier:
> /var/log/ceph/ceph-osd.6.log
> 2018-07-21 03:32:54.086021 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub starts
> 2018-07-21 03:34:31.185549 7fd15d82c700  0 log_channel(cluster) log [DBG] : 6.1d deep-scrub ok
> 2018-07-21 03:43:36.720707 7fd178082700  0 -- 172.22.0.16:6801/478362 >> 172.21.0.16:6800/1459922146 conn(0x556f0642b800 :6801 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg: challenging authorizer
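> 
> For completeness, the scrub timing can also be cross-checked from the
> cluster side (6.1d is the PG from the log above):
> 
>   # last (deep-)scrub timestamps recorded for that PG
>   ceph pg 6.1d query | grep -i scrub_stamp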
> 
> /var/log/ceph/ceph-mgr.oxygene.log
> 2018-07-21 03:58:16.060137 7fbcd3777700  1 mgr send_beacon standby
> 2018-07-21 03:58:18.060733 7fbcd3777700  1 mgr send_beacon standby
> 2018-07-21 03:58:20.061452 7fbcd3777700  1 mgr send_beacon standby
> 
> /var/log/ceph/ceph-mon.oxygene.log
> 2018-07-21 03:52:27.702314 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.702302) [/build/ceph-
> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:1392] [default]
> Manual compaction from level-0 to level-1 from 'mgrstat .. '
> 2018-07-21 03:52:27.702321 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1403] [default] [JOB 1746]
> Compacting 1@0 + 1@1 files to L1, score -1.00
> 2018-07-21 03:52:27.702329 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1407] [default] Compaction
> start summary: Base version 1745 Base level 0, inputs:
> [149507(602KB)], [149505(13MB)]
> 2018-07-21 03:52:27.702348 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947702334, "job": 1746, "event":
> "compaction_started", "files_L0": [149507], "files_L1": [149505],
> "score": -1, "input_data_size": 14916379}
> 2018-07-21 03:52:27.785532 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1116] [default] [JOB 1746]
> Generated table #149508: 4904 keys, 14808953 bytes
> 2018-07-21 03:52:27.785587 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947785565, "cf_name": "default", "job": 1746,
> "event": "table_file_creation", "file_number": 149508, "file_size":
> 14808953, "table_properties": {"data
> 2018-07-21 03:52:27.785627 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1746]
> Compacted 1@0 + 1@1 files to L1 => 14808953 bytes
> 2018-07-21 03:52:27.785656 7f25b5406700  3 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB
> than needed. max_bytes_for_level_multiplier may not be guaranteed.
> 2018-07-21 03:52:27.791640 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.791526) [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to:
> base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
> 2018-07-21 03:52:27.791657 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:52:27.791563) EVENT_LOG_v1 {"time_micros":
> 1532137947791548, "job": 1746, "event": "compaction_finished",
> "compaction_time_micros": 83261, "output_level"
> 2018-07-21 03:52:27.792024 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947792019, "job": 1746, "event":
> "table_file_deletion", "file_number": 149507}
> 2018-07-21 03:52:27.796596 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532137947796592, "job": 1746, "event":
> "table_file_deletion", "file_number": 149505}
> 2018-07-21 03:52:27.796690 7f25b6408700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/db_impl_compaction_flush.cc:839] [default]
> Manual compaction starting
> ...
> 2018-07-21 03:53:33.404428 7f25b5406700  4 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:1173] [default] [JOB 1748]
> Compacted 1@0 + 1@1 files to L1 => 14274825 bytes
> 2018-07-21 03:53:33.404460 7f25b5406700  3 rocksdb: [/build/ceph-
> 12.2.7/src/rocksdb/db/version_set.cc:2087] More existing levels in DB
> than needed. max_bytes_for_level_multiplier may not be guaranteed.
> 2018-07-21 03:53:33.408360 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:53:33.408228) [/build/ceph-
> 12.2.7/src/rocksdb/db/compaction_job.cc:621] [default] compacted to:
> base level 1 max bytes base 26843546 files[0 1 0 0 0 0 0]
> 2018-07-21 03:53:33.408381 7f25b5406700  4 rocksdb: (Original Log
> Time 2018/07/21-03:53:33.408275) EVENT_LOG_v1 {"time_micros":
> 1532138013408255, "job": 1748, "event": "compaction_finished",
> "compaction_time_micros": 84964, "output_level"
> 2018-07-21 03:53:33.408647 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532138013408641, "job": 1748, "event":
> "table_file_deletion", "file_number": 149510}
> 2018-07-21 03:53:33.413854 7f25b5406700  4 rocksdb: EVENT_LOG_v1
> {"time_micros": 1532138013413849, "job": 1748, "event":
> "table_file_deletion", "file_number": 149508}
> 2018-07-21 03:54:27.634782 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:55:27.635318 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:56:27.635923 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
> 2018-07-21 03:57:27.636464 7f25bdc17700  0 mon.oxygene@3(peon).data_health(66142) update_stats avail 79% total 4758 MB, used 991 MB, avail 3766 MB
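> 
> If the compaction itself is a suspect, it should be easy to trigger
> one on demand and see whether the node misbehaves (standard monitor
> admin command):
> 
>   # ask the oxygene monitor to compact its store.db on demand
>   ceph tell mon.oxygene compact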
> 
> I can see no evidence of intrusion or anything else (network or
> physical). I'm not even sure it was a clean shutdown rather than a
> hard reset, but there is no evidence of any fsck replaying a journal
> during the reboot either.
> The server restarted without problems and the cluster is now
> HEALTH_OK.
> 
> Hardware:
> * ASRock Rack mobos (the BMC/IPMI may have reset the server for no
> reason; see the ipmitool check below)
> * Seagate ST4000VN008 (IronWolf) OSD drives
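> 
> If the BMC did reset the box, its event log should say so; something
> along these lines, assuming ipmitool can reach the local BMC:
> 
>   # dump the BMC's System Event Log (power/reset events show up here)
>   ipmitool sel elist
>   # current power state and last power event seen by the BMC
>   ipmitool chassis status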
> 
-- 
Nicolas Huillard
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
