Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
So the cluster has been dead and down since around 8/10/2016. I have since rebooted the cluster in order to try the new ceph-monstore-tool rebuild functionality. I built the debian packages for the recently backported hammer tools and installed them across all of the servers:

root@kh08-8:/home/lacadmin# ceph --version
ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)

From here I ran the following:

--
#!/bin/bash
set -e
store="/home/localadmin/monstore/"

rm -rf "${store}"
mkdir -p "${store}"

for host in kh{08..10}-{1..7}; do
    rsync -Pav "${store}" "${host}:${store}"
    for osd in $(ssh ${host} 'ls /var/lib/ceph/osd/ | grep ceph-'); do
        echo "${osd}"
        ssh ${host} "sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/${osd} --journal-path /var/lib/ceph/osd/${osd}/journal --op update-mon-db --mon-store-path ${store}"
    done
    ssh ${host} "sudo chown lacadmin. ${store}"
    rsync -Pav "${host}:${store}" "${store}"
done
--

which generated a 1.1G store.db directory.

From here I ran the following (per the github guide -- https://github.com/ceph/ceph/blob/master/doc/rados/troubleshooting/troubleshooting-mon.rst):

ceph-authtool ./admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

which gave me the following keyring:

--
[mon.]
        key =
        caps mon = "allow *"
[client.admin]
        key =
        caps mds = "allow *"
        caps mon = "allow *"
        caps osd = "allow *"
--

The above looks like it shouldn't work but going with it.
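The empty `key =` fields are likely why it looks wrong: without --gen-key (and --create-keyring for a new file), ceph-authtool only sets caps and never generates a key. A sketch of what the troubleshooting-mon guide intends (paths are assumptions; for cephx to keep working against the existing OSDs, the client.admin key should ideally match the one already in /etc/ceph/ceph.client.admin.keyring rather than a freshly generated one):

```shell
# create the keyring file and generate an actual mon. key
ceph-authtool ./admin.keyring --create-keyring --gen-key -n mon. --cap mon 'allow *'
# add a client.admin entry with a generated key and full caps
ceph-authtool ./admin.keyring --gen-key -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
# sanity-check that both entries now carry non-empty keys
cat ./admin.keyring
```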
I tried using the monstore tool to rebuild based on the monstore grabbed from all 630 of the osds, but I am met with a dump T_T

--
ceph-monstore-tool /home/localadmin/monstore rebuild -- --keyring /home/localadmin/admin.keyring
*** Caught signal (Segmentation fault) **
 in thread 7f10cd6d88c0
 ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)
 1: ceph-monstore-tool() [0x5e960a]
 2: (()+0x10330) [0x7f10cc5c8330]
 3: (strlen()+0x2a) [0x7f10cac629da]
 4: (std::basic_string, std::allocator >::basic_string(char const*, std::allocator const&)+0x25) [0x7f10cb576d75]
 5: (rebuild_monstore(char const*, std::vector >&, MonitorDBStore&)+0x878) [0x544958]
 6: (main()+0x3e05) [0x52c035]
 7: (__libc_start_main()+0xf5) [0x7f10cabfbf45]
 8: ceph-monstore-tool() [0x540347]
2017-02-06 17:35:59.885651 7f10cd6d88c0 -1 *** Caught signal (Segmentation fault) **
 in thread 7f10cd6d88c0
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
--
--- begin dump of recent events ---
   -15> 2017-02-06 17:35:54.362066 7f10cd6d88c0  5 asok(0x355a000) register_command perfcounters_dump hook 0x350a0d0
   -14> 2017-02-06 17:35:54.362122 7f10cd6d88c0  5 asok(0x355a000) register_command 1 hook 0x350a0d0
   -13> 2017-02-06 17:35:54.362137 7f10cd6d88c0  5 asok(0x355a000) register_command perf dump hook 0x350a0d0
   -12> 2017-02-06 17:35:54.362147 7f10cd6d88c0  5 asok(0x355a000) register_command perfcounters_schema hook 0x350a0d0
   -11> 2017-02-06 17:35:54.362157 7f10cd6d88c0  5 asok(0x355a000) register_command 2 hook 0x350a0d0
   -10> 2017-02-06 17:35:54.362161 7f10cd6d88c0  5 asok(0x355a000) register_command perf schema hook 0x350a0d0
    -9> 2017-02-06 17:35:54.362170 7f10cd6d88c0  5 asok(0x355a000) register_command perf reset hook 0x350a0d0
    -8> 2017-02-06 17:35:54.362179 7f10cd6d88c0  5 asok(0x355a000) register_command config show hook 0x350a0d0
    -7> 2017-02-06 17:35:54.362188 7f10cd6d88c0  5 asok(0x355a000) register_command config set hook 0x350a0d0
    -6> 2017-02-06 17:35:54.362193 7f10cd6d88c0  5 asok(0x355a000) register_command config get hook 0x350a0d0
    -5> 2017-02-06 17:35:54.362202 7f10cd6d88c0  5 asok(0x355a000) register_command config diff hook 0x350a0d0
    -4> 2017-02-06 17:35:54.362207 7f10cd6d88c0  5 asok(0x355a000) register_command log flush hook 0x350a0d0
    -3> 2017-02-06 17:35:54.362215 7f10cd6d88c0  5 asok(0x355a000) register_command log dump hook 0x350a0d0
    -2> 2017-02-06 17:35:54.362220 7f10cd6d88c0  5 asok(0x355a
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
So with a patched leveldb to skip errors I now have a store.db that I can extract the pg, mon, and osd maps from. That said, when I try to start kh10-8 it bombs out:

---
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8# ceph-mon -i $(hostname) -d
2016-08-13 22:30:54.596039 7fa8b9e088c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653
starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608150 7fa8b9e088c0  0 starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608395 7fa8b9e088c0  1 mon.kh10-8@-1(probing) e1 preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608617 7fa8b9e088c0  1 mon.kh10-8@-1(probing).paxosservice(pgmap 0..35606392) refresh upgraded, format 0 -> 1
2016-08-13 22:30:54.608629 7fa8b9e088c0  1 mon.kh10-8@-1(probing).pg v0 on_upgrade discarding in-core PGMap
terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
 in thread 7fa8b9e088c0
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10: (object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167) [0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
2016-08-13 22:30:54.611791 7fa8b9e088c0 -1 *** Caught signal (Aborted) **
 in thread 7fa8b9e088c0
NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this.
---
--- begin dump of recent events ---
   -33> 2016-08-13 22:30:54.593450 7fa8b9e088c0  5 asok(0x36a20f0) register_command perfcounters_dump hook 0x365a050
   -32> 2016-08-13 22:30:54.593480 7fa8b9e088c0  5 asok(0x36a20f0) register_command 1 hook 0x365a050
   -31> 2016-08-13 22:30:54.593486 7fa8b9e088c0  5 asok(0x36a20f0) register_command perf dump hook 0x365a050
   -30> 2016-08-13 22:30:54.593496 7fa8b9e088c0  5 asok(0x36a20f0) register_command perfcounters_schema hook 0x365a050
   -29> 2016-08-13 22:30:54.593499 7fa8b9e088c0  5 asok(0x36a20f0) register_command 2 hook 0x365a050
   -28> 2016-08-13 22:30:54.593501 7fa8b9e088c0  5 asok(0x36a20f0) register_command perf schema hook 0x365a050
   -27> 2016-08-13 22:30:54.593503 7fa8b9e088c0  5 asok(0x36a20f0) register_command perf reset hook 0x365a050
   -26> 2016-08-13 22:30:54.593505 7fa8b9e088c0  5 asok(0x36a20f0) register_command config show hook 0x365a050
   -25> 2016-08-13 22:30:54.593508 7fa8b9e088c0  5 asok(0x36a20f0) register_command config set hook 0x365a050
   -24> 2016-08-13 22:30:54.593510 7fa8b9e088c0  5 asok(0x36a20f0) register_command config get hook 0x365a050
   -23> 2016-08-13 22:30:54.593512 7fa8b9e088c0  5 asok(0x36a20f0) register_command config diff hook 0x365a050
   -22> 2016-08-13 22:30:54.593513 7fa8b9e088c0  5 asok(0x36a20f0) register_command log flush hook 0x365a050
   -21> 2016-08-13 22:30:54.593557 7fa8b9e088c0  5 asok(0x36a20f0) register_command log dump h
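The abort above is pg_stat_t::decode walking off the end of a truncated value, which is what ceph::buffer::end_of_buffer signals and is consistent with a torn write in the pgmap keys. A toy illustration of the failure mode (not Ceph's real encoding; assumes a little-endian host):

```shell
# A "value" here is a 4-byte length prefix plus payload; decoding a
# truncated copy runs off the end, like the mon does on the torn pgmap key.
printf '\x05\x00\x00\x00hello' > /tmp/value.bin   # intact record
head -c 6 /tmp/value.bin > /tmp/torn.bin          # cut mid-payload
decode() {
    local f=$1
    local n=$(od -An -tu4 -N4 "$f" | tr -d ' ')   # read the length prefix
    local have=$(( $(wc -c < "$f") - 4 ))         # payload bytes actually present
    if [ "$have" -lt "$n" ]; then
        echo "end_of_buffer: need $n bytes, have $have" >&2
        return 1
    fi
    tail -c +5 "$f"                               # emit the payload
}
decode /tmp/value.bin            # prints: hello
decode /tmp/torn.bin || true     # reports end_of_buffer
```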
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
A coworker patched leveldb and we were able to export quite a bit of data from kh08's leveldb database. At this point I think I need to reconstruct a new leveldb with whatever values I can.

Is it the same leveldb database across all 3 monitors? I.e. will keys exported from one work in the other? All should have the same keys/values although constructed differently, right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/ from one host to another, right? But can I copy the keys/values from one to another?

On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan wrote:
> ceph-monstore-tool? Is that the same as monmaptool? oops! NM, found it in
> the ceph-test package::
>
> I can't seem to get it working :-( with dump monmap or any of the other
> commands. They all bomb out with the same message:
>
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-keys
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
>
> I need to clarify, as I originally had 2 clusters with this issue and now I
> have 1 with all 3 monitors dead and 1 that I was successfully able to
> repair. I am about to recap everything I know about the issue and the issue
> at hand. Should I start a new email thread about this instead?
>
> The cluster that is currently having issues is on hammer (0.94.7), and the
> monitor stats are the same::
>
> root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
>      24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>
> ext4 volume comprised of 4x300GB 10k drives in raid 10.
> ubuntu 14.04
>
> root@kh08-8:~# uname -a
> Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@kh08-8:~# ceph --version
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>
> From here: here are the errors I am getting when starting each of the
> monitors::
>
> ---
> root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
> 2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
> Corruption: error in middle of record
> 2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
> --
> root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
> 2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
> Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/store.db/10845998.ldb
> 2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
> --
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon --cluster=ceph -i kh10-8 -d
> 2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
> 2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
> ---
>
> For kh08, a coworker patched leveldb to print and skip on the first error,
> and that one is also missing a bunch of files. As such I think kh10-8 is my
> most likely candidate to recover, but either way recovery is probably not
> an option.
> I see leveldb has a repair.cc (https://github.com/google/leveldb/blob/master/db/repair.cc)
> but I do not see repair mentioned in the monitor with respect to the
> dbstore. I tried using the leveldb python module (plyvel) to attempt a
> repair but my repl just ends up dying.
>
> I understand two things::
> 1.) Without rebuilding the monitor backend leveldb store (the cluster map
> as I understand it), all of the data in the cluster is essentially lost
> (right?)
> 2.) It is possible to rebuild this database via some form of magic or
> (source)ry, as all of this data is essentially held throughout the cluster
> as well.
>
> We only use radosgw / S3 for this cluster. If there is a way to recover my
> data that is easier/more likely than rebuilding the leveldb of a monitor
> and starting a single-monitor cluster up, I would like to switch gears and
> focus on that.
>
> Looking at the dev docs:
> http://docs.ceph.com/docs/hammer/architecture/#cluster-map
> it has 5 main parts::
>
> ```
> The Monitor Map: Contains the cluster fsid, the position, name, address and
> port of each monitor. It also indicates the current epoch, when the map was
> created, and the last time it changed. To view a monit
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
ceph-monstore-tool? Is that the same as monmaptool? oops! NM, found it in the ceph-test package::

I can't seem to get it working :-( with dump monmap or any of the other commands. They all bomb out with the same message:

root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool /var/lib/ceph/mon/ceph-kh10-8 dump-keys
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb

I need to clarify, as I originally had 2 clusters with this issue and now I have 1 with all 3 monitors dead and 1 that I was successfully able to repair. I am about to recap everything I know about the issue and the issue at hand. Should I start a new email thread about this instead?

The cluster that is currently having issues is on hammer (0.94.7), and the monitor stats are the same::

root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
     24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz

ext4 volume comprised of 4x300GB 10k drives in raid 10.
ubuntu 14.04

root@kh08-8:~# uname -a
Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
root@kh08-8:~# ceph --version
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)

From here: here are the errors I am getting when starting each of the monitors::

---
root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
Corruption: error in middle of record
2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
--
root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/store.db/10845998.ldb
2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
--
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon --cluster=ceph -i kh10-8 -d
2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/store.db/10882319.ldb
2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
---

For kh08, a coworker patched leveldb to print and skip on the first error, and that one is also missing a bunch of files. As such I think kh10-8 is my most likely candidate to recover, but either way recovery is probably not an option. I see leveldb has a repair.cc (https://github.com/google/leveldb/blob/master/db/repair.cc) but I do not see repair mentioned in the monitor with respect to the dbstore.
I tried using the leveldb python module (plyvel) to attempt a repair but my repl just ends up dying.

I understand two things::
1.) Without rebuilding the monitor backend leveldb store (the cluster map as I understand it), all of the data in the cluster is essentially lost (right?)
2.) It is possible to rebuild this database via some form of magic or (source)ry, as all of this data is essentially held throughout the cluster as well.

We only use radosgw / S3 for this cluster. If there is a way to recover my data that is easier/more likely than rebuilding the leveldb of a monitor and starting a single-monitor cluster up, I would like to switch gears and focus on that.

Looking at the dev docs:
http://docs.ceph.com/docs/hammer/architecture/#cluster-map
it has 5 main parts::

```
The Monitor Map: Contains the cluster fsid, the position, name, address and port of each monitor. It also indicates the current epoch, when the map was created, and the last time it changed. To view a monitor map, execute ceph mon dump.

The OSD Map: Contains the cluster fsid, when the map was created and last modified, a list of pools, replica sizes, PG numbers, a list of OSDs and their status (e.g., up, in). To view an OSD map, execute ceph osd dump.

The PG Map: Contains the PG version, its time stamp, the last OSD map epoch, the full ratios, and details on each placement group such as the PG ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean), and data usage statistics for each pool.

The CRUSH Map: Contains a list of storage devices, the failure domain hierarchy (e.g., device, host, rack, row, room, etc.), and rules for traversing the hierarchy when storing data. To view a CRUSH map, execute ceph osd getcrushmap -o {fil
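For reference, the commands that architecture doc names for viewing each of those maps, collected in one place (these all need a working monitor quorum, which is exactly what is missing here, so they are for after recovery; filenames are placeholders):

```shell
ceph mon dump                        # monitor map
ceph osd dump                        # OSD map (pools, replica sizes, OSD status)
ceph pg dump                         # PG map
ceph osd getcrushmap -o crush.bin    # CRUSH map (compiled binary)
crushtool -d crush.bin -o crush.txt  # decompile it into readable text
```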
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
> On 11 August 2016 at 15:17, Sean Sullivan wrote:
>
>
> Hello Wido,
>
> Thanks for the advice. While the data center has A/B circuits and
> redundant power, etc., if a ground fault happens it travels outside and
> fails, causing the whole building to fail (apparently).
>
> The monitors are each the same with:
> 2x E5 CPUs
> 64GB of RAM
> 4x 300GB 10k SAS drives in raid 10 (write-through mode)
> Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 - 3am CST)
> Ceph hammer LTS 0.94.7
>
> (we are still working on our jewel test cluster so it is planned but not
> in place yet)
>
> The only thing that seems to be corrupt is the monitors' leveldb store. I
> see multiple issues on the Google leveldb github from March 2016 about
> fsync and power failure, so I assume this is an issue with leveldb.
>
> I have backed up /var/lib/ceph/mon on all of my monitors before trying to
> proceed with any form of recovery.
>
> Is there any way to reconstruct the leveldb or replace the monitors and
> recover the data?

I don't know. I have never done it. Other people might know this better than me.

Maybe 'ceph-monstore-tool' can help you?

Wido

> I found the following post in which Sage says it is tedious but possible
> (http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine
> if I have any chance of doing it. I have the fsid, the mon key map, and
> all of the osds look to be fine, so all of the previous osd maps are there.
>
> I just don't understand what key/values I need inside.
>
> On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote:
>
> > On 11 August 2016 at 0:10, Sean Sullivan <
> > seapasu...@uchicago.edu> wrote:
> >
> >
> > > I think it just got worse::
> > >
> > > all three monitors on my other cluster say that ceph-mon can't open
> > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> > > all 3 monitors?
> > > I saw a post by Sage saying that the data can be recovered as
> > > all of the data is held on other servers. Is this possible? If so has
> > > anyone had any experience doing so?
> >
> > I have never done so, so I couldn't tell you.
> >
> > However, it is weird that on all three it got corrupted. What hardware
> > are you using? Was it properly protected against power failure?
> >
> > If your mon store is corrupted I'm not sure what might happen.
> >
> > However, make a backup of ALL monitors right now before doing anything.
> >
> > Wido

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
Hello Wido,

Thanks for the advice. While the data center has A/B circuits and redundant power, etc., if a ground fault happens it travels outside and fails, causing the whole building to fail (apparently).

The monitors are each the same with:
2x E5 CPUs
64GB of RAM
4x 300GB 10k SAS drives in raid 10 (write-through mode)
Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 - 3am CST)
Ceph hammer LTS 0.94.7

(we are still working on our jewel test cluster so it is planned but not in place yet)

The only thing that seems to be corrupt is the monitors' leveldb store. I see multiple issues on the Google leveldb github from March 2016 about fsync and power failure, so I assume this is an issue with leveldb.

I have backed up /var/lib/ceph/mon on all of my monitors before trying to proceed with any form of recovery.

Is there any way to reconstruct the leveldb or replace the monitors and recover the data?

I found the following post in which Sage says it is tedious but possible (http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if I have any chance of doing it. I have the fsid, the mon key map, and all of the osds look to be fine, so all of the previous osd maps are there.

I just don't understand what key/values I need inside.

On Aug 11, 2016 1:33 AM, "Wido den Hollander" wrote:

> On 11 August 2016 at 0:10, Sean Sullivan <
> seapasu...@uchicago.edu> wrote:
>
>
> > I think it just got worse::
> >
> > all three monitors on my other cluster say that ceph-mon can't open
> > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> > all 3 monitors? I saw a post by Sage saying that the data can be
> > recovered as all of the data is held on other servers. Is this possible?
> > If so has anyone had any experience doing so?
>
> I have never done so, so I couldn't tell you.
>
> However, it is weird that on all three it got corrupted. What hardware are
> you using? Was it properly protected against power failure?
>
> If your mon store is corrupted I'm not sure what might happen.
>
> However, make a backup of ALL monitors right now before doing anything.
>
> Wido
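Seconding the "backup of ALL monitors" advice above: one way to snapshot every mon data directory before touching anything (hostnames taken from this thread; run with the mons stopped so the leveldb files are quiescent):

```shell
# archive each monitor's data dir to a dated tarball on that host
for h in kh08-8 kh09-8 kh10-8; do
    ssh "$h" 'sudo tar czf /root/mon-backup-$(hostname)-$(date +%F).tar.gz /var/lib/ceph/mon'
done
```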
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
I'm guessing you had the writeback cache enabled on the ceph-mon disks (smartctl -g wcache /dev/sdX) and the disk firmware did not care about respecting flush semantics.

On 11.08.2016 08:33, Wido den Hollander wrote:
>
>> On 11 August 2016 at 0:10, Sean Sullivan wrote:
>>
>>
>> I think it just got worse::
>>
>> all three monitors on my other cluster say that ceph-mon can't open
>> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
>> 3 monitors? I saw a post by Sage saying that the data can be recovered as
>> all of the data is held on other servers. Is this possible? If so has
>> anyone had any experience doing so?
>
> I have never done so, so I couldn't tell you.
>
> However, it is weird that on all three it got corrupted. What hardware are
> you using? Was it properly protected against power failure?
>
> If your mon store is corrupted I'm not sure what might happen.
>
> However, make a backup of ALL monitors right now before doing anything.
>
> Wido

--
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com
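If that is the cause, the volatile write cache can be checked and, for drives without power-loss protection, disabled so that leveldb's fsync actually reaches stable media (device names are placeholders; which command applies depends on whether the drives are ATA or SAS):

```shell
smartctl -g wcache /dev/sdX      # query the volatile write cache setting
smartctl -s wcache,off /dev/sdX  # disable it via smartmontools
hdparm -W0 /dev/sdX              # ATA alternative
sdparm --clear=WCE /dev/sdX      # SAS/SCSI alternative
```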
Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now
> On 11 August 2016 at 0:10, Sean Sullivan wrote:
>
>
> I think it just got worse::
>
> all three monitors on my other cluster say that ceph-mon can't open
> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
> 3 monitors? I saw a post by Sage saying that the data can be recovered as
> all of the data is held on other servers. Is this possible? If so has
> anyone had any experience doing so?

I have never done so, so I couldn't tell you.

However, it is weird that on all three it got corrupted. What hardware are you using? Was it properly protected against power failure?

If your mon store is corrupted I'm not sure what might happen.

However, make a backup of ALL monitors right now before doing anything.

Wido