Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2017-02-06 Thread Sean Sullivan
So the cluster has been dead and down since around 8/10/2016. I have since
rebooted the cluster in order to try the new ceph-monstore-tool rebuild
functionality.

I built the Debian packages for the hammer tools that were recently
backported and installed them across all of the servers:

root@kh08-8:/home/lacadmin# ceph --version
ceph version 0.94.9-4530-g83af8cd (83af8cdaaa6d94404e6146b68e532a784e3cc99c)


From here I ran the following:
--
#!/bin/bash
set -e
store="/home/localadmin/monstore/"

rm -rf "${store}"
mkdir -p "${store}"

for host in kh{08..10}-{1..7}; do
    # push the current store to the host, then fold each OSD's copy of the
    # cluster maps into it with ceph-objectstore-tool
    rsync -Pav "${store}" "${host}:${store}"
    for osd in $(ssh "${host}" 'ls /var/lib/ceph/osd/ | grep "ceph-"'); do
        echo "${osd}"
        ssh "${host}" "sudo ceph-objectstore-tool --data-path /var/lib/ceph/osd/${osd} \
            --journal-path /var/lib/ceph/osd/${osd}/journal \
            --op update-mon-db --mon-store-path ${store}"
    done
    # make the store readable by the rsync user and pull it back,
    # so the next host builds on top of it
    ssh "${host}" "sudo chown lacadmin. ${store}"
    rsync -Pav "${host}:${store}" "${store}"
done
--

Which generated a 1.1G store.db directory
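(Aside: before attempting the rebuild it may be worth confirming that the gathered store is readable at all. A minimal check, assuming the dump-keys subcommand used later in this thread also works against this copied store:)

--
du -sh /home/localadmin/monstore/store.db
ceph-monstore-tool /home/localadmin/monstore dump-keys | head
--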

From here I ran the following (per the GitHub guide:
https://github.com/ceph/ceph/blob/master/doc/rados/troubleshooting/troubleshooting-mon.rst):

ceph-authtool ./admin.keyring -n mon. --cap mon 'allow *'
ceph-authtool -n client.admin --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'

which gave me the following keyring::
--
[mon.]
key = 
caps mon = "allow *"
[client.admin]
key = 
caps mds = "allow *"
caps mon = "allow *"
caps osd = "allow *"
--
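(Note: neither ceph-authtool invocation above asks for --gen-key, and the second one is not given a keyring file at all, which may be why it looks wrong. As a point of comparison, here is a minimal sketch of how the keyring step in the linked guide is normally written; the keyring path is a placeholder, so treat the exact commands as an assumption rather than what was actually run here:)

--
# create a keyring and generate a key for mon., then add a client.admin key to it
ceph-authtool /tmp/admin.keyring --create-keyring --gen-key -n mon. --cap mon 'allow *'
ceph-authtool /tmp/admin.keyring --gen-key -n client.admin \
    --cap mon 'allow *' --cap osd 'allow *' --cap mds 'allow *'
--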

That keyring looks like it shouldn't work, but going with it: I tried using the
monstore tool to rebuild from the monstore gathered from all 630 of the OSDs,
but I am met with a crash dump T_T

--
ceph-monstore-tool /home/localadmin/monstore rebuild -- --keyring /home/localadmin/admin.keyring

*** Caught signal (Segmentation fault) **
 in thread 7f10cd6d88c0
 ceph version 0.94.9-4530-g83af8cd
(83af8cdaaa6d94404e6146b68e532a784e3cc99c)
 1: ceph-monstore-tool() [0x5e960a]
 2: (()+0x10330) [0x7f10cc5c8330]
 3: (strlen()+0x2a) [0x7f10cac629da]
 4: (std::basic_string::basic_string(char const*, std::allocator const&)+0x25)
[0x7f10cb576d75]
 5: (rebuild_monstore(char const*, std::vector&, MonitorDBStore&)+0x878) [0x544958]
 6: (main()+0x3e05) [0x52c035]
 7: (__libc_start_main()+0xf5) [0x7f10cabfbf45]
 8: ceph-monstore-tool() [0x540347]
2017-02-06 17:35:59.885651 7f10cd6d88c0 -1 *** Caught signal (Segmentation
fault) **
 in thread 7f10cd6d88c0

 ceph version 0.94.9-4530-g83af8cd
(83af8cdaaa6d94404e6146b68e532a784e3cc99c)
 1: ceph-monstore-tool() [0x5e960a]
 2: (()+0x10330) [0x7f10cc5c8330]
 3: (strlen()+0x2a) [0x7f10cac629da]
 4: (std::basic_string::basic_string(char const*, std::allocator const&)+0x25)
[0x7f10cb576d75]
 5: (rebuild_monstore(char const*, std::vector&, MonitorDBStore&)+0x878) [0x544958]
 6: (main()+0x3e05) [0x52c035]
 7: (__libc_start_main()+0xf5) [0x7f10cabfbf45]
 8: ceph-monstore-tool() [0x540347]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
   -15> 2017-02-06 17:35:54.362066 7f10cd6d88c0  5 asok(0x355a000)
register_command perfcounters_dump hook 0x350a0d0
   -14> 2017-02-06 17:35:54.362122 7f10cd6d88c0  5 asok(0x355a000)
register_command 1 hook 0x350a0d0
   -13> 2017-02-06 17:35:54.362137 7f10cd6d88c0  5 asok(0x355a000)
register_command perf dump hook 0x350a0d0
   -12> 2017-02-06 17:35:54.362147 7f10cd6d88c0  5 asok(0x355a000)
register_command perfcounters_schema hook 0x350a0d0
   -11> 2017-02-06 17:35:54.362157 7f10cd6d88c0  5 asok(0x355a000)
register_command 2 hook 0x350a0d0
   -10> 2017-02-06 17:35:54.362161 7f10cd6d88c0  5 asok(0x355a000)
register_command perf schema hook 0x350a0d0
-9> 2017-02-06 17:35:54.362170 7f10cd6d88c0  5 asok(0x355a000)
register_command perf reset hook 0x350a0d0
-8> 2017-02-06 17:35:54.362179 7f10cd6d88c0  5 asok(0x355a000)
register_command config show hook 0x350a0d0
-7> 2017-02-06 17:35:54.362188 7f10cd6d88c0  5 asok(0x355a000)
register_command config set hook 0x350a0d0
-6> 2017-02-06 17:35:54.362193 7f10cd6d88c0  5 asok(0x355a000)
register_command config get hook 0x350a0d0
-5> 2017-02-06 17:35:54.362202 7f10cd6d88c0  5 asok(0x355a000)
register_command config diff hook 0x350a0d0
-4> 2017-02-06 17:35:54.362207 7f10cd6d88c0  5 asok(0x355a000)
register_command log flush hook 0x350a0d0
-3> 2017-02-06 17:35:54.362215 7f10cd6d88c0  5 asok(0x355a000)

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-13 Thread Sean Sullivan
So with a patched leveldb that skips errors I now have a store.db that I can
extract the pg, mon, and osd maps from. That said, when I try to start kh10-8
it bombs out::

---
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8# ceph-mon -i $(hostname) -d
2016-08-13 22:30:54.596039 7fa8b9e088c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 708653
starting mon.kh10-8 rank 2 at 10.64.64.125:6789/0 mon_data
/var/lib/ceph/mon/ceph-kh10-8 fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608150 7fa8b9e088c0  0 starting mon.kh10-8 rank 2 at
10.64.64.125:6789/0 mon_data /var/lib/ceph/mon/ceph-kh10-8 fsid
e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608395 7fa8b9e088c0  1 mon.kh10-8@-1(probing) e1
preinit fsid e452874b-cb29-4468-ac7f-f8901dfccebf
2016-08-13 22:30:54.608617 7fa8b9e088c0  1
mon.kh10-8@-1(probing).paxosservice(pgmap
0..35606392) refresh upgraded, format 0 -> 1
2016-08-13 22:30:54.608629 7fa8b9e088c0  1 mon.kh10-8@-1(probing).pg v0
on_upgrade discarding in-core PGMap
terminate called after throwing an instance of 'ceph::buffer::end_of_buffer'
  what():  buffer::end_of_buffer
*** Caught signal (Aborted) **
 in thread 7fa8b9e088c0
 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10:
(object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
[0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
2016-08-13 22:30:54.611791 7fa8b9e088c0 -1 *** Caught signal (Aborted) **
 in thread 7fa8b9e088c0

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: ceph-mon() [0x9b25ea]
 2: (()+0x10330) [0x7fa8b8f0b330]
 3: (gsignal()+0x37) [0x7fa8b73a8c37]
 4: (abort()+0x148) [0x7fa8b73ac028]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fa8b7cb3535]
 6: (()+0x5e6d6) [0x7fa8b7cb16d6]
 7: (()+0x5e703) [0x7fa8b7cb1703]
 8: (()+0x5e922) [0x7fa8b7cb1922]
 9: ceph-mon() [0x853c39]
 10:
(object_stat_collection_t::decode(ceph::buffer::list::iterator&)+0x167)
[0x894227]
 11: (pg_stat_t::decode(ceph::buffer::list::iterator&)+0x5ff) [0x894baf]
 12: (PGMap::update_pg(pg_t, ceph::buffer::list&)+0xa3) [0x91a8d3]
 13: (PGMonitor::read_pgmap_full()+0x1d8) [0x68b9b8]
 14: (PGMonitor::update_from_paxos(bool*)+0xbf7) [0x6977b7]
 15: (PaxosService::refresh(bool*)+0x19a) [0x605b5a]
 16: (Monitor::refresh_from_paxos(bool*)+0x1db) [0x5b1ffb]
 17: (Monitor::init_paxos()+0x85) [0x5b2365]
 18: (Monitor::preinit()+0x7d7) [0x5b6f87]
 19: (main()+0x230c) [0x57853c]
 20: (__libc_start_main()+0xf5) [0x7fa8b7393f45]
 21: ceph-mon() [0x59a3c7]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed
to interpret this.

--- begin dump of recent events ---
   -33> 2016-08-13 22:30:54.593450 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perfcounters_dump hook 0x365a050
   -32> 2016-08-13 22:30:54.593480 7fa8b9e088c0  5 asok(0x36a20f0)
register_command 1 hook 0x365a050
   -31> 2016-08-13 22:30:54.593486 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf dump hook 0x365a050
   -30> 2016-08-13 22:30:54.593496 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perfcounters_schema hook 0x365a050
   -29> 2016-08-13 22:30:54.593499 7fa8b9e088c0  5 asok(0x36a20f0)
register_command 2 hook 0x365a050
   -28> 2016-08-13 22:30:54.593501 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf schema hook 0x365a050
   -27> 2016-08-13 22:30:54.593503 7fa8b9e088c0  5 asok(0x36a20f0)
register_command perf reset hook 0x365a050
   -26> 2016-08-13 22:30:54.593505 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config show hook 0x365a050
   -25> 2016-08-13 22:30:54.593508 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config set hook 0x365a050
   -24> 2016-08-13 22:30:54.593510 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config get hook 0x365a050
   -23> 2016-08-13 22:30:54.593512 7fa8b9e088c0  5 asok(0x36a20f0)
register_command config diff hook 0x365a050
   -22> 2016-08-13 22:30:54.593513 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log flush hook 0x365a050
   -21> 2016-08-13 22:30:54.593557 7fa8b9e088c0  5 asok(0x36a20f0)
register_command log dump 

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
A coworker patched leveldb and we were able to export quite a bit of data
from kh08's leveldb database. At this point I think I need to reconstruct
a new leveldb with whatever values I can. Is it the same leveldb database
across all 3 monitors? I.e., will keys exported from one work in the other?
All should have the same keys/values, although constructed differently,
right? I can't blindly copy /var/lib/ceph/mon/ceph-$(hostname)/store.db/
from one host to another, right? But can I copy the keys/values from one to
another?

On Fri, Aug 12, 2016 at 12:45 PM, Sean Sullivan wrote:

> ceph-monstore-tool? Is that the same as monmaptool? Oops! NM, found it in
> the ceph-test package::
>
> I can't seem to get it working :-( with dump monmap or any of the other
> commands. They all bomb out with the same message:
>
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
> /var/lib/ceph/mon/ceph-kh10-8 dump-keys
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
>
>
> I need to clarify: I originally had 2 clusters with this issue; one I was
> able to repair successfully, and the other now has all 3 monitors dead. I am
> about to recap everything I know about the issue at hand. Should I start a
> new email thread about this instead?
>
> The cluster that is currently having issues is on hammer (0.94.7), and the
> monitor specs are all the same::
> root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
>  24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
>  ext4 volume comprised of 4x300GB 10k drives in raid 10.
>  ubuntu 14.04
>
> root@kh08-8:~# uname -a
> Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC
> 2016 x86_64 x86_64 x86_64 GNU/Linux
> root@kh08-8:~# ceph --version
> ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
>
>
> From here, these are the errors I am getting when starting each of the
> monitors::
>
>
> ---
> root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
> 2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
> Corruption: error in middle of record
> 2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
> --
> root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
> 2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
> Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/
> store.db/10845998.ldb
> 2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
> --
> root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon
> --cluster=ceph -i kh10-8 -d
> 2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7
> (d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
> Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
> store.db/10882319.ldb
> 2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data
> directory at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
> ---
>
>
> For kh08, a coworker patched leveldb to print and skip on the first error,
> and that one is also missing a bunch of files. As such I think kh10-8 is my
> most likely candidate to recover, but either way recovery is probably not an
> option. I see leveldb has a repair.cc
> (https://github.com/google/leveldb/blob/master/db/repair.cc), but I do not
> see repair mentioned anywhere in the monitor code with respect to the db
> store. I tried using the leveldb python module (plyvel) to attempt a repair,
> but my REPL just ends up dying.
>
> I understand two things::
> 1.) Without rebuilding the monitor backend leveldb store (the cluster map,
> as I understand it), all of the data in the cluster is essentially lost
> (right?)
> 2.) It is possible to rebuild this database via some form of magic or
> (source)ry, as all of this data is essentially held throughout the cluster
> as well.
>
> We only use radosgw / S3 for this cluster. If there is a way to recover my
> data that is easier or more likely to succeed than rebuilding a monitor's
> leveldb and starting up a single-monitor cluster, I would like to switch
> gears and focus on that.
>
> Looking at the dev docs:
> http://docs.ceph.com/docs/hammer/architecture/#cluster-map
> it has 5 main parts::
>
> ```
> The Monitor Map: Contains the cluster fsid, the position, name address and
> port of each monitor. It also indicates the current epoch, when the map was
> created, and the last time 

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-12 Thread Sean Sullivan
ceph-monstore-tool? Is that the same as monmaptool? Oops! NM, found it in the
ceph-test package::

I can't seem to get it working :-( with dump monmap or any of the other
commands. They all bomb out with the same message:

root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
/var/lib/ceph/mon/ceph-kh10-8 dump-trace -- /tmp/test.trace
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# ceph-monstore-tool
/var/lib/ceph/mon/ceph-kh10-8 dump-keys
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb


I need to clarify: I originally had 2 clusters with this issue; one I was
able to repair successfully, and the other now has all 3 monitors dead. I am
about to recap everything I know about the issue at hand. Should I start a
new email thread about this instead?

The cluster that is currently having issues is on hammer (0.94.7), and the
monitor specs are all the same::
root@kh08-8:~# cat /proc/cpuinfo | grep -iE "model name" | uniq -c
 24 model name : Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
 ext4 volume comprised of 4x300GB 10k drives in raid 10.
 ubuntu 14.04

root@kh08-8:~# uname -a
Linux kh08-8 3.13.0-76-generic #120-Ubuntu SMP Mon Jan 18 15:59:10 UTC 2016
x86_64 x86_64 x86_64 GNU/Linux
root@kh08-8:~# ceph --version
ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)


From here, these are the errors I am getting when starting each of the
monitors::


---
root@kh08-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh08-8 -d
2016-08-11 22:15:23.731550 7fe5ad3e98c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 317309
Corruption: error in middle of record
2016-08-11 22:15:28.274340 7fe5ad3e98c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh08-8': (22) Invalid argument
--
root@kh09-8:~# /usr/bin/ceph-mon --cluster=ceph -i kh09-8 -d
2016-08-11 22:14:28.252370 7f7eaab908c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 30
Corruption: 14 missing files; e.g.: /var/lib/ceph/mon/ceph-kh09-8/
store.db/10845998.ldb
2016-08-11 22:14:35.094237 7f7eaab908c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh09-8': (22) Invalid argument
--
root@kh10-8:/var/lib/ceph/mon/ceph-kh10-8/store.db# /usr/bin/ceph-mon
--cluster=ceph -i kh10-8 -d
2016-08-11 22:17:54.632762 7f80bf34d8c0  0 ceph version 0.94.7
(d56bdf93ced6b80b07397d57e3fa68fe68304432), process ceph-mon, pid 292620
Corruption: 1 missing files; e.g.: /var/lib/ceph/mon/ceph-kh10-8/
store.db/10882319.ldb
2016-08-11 22:18:01.207749 7f80bf34d8c0 -1 error opening mon data directory
at '/var/lib/ceph/mon/ceph-kh10-8': (22) Invalid argument
---


For kh08, a coworker patched leveldb to print and skip on the first error,
and that one is also missing a bunch of files. As such I think kh10-8 is my
most likely candidate to recover, but either way recovery is probably not an
option. I see leveldb has a repair.cc
(https://github.com/google/leveldb/blob/master/db/repair.cc), but I do not
see repair mentioned anywhere in the monitor code with respect to the db
store. I tried using the leveldb python module (plyvel) to attempt a repair,
but my REPL just ends up dying.
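(For what it's worth, leveldb's RepairDB can also be driven from plyvel non-interactively instead of from a REPL; a minimal sketch, assuming plyvel is installed and with the store path left as a placeholder. Since repair_db rewrites the database in place, it should only ever be pointed at a copy of the backed-up store:)

# run leveldb's RepairDB against a *copy* of the mon store via plyvel
# (path is a placeholder; point it at a copy of store.db, not the original)
python -c "import plyvel; plyvel.repair_db('/path/to/copy/of/store.db')"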

I understand two things::
1.) Without rebuilding the monitor backend leveldb store (the cluster map, as
I understand it), all of the data in the cluster is essentially lost (right?)
2.) It is possible to rebuild this database via some form of magic or
(source)ry, as all of this data is essentially held throughout the cluster as
well.

We only use radosgw / S3 for this cluster. If there is a way to recover my
data that is easier or more likely to succeed than rebuilding a monitor's
leveldb and starting up a single-monitor cluster, I would like to switch
gears and focus on that.

Looking at the dev docs:
http://docs.ceph.com/docs/hammer/architecture/#cluster-map
it has 5 main parts::

```
The Monitor Map: Contains the cluster fsid, the position, name address and
port of each monitor. It also indicates the current epoch, when the map was
created, and the last time it changed. To view a monitor map, execute ceph
mon dump.
The OSD Map: Contains the cluster fsid, when the map was created and last
modified, a list of pools, replica sizes, PG numbers, a list of OSDs and
their status (e.g., up, in). To view an OSD map, execute ceph osd dump.
The PG Map: Contains the PG version, its time stamp, the last OSD map
epoch, the full ratios, and details on each placement group such as the PG
ID, the Up Set, the Acting Set, the state of the PG (e.g., active + clean),
and data usage statistics for each pool.
The CRUSH Map: Contains a list of storage devices, the failure domain
hierarchy (e.g., device, host, rack, row, room, etc.), and rules for
traversing the hierarchy when storing data. To view a CRUSH map, execute
ceph osd getcrushmap -o 

Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 15:17, Sean Sullivan wrote:
> 
> 
> Hello Wido,
> 
> Thanks for the advice. While the data center has A/B circuits, redundant
> power, etc., if a ground fault happens it apparently travels outside and
> causes the whole building to lose power.
> 
> The monitors are each the same with
> 2x e5 cpus
> 64gb of ram
> 4x 300gb 10k SAS drives in raid 10 (write through mode).
> Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
> 3am CST)
> Ceph hammer LTS 0.94.7
> 
> (we are still working on our jewel test cluster so it is planned but not in
> place yet)
> 
> The only thing that seems to be corrupt is the monitors' leveldb store. I
> see multiple issues on the Google leveldb GitHub from March 2016 about fsync
> and power failure, so I assume this is an issue with leveldb.
> 
> I have backed up /var/lib/ceph/mon on all of my monitors before trying to
> proceed with any form of recovery.
> 
> Is there any way to reconstruct the leveldb or replace the monitors and
> recover the data?
> 
I don't know. I have never done it. Other people might know this better than me.

Maybe 'ceph-monstore-tool' can help you?

Wido

> I found the following post in which Sage says it is tedious but possible
> (http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
> I have any chance of doing it. I have the fsid and the mon key map, and all
> of the OSDs look to be fine, so all of the previous osd maps are there.
> 
> I just don't understand what key/values I need inside.
> 
> On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:
> 
> >
> > > On 11 August 2016 at 0:10, Sean Sullivan <seapasu...@uchicago.edu> wrote:
> > >
> > >
> > > I think it just got worse::
> > >
> > > all three monitors on my other cluster say that ceph-mon can't open
> > > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> > all
> > > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > > all of the data is held on other servers. Is this possible? If so has
> > > anyone had any experience doing so?
> >
> > I have never done so, so I couldn't tell you.
> >
> > However, it is weird that on all three it got corrupted. What hardware are
> > you using? Was it properly protected against power failure?
> >
> > If your mon store is corrupted, I'm not sure what might happen.
> >
> > However, make a backup of ALL monitors right now before doing anything.
> >
> > Wido
> >


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Sean Sullivan
Hello Wido,

Thanks for the advice. While the data center has A/B circuits, redundant
power, etc., if a ground fault happens it apparently travels outside and
causes the whole building to lose power.

The monitors are each the same with
2x e5 cpus
64gb of ram
4x 300gb 10k SAS drives in raid 10 (write through mode).
Ubuntu 14.04 with the latest updates prior to power failure (2016/Aug/10 -
3am CST)
Ceph hammer LTS 0.94.7

(we are still working on our jewel test cluster so it is planned but not in
place yet)

The only thing that seems to be corrupt is the monitors' leveldb store. I see
multiple issues on the Google leveldb GitHub from March 2016 about fsync and
power failure, so I assume this is an issue with leveldb.

I have backed up /var/lib/ceph/mon on all of my monitors before trying to
proceed with any form of recovery.

Is there any way to reconstruct the leveldb or replace the monitors and
recover the data?

I found the following post in which Sage says it is tedious but possible
(http://www.spinics.net/lists/ceph-devel/msg06662.html). Tedious is fine if
I have any chance of doing it. I have the fsid and the mon key map, and all
of the OSDs look to be fine, so all of the previous osd maps are there.

I just don't understand what key/values I need inside.

On Aug 11, 2016 1:33 AM, "Wido den Hollander"  wrote:

>
> > On 11 August 2016 at 0:10, Sean Sullivan <seapasu...@uchicago.edu> wrote:
> >
> >
> > I think it just got worse::
> >
> > all three monitors on my other cluster say that ceph-mon can't open
> > /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose
> all
> > 3 monitors? I saw a post by Sage saying that the data can be recovered as
> > all of the data is held on other servers. Is this possible? If so has
> > anyone had any experience doing so?
>
> I have never done so, so I couldn't tell you.
>
> However, it is weird that on all three it got corrupted. What hardware are
> you using? Was it properly protected against power failure?
>
> If your mon store is corrupted, I'm not sure what might happen.
>
> However, make a backup of ALL monitors right now before doing anything.
>
> Wido
>


Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Tomasz Kuzemko
I'm guessing you had writeback cache enabled on the ceph-mon disks (smartctl
-g wcache /dev/sdX) and the disk firmware did not respect flush semantics.
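(If you want to verify that, something along these lines should show the cache state; /dev/sda is just an example device and the flags assume reasonably recent smartmontools/hdparm:)

smartctl -g wcache /dev/sda   # reports whether the volatile write cache is enabled
hdparm -W /dev/sda            # equivalent check for SATA drives
# to turn it off on drives without power-loss protection:
# smartctl -s wcache,off /dev/sda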

On 11.08.2016 08:33, Wido den Hollander wrote:
> 
>> On 11 August 2016 at 0:10, Sean Sullivan wrote:
>>
>>
>> I think it just got worse::
>>
>> all three monitors on my other cluster say that ceph-mon can't open
>> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
>> 3 monitors? I saw a post by Sage saying that the data can be recovered as
>> all of the data is held on other servers. Is this possible? If so has
>> anyone had any experience doing so?
> 
> I have never done so, so I couldn't tell you.
> 
> However, it is weird that on all three it got corrupted. What hardware are 
> you using? Was it properly protected against power failure?
> 
> If your mon store is corrupted, I'm not sure what might happen.
> 
> However, make a backup of ALL monitors right now before doing anything.
> 
> Wido
> 

-- 
Tomasz Kuzemko
tomasz.kuze...@corp.ovh.com





Re: [ceph-users] Fwd: lost power. monitors died. Cephx errors now

2016-08-11 Thread Wido den Hollander

> On 11 August 2016 at 0:10, Sean Sullivan wrote:
> 
> 
> I think it just got worse::
> 
> all three monitors on my other cluster say that ceph-mon can't open
> /var/lib/ceph/mon/$(hostname). Is there any way to recover if you lose all
> 3 monitors? I saw a post by Sage saying that the data can be recovered as
> all of the data is held on other servers. Is this possible? If so has
> anyone had any experience doing so?

I have never done so, so I couldn't tell you.

However, it is weird that on all three it got corrupted. What hardware are you 
using? Was it properly protected against power failure?

If your mon store is corrupted, I'm not sure what might happen.

However, make a backup of ALL monitors right now before doing anything.

Wido
