[ceph-users] Re: Power outage recovery

2022-09-15 Thread Gregory Farnum
Recovering the mon store from the OSDs loses the mds and rgw keys they use to
authenticate with cephx. You need to set those up again with the auth
commands. I don’t have them handy, but they are discussed in the mailing list
archives.
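
From memory, recreating them looks roughly like this (the daemon IDs and
keyring paths below are only placeholders for whatever your deployment uses;
double-check the caps against the docs for your release):

  # recreate the MDS key and write it where the daemon expects its keyring
  ceph auth get-or-create mds.host-a \
      mon 'allow profile mds' mgr 'allow profile mds' mds 'allow *' osd 'allow *' \
      -o /var/lib/ceph/mds/ceph-host-a/keyring

  # recreate the RGW key
  ceph auth get-or-create client.rgw.host-a \
      mon 'allow rw' osd 'allow rwx' \
      -o /var/lib/ceph/radosgw/ceph-rgw.host-a/keyring
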
-Greg

On Thu, Sep 15, 2022 at 3:28 PM Jorge Garcia  wrote:

> Yes, I tried restarting them and even rebooting the mds machine. No joy.
> If I try to start ceph-mds by hand, it returns:
>
> 2022-09-15 15:21:39.848 7fc43dbd2700 -1 monclient(hunting):
> handle_auth_bad_method server allowed_methods [2] but i only support [2]
> failed to fetch mon config (--no-mon-config to skip)
>
> I found this information online, maybe something to try next:
>
> https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/
>
> But I think maybe the mds needs to be running before that?
>
> On 9/15/22 15:19, Wesley Dillingham wrote:
> > Having the quorum / monitors back up may change the MDS and RGW's
> > ability to start and stay running. Have you tried just restarting the
> > MDS / RGW daemons again?
> >
> > Respectfully,
> >
> > *Wes Dillingham*
> > w...@wesdillingham.com
> > LinkedIn 
> >
> >
> > On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia 
> wrote:
> >
> > OK, I'll try to give more details as I remember them.
> >
> > 1. There was a power outage and then power came back up.
> >
> > 2. When the systems came back up, I did a "ceph -s" and it never
> > returned. Further investigation revealed that the ceph-mon
> > processes had
> > not started in any of the 3 monitors. I looked at the log files
> > and it
> > said something about:
> >
> > ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> > found 30790637387776 in
> > /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
> >
> > Looking at the internet, I found some suggestions about
> > troubleshooting
> > monitors in:
> >
> >
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
> >
> > I quickly determined that the monitors weren't running, so I found
> > the
> > section where it said "RECOVERY USING OSDS". The description made
> > sense:
> >
> > "But what if all monitors fail at the same time? Since users are
> > encouraged to deploy at least three (and preferably five) monitors
> > in a
> > Ceph cluster, the chance of simultaneous failure is rare. But
> > unplanned
> > power-downs in a data center with improperly configured disk/fs
> > settings
> > could fail the underlying file system, and hence kill all the
> > monitors.
> > In this case, we can recover the monitor store with the information
> > stored in OSDs."
> >
> > So, I did the procedure described in that section, and then made sure
> > the correct keys were in the keyring and restarted the processes.
> >
> > WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL
> > MESSAGE, AND
> > NOW THE MONITORS ARE BACK! I must have missed some step in the
> > middle of
> > my panic.
> >
> > # ceph -s
> >
> >cluster:
> >  id: ----
> >  health: HEALTH_WARN
> >  mons are allowing insecure global_id reclaim
> >
> >services:
> >  mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
> >  mgr: host-b(active, since 19m), standbys: host-a, host-c
> >  osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
> >
> >data:
> >  pools:   14 pools, 2992 pgs
> >  objects: 91.58M objects, 290 TiB
> >  usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
> >  pgs: 2985 active+clean
> >   7    active+clean+scrubbing+deep
> >
> > Couple of missing or strange things:
> >
> > 1. Missing mds
> > 2. Missing rgw
> > 3. New warning showing up
> >
> > But overall, better than a couple hours ago. If anybody is still
> > reading
> > and has any suggestions about how to solve the 3 items above, that
> > would
> > be great! Otherwise, back to scanning the internet for ideas...
> >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Jorge Garcia
Yes, I tried restarting them and even rebooting the mds machine. No joy. 
If I try to start ceph-mds by hand, it returns:


2022-09-15 15:21:39.848 7fc43dbd2700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [2]

failed to fetch mon config (--no-mon-config to skip)
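
That failure presumably means the key the daemon presents no longer matches
anything in the rebuilt mon database. A quick way to compare the two might be
something like this (the mds ID and keyring path are just guesses at the
usual defaults):

  # what the monitors think the key is (errors out if the entry is gone)
  ceph auth get mds.$(hostname -s)

  # what the daemon is actually presenting
  cat /var/lib/ceph/mds/ceph-$(hostname -s)/keyring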

I found this information online, maybe something to try next:

https://docs.ceph.com/en/quincy/cephfs/recover-fs-after-mon-store-loss/

But I think maybe the mds needs to be running before that?
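
If I'm reading that page right, the commands it walks through are roughly
these (fs and pool names are placeholders, and the doc for your release is
authoritative):

  # recreate the fs entry in the rebuilt mon store, pointing at the existing pools
  ceph fs new cephfs cephfs_metadata cephfs_data --force --recover

  # --recover leaves the fs non-joinable; let MDS daemons back in when ready
  ceph fs set cephfs joinable true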

On 9/15/22 15:19, Wesley Dillingham wrote:
Having the quorum / monitors back up may change the MDS and RGW's 
ability to start and stay running. Have you tried just restarting the 
MDS / RGW daemons again?


Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia  wrote:

OK, I'll try to give more details as I remember them.

1. There was a power outage and then power came back up.

2. When the systems came back up, I did a "ceph -s" and it never
returned. Further investigation revealed that the ceph-mon
processes had
not started in any of the 3 monitors. I looked at the log files
and it
said something about:

ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
found 30790637387776 in
/var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")

Looking at the internet, I found some suggestions about
troubleshooting
monitors in:

https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I quickly determined that the monitors weren't running, so I found
the
section where it said "RECOVERY USING OSDS". The description made
sense:

"But what if all monitors fail at the same time? Since users are
encouraged to deploy at least three (and preferably five) monitors
in a
Ceph cluster, the chance of simultaneous failure is rare. But
unplanned
power-downs in a data center with improperly configured disk/fs
settings
could fail the underlying file system, and hence kill all the
monitors.
In this case, we can recover the monitor store with the information
stored in OSDs."

So, I did the procedure described in that section, and then made sure
the correct keys were in the keyring and restarted the processes.

WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL
MESSAGE, AND
NOW THE MONITORS ARE BACK! I must have missed some step in the
middle of
my panic.

# ceph -s

   cluster:
 id: ----
 health: HEALTH_WARN
 mons are allowing insecure global_id reclaim

   services:
 mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
 mgr: host-b(active, since 19m), standbys: host-a, host-c
 osd: 164 osds: 164 up (since 16m), 164 in (since 8h)

   data:
 pools:   14 pools, 2992 pgs
 objects: 91.58M objects, 290 TiB
 usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
 pgs: 2985 active+clean
  7    active+clean+scrubbing+deep

Couple of missing or strange things:

1. Missing mds
2. Missing rgw
3. New warning showing up

But overall, better than a couple hours ago. If anybody is still
reading
and has any suggestions about how to solve the 3 items above, that
would
be great! Otherwise, back to scanning the internet for ideas...

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Wesley Dillingham
Having the quorum / monitors back up may change the MDS and RGW's ability
to start and stay running. Have you tried just restarting the MDS / RGW
daemons again?
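
For a package-based install that would be something along the lines of the
following (unit names depend on how the daemons were deployed):

  systemctl restart ceph-mds@$(hostname -s)
  systemctl restart ceph-radosgw@rgw.$(hostname -s)

  # confirm they stay up
  systemctl status ceph-mds@$(hostname -s) ceph-radosgw@rgw.$(hostname -s)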

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, Sep 15, 2022 at 5:54 PM Jorge Garcia  wrote:

> OK, I'll try to give more details as I remember them.
>
> 1. There was a power outage and then power came back up.
>
> 2. When the systems came back up, I did a "ceph -s" and it never
> returned. Further investigation revealed that the ceph-mon processes had
> not started in any of the 3 monitors. I looked at the log files and it
> said something about:
>
> ceph_abort_msg("Bad table magic number: expected 9863518390377041911,
> found 30790637387776 in
> /var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")
>
> Looking at the internet, I found some suggestions about troubleshooting
> monitors in:
>
> https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
>
> I quickly determined that the monitors weren't running, so I found the
> section where it said "RECOVERY USING OSDS". The description made sense:
>
> "But what if all monitors fail at the same time? Since users are
> encouraged to deploy at least three (and preferably five) monitors in a
> Ceph cluster, the chance of simultaneous failure is rare. But unplanned
> power-downs in a data center with improperly configured disk/fs settings
> could fail the underlying file system, and hence kill all the monitors.
> In this case, we can recover the monitor store with the information
> stored in OSDs."
>
> So, I did the procedure described in that section, and then made sure
> the correct keys were in the keyring and restarted the processes.
>
> WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND
> NOW THE MONITORS ARE BACK! I must have missed some step in the middle of
> my panic.
>
> # ceph -s
>
>cluster:
>  id: ----
>  health: HEALTH_WARN
>  mons are allowing insecure global_id reclaim
>
>services:
>  mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
>  mgr: host-b(active, since 19m), standbys: host-a, host-c
>  osd: 164 osds: 164 up (since 16m), 164 in (since 8h)
>
>data:
>  pools:   14 pools, 2992 pgs
>  objects: 91.58M objects, 290 TiB
>  usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
>  pgs: 2985 active+clean
>   7    active+clean+scrubbing+deep
>
> Couple of missing or strange things:
>
> 1. Missing mds
> 2. Missing rgw
> 3. New warning showing up
>
> But overall, better than a couple hours ago. If anybody is still reading
> and has any suggestions about how to solve the 3 items above, that would
> be great! Otherwise, back to scanning the internet for ideas...
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Jorge Garcia

OK, I'll try to give more details as I remember them.

1. There was a power outage and then power came back up.

2. When the systems came back up, I did a "ceph -s" and it never 
returned. Further investigation revealed that the ceph-mon processes had 
not started in any of the 3 monitors. I looked at the log files and it 
said something about:


ceph_abort_msg("Bad table magic number: expected 9863518390377041911, 
found 30790637387776 in 
/var/lib/ceph/mon/ceph-gi-cprv-adm-01/store.db/2886524.sst")


Looking at the internet, I found some suggestions about troubleshooting 
monitors in:


https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/

I quickly determined that the monitors weren't running, so I found the 
section where it said "RECOVERY USING OSDS". The description made sense:


"But what if all monitors fail at the same time? Since users are 
encouraged to deploy at least three (and preferably five) monitors in a 
Ceph cluster, the chance of simultaneous failure is rare. But unplanned 
power-downs in a data center with improperly configured disk/fs settings 
could fail the underlying file system, and hence kill all the monitors. 
In this case, we can recover the monitor store with the information 
stored in OSDs."


So, I did the procedure described in that section, and then made sure 
the correct keys were in the keyring and restarted the processes.
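
For anyone finding this thread later, the heart of that procedure as I
understand it is roughly the following (paths and OSD IDs are illustrative;
the doc has the full loop that collects the store from every OSD host):

  # dump cluster map info from each OSD into a temporary mon store
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-0 \
      --no-mon-config --op update-mon-db --mon-store-path /tmp/mon-store

  # rebuild the mon store from that, using the admin keyring for auth entries
  ceph-monstore-tool /tmp/mon-store rebuild -- \
      --keyring /etc/ceph/ceph.client.admin.keyring

  # then back up and replace store.db under /var/lib/ceph/mon/<mon-id>/ with
  # the rebuilt store and start the monitors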


WELL, I WAS REDOING ALL THESE STEPS WHILE WRITING THIS MAIL MESSAGE, AND 
NOW THE MONITORS ARE BACK! I must have missed some step in the middle of 
my panic.


# ceph -s

  cluster:
    id: ----
    health: HEALTH_WARN
    mons are allowing insecure global_id reclaim

  services:
    mon: 3 daemons, quorum host-a, host-b, host-c (age 19m)
    mgr: host-b(active, since 19m), standbys: host-a, host-c
    osd: 164 osds: 164 up (since 16m), 164 in (since 8h)

  data:
    pools:   14 pools, 2992 pgs
    objects: 91.58M objects, 290 TiB
    usage:   437 TiB used, 1.2 PiB / 1.7 PiB avail
    pgs: 2985 active+clean
 7    active+clean+scrubbing+deep

Couple of missing or strange things:

1. Missing mds
2. Missing rgw
3. New warning showing up
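
(Item 3 looks like the health check for insecure global_id reclaim; once all
clients and daemons are on patched releases it can apparently be cleared
with:

  ceph config set mon auth_allow_insecure_global_id_reclaim false

but that is untested here.)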

But overall, better than a couple hours ago. If anybody is still reading 
and has any suggestions about how to solve the 3 items above, that would 
be great! Otherwise, back to scanning the internet for ideas...


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Wesley Dillingham
What does "ceph status" "ceph health detail" etc show, currently?

Based on what you have said here my thought is you have created a new
monitor quorum and as such all auth details from the old cluster are lost
including any and all mgr cephx auth keys, so what does the log for the mgr
say? How many monitors did you have before? Do you have a backup the old
monitor store?
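
If the mgr key really is gone, recreating it should be along these lines
(the mgr ID and keyring path are guesses at the usual defaults):

  ceph auth get-or-create mgr.$(hostname -s) \
      mon 'allow profile mgr' osd 'allow *' mds 'allow *' \
      -o /var/lib/ceph/mgr/ceph-$(hostname -s)/keyring

  systemctl restart ceph-mgr@$(hostname -s)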

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, Sep 15, 2022 at 2:18 PM Marc  wrote:

> > (particularly the "Recovery using OSDs" section). I got it so the mon
> > processes would start, but then the ceph-mgr process died, and would not
> > restart. Not sure how to recover so both ceph-mgr and ceph-mon processes
> > run. In the meantime, all the data is gone. Any suggestions?
>
> All the data is gone? Are all the OSDs running? Is your networking fine?
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Eugen Block
The data only seems to be gone (if you mean what I think you mean)  
because the MGRs are not running and the OSDs can’t report their  
status. But are all MONs and OSDs up? What is the ceph status? What do  
the MGRs log when trying to start them?
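
The usual places to look would be something like this (unit name and log
path depend on how the daemons were deployed):

  journalctl -u ceph-mgr@$(hostname -s) -n 200 --no-pager
  tail -n 200 /var/log/ceph/ceph-mgr.$(hostname -s).log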


Zitat von Jorge Garcia :


We have a Nautilus cluster that just got hit by a bad power outage. When
the admin systems came back up, only the ceph-mgr process was running (all
the ceph-mon processes would not start). I tried following the instructions
in
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-mon/
(particularly the "Recovery using OSDs" section). I got it so the mon
processes would start, but then the ceph-mgr process died, and would not
restart. Not sure how to recover so both ceph-mgr and ceph-mon processes
run. In the meantime, all the data is gone. Any suggestions?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io




___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Power outage recovery

2022-09-15 Thread Marc
> (particularly the "Recovery using OSDs" section). I got it so the mon
> processes would start, but then the ceph-mgr process died, and would not
> restart. Not sure how to recover so both ceph-mgr and ceph-mon processes
> run. In the meantime, all the data is gone. Any suggestions?

All the data is gone? Are all the OSDs running? Is your networking fine?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io