[ceph-users] Re: Help: corrupt pg

2020-03-27 Thread Jake Grimmett

Hi Greg,

Yes, this was caused by a chain of events. As a cautionary tale, the main 
ones were:



1) a minor nautilus release upgrade, followed by a rolling node restart 
script that mistakenly relied on "ceph -s" for cluster health info,


i.e. it didn't wait for the cluster to return to health before moving on 
to restarting the next node (see the sketch after this list), which caused...


2) several pgs to have unfound objects; the attempted fix was 
"mark_unfound_lost delete", which caused


3) primary osd (of the pg) to crash with: PrimaryLogPG.cc: 11550: FAILED 
ceph_assert(head_obc)


4) the only fix we found for the crashing primary osd was to destroy it 
and wait for the pg to recover. This worked, or so we thought,


5) however, looking at the logs, one pg was in 
"active+recovery_unfound+undersized+degraded+remapped"; I think I may 
have run "mark_unfound_lost delete" on this pg


6) after destroying the primary osd for this pg (oops!), the pg then went 
"inactive"


7) to restore access we set "ceph osd pool set ec82pool min_size 8" (on 
an EC 8+2 pool), at which point


8) the new primary OSD crashed with FAILED 
ceph_assert(clone_size.count(clone)), leaving us with a pg in a very bad 
state...
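
For what it's worth, a minimal sketch of the health gate the restart 
script should have had (hostnames are placeholders, and a stricter test 
could also insist on all pgs being active+clean rather than just 
HEALTH_OK):

#!/bin/bash
# restart OSD nodes one at a time, waiting for the cluster to fully
# recover before touching the next node, rather than glancing at "ceph -s"
for node in ceph-node1 ceph-node2 ceph-node3; do    # placeholder hostnames
    ssh "$node" systemctl restart ceph-osd.target
    until [ "$(ceph health)" = "HEALTH_OK" ]; do
        sleep 30
    done
done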



I will see if we can buy some consulting time; the alternative is 
several weeks of rsync.
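
If we do end up on the rsync route, my working sketch of steps 2-5 from 
my earlier mail is roughly the following (our cephfs is mounted at 
/backup; /source stands in for the machines we back up, and the path 
format emitted by cephfs-data-scan pg_files needs checking before it is 
fed to rm or rsync):

# 2) list every file with at least one object in the bad pg
cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files

# 3) remove those files through the cephfs mount
#    (assuming plain rm is sufficient; the truncate question is still open)
xargs -d '\n' rm -f -- < /root/bad_files

# 4) recreate the pg empty
ceph osd force-create-pg 5.750

# 5) copy the affected files back in from the original source
#    (paths in bad_files may need rewriting relative to /source)
rsync -av --files-from=/root/bad_files /source/ /backup/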


Many thanks again for your advice, it's very much appreciated,

Jake

On 26/03/2020 17:21, Gregory Farnum wrote:

On Wed, Mar 25, 2020 at 5:19 AM Jake Grimmett  wrote:

Dear All,

We are "in a bit of a pickle"...

No reply to my message (23/03/2020), subject "OSD: FAILED
ceph_assert(clone_size.count(clone))"

So I'm presuming it's not possible to recover the crashed OSD

From your later email it sounds like this is part of a chain of events
that included telling the OSDs to deliberately work around some
known-old-or-wrong state, so while I wouldn't say it's impossible to
fix it's probably not simple and may require buying some consulting
from one of the teams that do that. I don't think I've seen these
errors before anywhere, certainly.


This is bad news, as one pg may be lost (we are using EC 8+2; pg dump
shows [NONE,NONE,NONE,388,125,25,427,226,77,154])

Without this pg we have 1.8PB of broken cephfs.

I could rebuild the cluster from scratch, but this means no user backups
for a couple of weeks.

The cluster has 10 nodes, uses an EC 8:2 pool for cephfs data
(replicated NVMe metadata pool) and is running Nautilus 14.2.8

Clearly, it would be nicer if we could fix the OSD, but if this isn't
possible, can someone confirm that the right procedure to recover from a
corrupt pg is:

1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is "rm"
fine?
4) destroy the bad pg
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...

This sounds about right.
Keep in mind that the PG will just be a random fraction of the
default-4MB objects in CephFS, so you will hit a huge proportion of
large files. As I assume this is a data pool, keep in mind that any
large files will just show up as a 4MB hole where the missing object
is — this may be preferable to reverting the whole file, or let you
only copy in the missing data if they are logs, or whatever. And you
can use the CephFS metadata to determine if the file in backup is
up-to-date or not.
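A quick comparison along these lines should do it, assuming placeholder
mount points /backup for the damaged filesystem and /source for the
machine it backs up:

f="path/to/one/file"                   # one entry from the bad-files list
fs_mtime=$(stat -c %Y "/backup/$f")    # cephfs metadata is intact, only data objects are gone
src_mtime=$(stat -c %Y "/source/$f")
if [ "$src_mtime" -ge "$fs_mtime" ]; then
    echo "source copy is at least as new: safe to re-copy $f"
else
    echo "cephfs copy is newer: re-copying would revert changes to $f"
fi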
-Greg


a better "recipe" or other advice gratefully received,

best regards,
Jake




Note: I am working from home until further notice.

For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Note: I am working from home until further notice.
For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help: corrupt pg

2020-03-26 Thread Gregory Farnum
On Wed, Mar 25, 2020 at 5:19 AM Jake Grimmett  wrote:
>
> Dear All,
>
> We are "in a bit of a pickle"...
>
> No reply to my message (23/03/2020),  subject  "OSD: FAILED
> ceph_assert(clone_size.count(clone))"
>
> So I'm presuming it's not possible to recover the crashed OSD

From your later email it sounds like this is part of a chain of events
that included telling the OSDs to deliberately work around some
known-old-or-wrong state, so while I wouldn't say it's impossible to
fix it's probably not simple and may require buying some consulting
from one of the teams that do that. I don't think I've seen these
errors before anywhere, certainly.

>
> This is bad news, as one pg may be lost, (we are using EC 8+2, pg dump
> shows [NONE,NONE,NONE,388,125,25,427,226,77,154] )
>
> Without this pg we have 1.8PB of broken cephfs.
>
> I could rebuild the cluster from scratch, but this means no user backups
> for a couple of weeks.
>
> The cluster has 10 nodes, uses an EC 8:2 pool for cephfs data
> (replicated NVMe metadata pool) and is running Nautilus 14.2.8
>
> Clearly, it would be nicer if we could fix the OSD, but if this isn't
> possible, can someone confirm that the right procedure to recover from a
> corrupt pg is:
>
> 1) Stop all client access
> 2) find all files that store data on the bad pg, with:
> # cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
> 3) delete all of these bad files - presumably using truncate? or is "rm"
> fine?
> 4) destroy the bad pg
> # ceph osd force-create-pg 5.750
> 5) Copy the missing files back with rsync or similar...

This sounds about right.
Keep in mind that the PG will just be a random fraction of the
default-4MB objects in CephFS, so you will hit a huge proportion of
large files. As I assume this is a data pool, keep in mind that any
large files will just show up as a 4MB hole where the missing object
is — this may be preferable to reverting the whole file, or let you
only copy in the missing data if they are logs, or whatever. And you
can use the CephFS metadata to determine if the file in backup is
up-to-date or not.
-Greg

>
> a better "recipe" or other advice gratefully received,
>
> best regards,
> Jake
>
>
> 
>
> Note: I am working from home until further notice.
>
> For help, contact unixad...@mrc-lmb.cam.ac.uk
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> Phone 01223 267019
> Mobile 0776 9886539
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help: corrupt pg

2020-03-25 Thread Jake Grimmett

Hi Eugen,

Many thanks for your reply.

The other two OSDs are up and running, and are being used by other pgs with 
no problem; for some reason this pg refuses to use these OSDs.


The other two OSDs that are missing from this pg crashed at different 
times last month; each OSD crashed when we tried to fix a pg with 
recovery_unfound by running a command like:


# ceph pg 5.3fa mark_unfound_lost delete
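
In hindsight, inspecting the pg first would probably have been wise, e.g.

# list the unfound objects and which OSDs were queried for them
ceph pg 5.3fa list_unfound
# show why recovery is stuck (see the recovery_state section)
ceph pg 5.3fa query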

the osd crash is shown in the osd log file here 



"mark_unfound_lost delete" occurs at line 3708

This caused the primary osd to crash with: PrimaryLogPG.cc: 11550: 
FAILED ceph_assert(head_obc)


when the osd tries to restart, we see lots of log entries similar to:

    -3> 2020-02-10 12:25:58.795 7f5935dfe700  1 get compressor lz4 = 0x55cd193d34a0


and...

-1274> 2020-02-10 12:23:24.661 7f5936e00700  5 bluestore(/var/lib/ceph/osd/ceph-443) _do_alloc_write  0x2 bytes compressed using lz4 failed with errcode = -1, leaving uncompressed


the osd then repeatedly crashes with "PrimaryLogPG.cc: 11550: FAILED 
ceph_assert(head_obc)" but with no more "compressor lz4" entries


the only fix we found was to destroy & recreate the osd, and then allow 
ceph to recover.
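
For reference, a typical destroy & recreate cycle looks roughly like this 
(not necessarily the exact commands we ran; osd 443 and the device path 
are just examples):

systemctl stop ceph-osd@443
ceph osd destroy 443 --yes-i-really-mean-it
ceph-volume lvm zap --destroy /dev/sdX          # wipe the old LVs/data
ceph-volume lvm create --osd-id 443 --data /dev/sdX

then wait for backfill to finish before touching anything else.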


We thought that we could fix the small number of recovery unfound pgs by 
allowing their primary OSD to crash, and then recreating it.


Unfortunately, while I was waiting for the pg to heal, we seem to have 
got caught by another bug, as another osd in this pg got hit with


"OSD: FAILED ceph_assert(clone_size.count(clone))". this log is here:



the full "ceph pg dump" for this failed pg is

[root@ceph1 ~]# ceph pg dump | grep ^5.750
dumped all
5.750  190408  0  0  0  0  569643615603  0  0  3090  3090
    down+remapped  2020-03-25 11:17:47.228805  35398'3381328  35968:3266057
    [234,354,304,388,125,25,427,226,77,154]  234
    [NONE,NONE,NONE,388,125,25,427,226,77,154]  388
    24471'3200829  2020-01-28 15:48:35.574934  24471'3200829  2020-01-28 15:48:35.574934
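
The peering details, including why the pg refuses to go active on the 
remaining shards, should be visible in the recovery_state section of:

# dump peering / recovery state for the pg
ceph pg 5.750 query

(if the query hangs because the pg is down, "ceph pg map 5.750" and the 
osd logs are the fallback)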


I did notice this other LZ4 corruption bug:
https://tracker.ceph.com/issues/39525
Not sure if there is any relation.


best regards,

Jake


On 25/03/2020 14:22, Eugen Block wrote:

Hi,

is there any chance to recover the other failing OSDs that seem to 
have one chunk of this PG? Do the other OSDs fail with the same error?



Zitat von Jake Grimmett :


Dear All,

We are "in a bit of a pickle"...

No reply to my message (23/03/2020),  subject  "OSD: FAILED 
ceph_assert(clone_size.count(clone))"


So I'm presuming it's not possible to recover the crashed OSD

This is bad news, as one pg may be lost, (we are using EC 8+2, pg 
dump shows [NONE,NONE,NONE,388,125,25,427,226,77,154] )


Without this pg we have 1.8PB of broken cephfs.

I could rebuild the cluster from scratch, but this means no user 
backups for a couple of weeks.


The cluster has 10 nodes, uses an EC 8:2 pool for cephfs data 
(replicated NVMe metadata pool) and is running Nautilus 14.2.8


Clearly, it would be nicer if we could fix the OSD, but if this isn't 
possible, can someone confirm that the right procedure to recover 
from a corrupt pg is:


1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is 
"rm" fine?

4) destroy the bad pg
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...

a better "recipe" or other advice gratefully received,

best regards,
Jake




Note: I am working from home until further notice.

For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


Note: I am working from home until further notice.
For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: Help: corrupt pg

2020-03-25 Thread Eugen Block

Hi,

is there any chance to recover the other failing OSDs that seem to  
have one chunk of this PG? Do the other OSDs fail with the same error?



Zitat von Jake Grimmett :


Dear All,

We are "in a bit of a pickle"...

No reply to my message (23/03/2020),  subject  "OSD: FAILED  
ceph_assert(clone_size.count(clone))"


So I'm presuming it's not possible to recover the crashed OSD

This is bad news, as one pg may be lost, (we are using EC 8+2, pg  
dump shows [NONE,NONE,NONE,388,125,25,427,226,77,154] )


Without this pg we have 1.8PB of broken cephfs.

I could rebuild the cluster from scratch, but this means no user  
backups for a couple of weeks.


The cluster has 10 nodes, uses an EC 8:2 pool for cephfs data  
(replicated NVMe metadata pool) and is running Nautilus 14.2.8


Clearly, it would be nicer if we could fix the OSD, but if this  
isn't possible, can someone confirm that the right procedure to  
recover from a corrupt pg is:


1) Stop all client access
2) find all files that store data on the bad pg, with:
# cephfs-data-scan pg_files /backup 5.750 2> /dev/null > /root/bad_files
3) delete all of these bad files - presumably using truncate? or is  
"rm" fine?

4) destroy the bad pg
# ceph osd force-create-pg 5.750
5) Copy the missing files back with rsync or similar...

a better "recipe" or other advice gratefully received,

best regards,
Jake




Note: I am working from home until further notice.

For help, contact unixad...@mrc-lmb.cam.ac.uk
--
Dr Jake Grimmett
Head Of Scientific Computing
MRC Laboratory of Molecular Biology
Francis Crick Avenue,
Cambridge CB2 0QH, UK.
Phone 01223 267019
Mobile 0776 9886539

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io