Re: [ceph-users] pgs inconsistent

2019-08-16 Thread Ronny Aasen

On 15.08.2019 16:38, huxia...@horebdata.cn wrote:

Dear folks,

I have a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, 
on Luminous 12.2.12. Some days ago one OSD went down (the disk itself is 
still fine) due to a RocksDB crash. I tried to restart that OSD but it 
failed, so I tried to rebalance and ran into inconsistent PGs.


What can I do to get the cluster working again?

Thanks a lot for helping me out.

Samuel

**
# ceph -s
   cluster:
     id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
     health: HEALTH_ERR
             pauserd,pausewr,noout flag(s) set
             191444 scrub errors
             Possible data damage: 376 pgs inconsistent
   services:
     mon: 3 daemons, quorum horeb71,horeb72,horeb73
     mgr: horeb73(active), standbys: horeb71, horeb72
     osd: 9 osds: 8 up, 8 in
          flags pauserd,pausewr,noout
   data:
     pools:   1 pools, 1024 pgs
     objects: 524.29k objects, 1.99TiB
     usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
     pgs:     645 active+clean
              376 active+clean+inconsistent
              3   active+clean+scrubbing+deep



That is a lot of inconsistent PGs. When you say replication = 2, do you 
mean two additional copies, as in size=3 min_size=2, or do you mean size=2 
min_size=1?


The reason I ask is that min_size=1 is a well-known way to get into lots 
of problems: a single disk can accept a write on its own, and before that 
write is recovered/backfilled the drive can die.


If you have min_size=1 I would recommend you set min_size=2 as the first 
step, to avoid creating more inconsistency while troubleshooting. If you 
have the space for it in the cluster you should also set size=3.
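
For example, a minimal sketch (the pool name is a placeholder; list your 
pools with "ceph osd pool ls" and check the current values with 
"ceph osd pool get <poolname> size" and "ceph osd pool get <poolname> min_size"):

   ceph osd pool set <poolname> min_size 2
   # only if the cluster has capacity for a third copy:
   ceph osd pool set <poolname> size 3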


if you run "#ceph health detail" you will get a list of the pg's that 
are inconsistent. check if there is a repeat offender osd in that list 
of pg's, and check that disk for issues. check dmesg and logs of the 
osd, and if there are smart errors.
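
Something along these lines can help find a repeat offender (the pool name, 
pg id, OSD id, device name and systemd unit are placeholders that depend on 
your setup):

   ceph health detail | grep inconsistent
   rados list-inconsistent-pg <poolname>
   rados list-inconsistent-obj <pgid> --format=json-pretty
   # then, on the host of the suspect OSD:
   dmesg | grep -i error
   smartctl -a /dev/sdX
   journalctl -u ceph-osd@<id>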


You can try to repair the inconsistent PGs automagically by running the 
command "ceph pg repair [pg id]", but make sure the hardware is good 
first.
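
As a rough sketch once the hardware checks out (jq and the pool name are 
assumptions; "rados list-inconsistent-pg" prints a JSON array of pg ids):

   ceph pg repair <pgid>
   # or loop over every inconsistent pg reported for the pool:
   for pg in $(rados list-inconsistent-pg <poolname> | jq -r '.[]'); do
       ceph pg repair $pg
   done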



good luck
Ronny




[ceph-users] pgs inconsistent

2019-08-15 Thread huxia...@horebdata.cn
Dear folks,

I have a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, on 
Luminous 12.2.12. Some days ago one OSD went down (the disk itself is still fine) 
due to a RocksDB crash. I tried to restart that OSD but it failed, so I tried to 
rebalance and ran into inconsistent PGs.

What can I do to get the cluster working again?

Thanks a lot for helping me out.

Samuel 

**
# ceph -s
  cluster:
id: 289e3afa-f188-49b0-9bea-1ab57cc2beb8
health: HEALTH_ERR
pauserd,pausewr,noout flag(s) set
191444 scrub errors
Possible data damage: 376 pgs inconsistent
 
  services:
mon: 3 daemons, quorum horeb71,horeb72,horeb73
mgr: horeb73(active), standbys: horeb71, horeb72
osd: 9 osds: 8 up, 8 in
 flags pauserd,pausewr,noout
 
  data:
pools:   1 pools, 1024 pgs
objects: 524.29k objects, 1.99TiB
usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
pgs: 645 active+clean
 376 active+clean+inconsistent
 3   active+clean+scrubbing+deep


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Christian Wuerdig
I'm not a big expert, but the OP said he suspects bitrot is at least part
of the issue, in which case you can have the situation where the drive has
ACK'ed the write but a later scrub discovers checksum errors.
Plus, you don't need to actually lose a drive to get inconsistent PGs with
size=2 min_size=1: flapping OSDs (even just temporarily) while the cluster
is receiving writes can generate this.

On Fri, Nov 3, 2017 at 12:05 PM, Denes Dolhay  wrote:
> Hi Greg,
>
> Accepting the fact that an OSD with outdated data can never accept a write,
> or IO of any kind, how is it possible that the system goes into this state?
>
> -All osds are Bluestore, checksum, mtime etc.
>
> -All osds are up and in
>
> -No hw failures, lost disks, damaged journals or databases etc.
>
> -The data became inconsistent
>
>
> Thanks,
>
> Denke.
>
>
> On 11/02/2017 11:51 PM, Gregory Farnum wrote:
>
>
> On Thu, Nov 2, 2017 at 1:21 AM koukou73gr  wrote:
>>
>> The scenario is actually a bit different, see:
>>
>> Let's assume size=2, min_size=1
>> -We are looking at pg "A" acting [1, 2]
>> -osd 1 goes down
>> -osd 2 accepts a write for pg "A"
>> -osd 2 goes down
>> -osd 1 comes back up, while osd 2 still down
>> -osd 1 has no way to know osd 2 accepted a write in pg "A"
>> -osd 1 accepts a new write to pg "A"
>> -osd 2 comes back up.
>>
>> bang! osd 1 and 2 now have different views of pg "A" but both claim to
>> have current data.
>
>
> In this case, OSD 1 will not accept IO precisely because it can not prove it
> has the current data. That is the basic purpose of OSD peering and holds in
> all cases.
> -Greg
>
>>
>>
>> -K.
>>
>> On 2017-11-01 20:27, Denes Dolhay wrote:
>> > Hello,
>> >
>> > I have a trick question for Mr. Turner's scenario:
>> > Let's assume size=2, min_size=1
>> > -We are looking at pg "A" acting [1, 2]
>> > -osd 1 goes down, OK
>> > -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1,
>> > OK
>> > -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
>> > incomplete and stopped) not OK, but this is the case...
>> > --> In this event, why does osd 1 accept IO to pg "A" knowing full well
>> > that its data is outdated and will cause an inconsistent state?
>> > Wouldn't it be prudent to deny io to pg "A" until either
>> > -osd 2 comes back (therefore we have a clean osd in the acting group)
>> > ... backfill would continue to osd 1 of course
>> > -or data in pg "A" is manually marked as lost, and then continues
>> > operation from osd 1 's (outdated) copy?


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Denes Dolhay

Hi Greg,

Accepting the fact that an OSD with outdated data can never accept a 
write, or IO of any kind, how is it possible that the system goes into 
this state?


-All osds are Bluestore, checksum, mtime etc.

-All osds are up and in

-No hw failures, lost disks, damaged journals or databases etc.

-The data became inconsistent


Thanks,

Denke.


On 11/02/2017 11:51 PM, Gregory Farnum wrote:


On Thu, Nov 2, 2017 at 1:21 AM koukou73gr wrote:


The scenario is actually a bit different, see:

Let's assume size=2, min_size=1
-We are looking at pg "A" acting [1, 2]
-osd 1 goes down
-osd 2 accepts a write for pg "A"
-osd 2 goes down
-osd 1 comes back up, while osd 2 still down
-osd 1 has no way to know osd 2 accepted a write in pg "A"
-osd 1 accepts a new write to pg "A"
-osd 2 comes back up.

bang! osd 1 and 2 now have different views of pg "A" but both claim to
have current data.


In this case, OSD 1 will not accept IO precisely because it can not 
prove it has the current data. That is the basic purpose of OSD 
peering and holds in all cases.

-Greg



-K.

On 2017-11-01 20:27, Denes Dolhay wrote:
> Hello,
>
> I have a trick question for Mr. Turner's scenario:
> Let's assume size=2, min_size=1
> -We are looking at pg "A" acting [1, 2]
> -osd 1 goes down, OK
> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
> incomplete and stopped) not OK, but this is the case...
> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
> that its data is outdated and will cause an inconsistent state?
> Wouldn't it be prudent to deny io to pg "A" until either
> -osd 2 comes back (therefore we have a clean osd in the acting group)
> ... backfill would continue to osd 1 of course
> -or data in pg "A" is manually marked as lost, and then continues
> operation from osd 1 's (outdated) copy?


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Gregory Farnum
On Thu, Nov 2, 2017 at 1:21 AM koukou73gr  wrote:

> The scenario is actually a bit different, see:
>
> Let's assume size=2, min_size=1
> -We are looking at pg "A" acting [1, 2]
> -osd 1 goes down
> -osd 2 accepts a write for pg "A"
> -osd 2 goes down
> -osd 1 comes back up, while osd 2 still down
> -osd 1 has no way to know osd 2 accepted a write in pg "A"
> -osd 1 accepts a new write to pg "A"
> -osd 2 comes back up.
>
> bang! osd 1 and 2 now have different views of pg "A" but both claim to
> have current data.


In this case, OSD 1 will not accept IO precisely because it can not prove
it has the current data. That is the basic purpose of OSD peering and holds
in all cases.
-Greg


>
> -K.
>
> On 2017-11-01 20:27, Denes Dolhay wrote:
> > Hello,
> >
> > I have a trick question for Mr. Turner's scenario:
> > Let's assume size=2, min_size=1
> > -We are looking at pg "A" acting [1, 2]
> > -osd 1 goes down, OK
> > -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1,
> OK
> > -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
> > incomplete and stopped) not OK, but this is the case...
> > --> In this event, why does osd 1 accept IO to pg "A" knowing full well
> > that its data is outdated and will cause an inconsistent state?
> > Wouldn't it be prudent to deny io to pg "A" until either
> > -osd 2 comes back (therefore we have a clean osd in the acting group)
> > ... backfill would continue to osd 1 of course
> > -or data in pg "A" is manually marked as lost, and then continues
> > operation from osd 1 's (outdated) copy?


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Hans van den Bogert
Never mind, I should’ve read the whole thread first.
> On Nov 2, 2017, at 10:50 AM, Hans van den Bogert  wrote:
> 
> 
>> On Nov 1, 2017, at 4:45 PM, David Turner wrote:
>> 
>> All it takes for data loss is that an osd on server 1 is marked down and a 
>> write happens to an osd on server 2.  Now the osd on server 2 goes down 
>> before the osd on server 1 has finished backfilling and the first osd 
>> receives a request to modify data in the object that it doesn't know the 
>> current state of.  Tada, you have data loss.
> 
> I’m probably misunderstanding, but if a osd on server 1 is backfilling, and 
> its only candidate to backfill from is an osd on server 2, and the latter 
> goes down; then wouldn’t the osd on server 1 block, i.e., not accept requests 
> to modify, until server 1 comes up again?
> Or is there a ‘hole' here somewhere where server 1 *thinks* it’s done 
> backfilling whereas the osdmap it used to backfill with was out of date?
> 
> Thanks, 
> 
> Hans



Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread Hans van den Bogert

> On Nov 1, 2017, at 4:45 PM, David Turner  wrote:
> 
> All it takes for data loss is that an osd on server 1 is marked down and a 
> write happens to an osd on server 2.  Now the osd on server 2 goes down 
> before the osd on server 1 has finished backfilling and the first osd 
> receives a request to modify data in the object that it doesn't know the 
> current state of.  Tada, you have data loss.

I’m probably misunderstanding, but if a osd on server 1 is backfilling, and its 
only candidate to backfill from is an osd on server 2, and the latter goes 
down; then wouldn’t the osd on server 1 block, i.e., not accept requests to 
modify, until server 1 comes up again?
Or is there a ‘hole' here somewhere where server 1 *thinks* it’s done 
backfilling whereas the osdmap it used to backfill with was out of date?

Thanks, 

Hans


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-02 Thread koukou73gr
The scenario is actually a bit different, see:

Let's assume size=2, min_size=1
-We are looking at pg "A" acting [1, 2]
-osd 1 goes down
-osd 2 accepts a write for pg "A"
-osd 2 goes down
-osd 1 comes back up, while osd 2 still down
-osd 1 has no way to know osd 2 accepted a write in pg "A"
-osd 1 accepts a new write to pg "A"
-osd 2 comes back up.

bang! osd 1 and 2 now have different views of pg "A" but both claim to
have current data.

-K.

On 2017-11-01 20:27, Denes Dolhay wrote:
> Hello,
> 
> I have a trick question for Mr. Turner's scenario:
> Let's assume size=2, min_size=1
> -We are looking at pg "A" acting [1, 2]
> -osd 1 goes down, OK
> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is
> incomplete and stopped) not OK, but this is the case...
> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
> that its data is outdated and will cause an inconsistent state?
> Wouldn't it be prudent to deny io to pg "A" until either
> -osd 2 comes back (therefore we have a clean osd in the acting group)
> ... backfill would continue to osd 1 of course
> -or data in pg "A" is manually marked as lost, and then continues
> operation from osd 1 's (outdated) copy?


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread David Turner
In that thread, I really like how Wido puts it.  He takes any code paths,
bugs, etc. out of the picture.  In reference to size=3 min_size=1 he says,
"Losing two disks at the same time is something which doesn't happen that
much, but if it happens you don't want to modify any data on the only copy
which you still have left.  Setting min_size to 1 should be a manual action
imho when size = 3 and you lose two copies. In that case YOU decide at
that moment if it is the right course of action."
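
The manual action Wido refers to would look something like this (the pool name
is a placeholder), taken deliberately and reverted once recovery has finished:

   ceph osd pool set <poolname> min_size 1   # temporary, conscious decision
   # ...wait for recovery/backfill to complete...
   ceph osd pool set <poolname> min_size 2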

On Wed, Nov 1, 2017 at 2:40 PM Denes Dolhay  wrote:

> Thanks!
>
> On 11/01/2017 07:30 PM, Gregory Farnum wrote:
>
> On Wed, Nov 1, 2017 at 11:27 AM Denes Dolhay  wrote:
>
>> Hello,
>> I have a trick question for Mr. Turner's scenario:
>> Let's assume size=2, min_size=1
>> -We are looking at pg "A" acting [1, 2]
>> -osd 1 goes down, OK
>> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
>> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is incomplete
>> and stopped) not OK, but this is the case...
>> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
>> that its data is outdated and will cause an inconsistent state?
>> Wouldn't it be prudent to deny io to pg "A" until either
>> -osd 2 comes back (therefore we have a clean osd in the acting group) ...
>> backfill would continue to osd 1 of course
>> -or data in pg "A" is manually marked as lost, and then continues
>> operation from osd 1 's (outdated) copy?
>>
>
> It does deny IO in that case. I think David was pointing out that if OSD 2
> is actually dead and gone, you've got data loss despite having only lost
> one OSD.
> -Greg
>
>
>>
>> Thanks in advance, I'm really curious!
>>
>> Denes.
>>
>>
>>
>> On 11/01/2017 06:33 PM, Mario Giammarco wrote:
>>
>> I have read your post then read the thread you suggested, very
>> interesting.
>> Then I read again your post and understood better.
>> The most important thing is that even with min_size=1 writes are
>> acknowledged after ceph wrote size=2 copies.
>> In the thread above there is:
>>
>> As David already said, when all OSDs are up and in for a PG Ceph will wait 
>> for ALL OSDs to Ack the write. Writes in RADOS are always synchronous.
>>
>> Only when OSDs go down you need at least min_size OSDs up before writes or 
>> reads are accepted.
>>
>> So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to 
>> take place.
>>
>>
>> You then show me a sequence of events that may happen in some use cases.
>> I tell you my use case which is quite different. We use ceph under
>> proxmox. The servers have disks on raid 5 (I agree that it is better to
>> expose single disks to Ceph but it is late).
>> So it is unlikely that a ceph disk fails because of raid. If a disk fails
>> it is probably because the entire server has failed (and we need to provide
>> business availability in this case) and so it will never come up again so
>> in my situation your sequence of events will never happen.
>> What shocked me is that I did not expect to see so many inconsistencies.
>> Thanks,
>> Mario
>>
>>
>> 2017-11-01 16:45 GMT+01:00 David Turner :
>>
>>> It looks like you're running with a size = 2 and min_size = 1 (the
>>> min_size is a guess, the size is based on how many osds belong to your
>>> problem PGs).  Here's some good reading for you.
>>> https://www.spinics.net/lists/ceph-users/msg32895.html
>>>
>>> Basically the gist is that when running with size = 2 you should assume
>>> that data loss is an eventuality and choose that it is ok for your use
>>> case.  This can be mitigated by using min_size = 2, but then your pool will
>>> block while an OSD is down and you'll have to manually go in and change the
>>> min_size temporarily to perform maintenance.
>>>
>>> All it takes for data loss is that an osd on server 1 is marked down and
>>> a write happens to an osd on server 2.  Now the osd on server 2 goes down
>>> before the osd on server 1 has finished backfilling and the first osd
>>> receives a request to modify data in the object that it doesn't know the
>>> current state of.  Tada, you have data loss.
>>>
>>> How likely is this to happen... eventually it will.  PG subfolder
>>> splitting (if you're using filestore) will occasionally take long enough to
>>> perform the task that the osd is marked down while it's still running, and
>>> this usually happens for some time all over the cluster when it does.
>>> Another option is something that causes segfaults in the osds; another is
>>> restarting a node before all pgs are done backfilling/recovering; OOM
>>> killer; power outages; etc; etc.
>>>
>>> Why does min_size = 2 prevent this?  Because for a write to be
>>> acknowledged by the cluster, it has to be written to every OSD that is up
>>> as long as there are at least min_size available.  This means that every
>>> write is acknowledged by at least 2 osds every time.  If you're running
>>> with size = 2, 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread David Turner
I don't know.  I've seen several cases where people have inconsistent pgs
that they can't recover from and they didn't lose any disks.  The most
common thread between them is min_size=1.  My postulated scenario might not
be the actual path in the code that leads to it, but something does... and
min_size=1 seems to be the common thread.

On Wed, Nov 1, 2017 at 2:30 PM Gregory Farnum  wrote:

> On Wed, Nov 1, 2017 at 11:27 AM Denes Dolhay  wrote:
>
>> Hello,
>> I have a trick question for Mr. Turner's scenario:
>> Let's assume size=2, min_size=1
>> -We are looking at pg "A" acting [1, 2]
>> -osd 1 goes down, OK
>> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
>> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is incomplete
>> and stopped) not OK, but this is the case...
>> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
>> that its data is outdated and will cause an inconsistent state?
>> Wouldn't it be prudent to deny io to pg "A" until either
>> -osd 2 comes back (therefore we have a clean osd in the acting group) ...
>> backfill would continue to osd 1 of course
>> -or data in pg "A" is manually marked as lost, and then continues
>> operation from osd 1 's (outdated) copy?
>>
>
> It does deny IO in that case. I think David was pointing out that if OSD 2
> is actually dead and gone, you've got data loss despite having only lost
> one OSD.
> -Greg
>
>
>>
>> Thanks in advance, I'm really curious!
>>
>> Denes.
>>
>>
>>
>> On 11/01/2017 06:33 PM, Mario Giammarco wrote:
>>
>> I have read your post then read the thread you suggested, very
>> interesting.
>> Then I read again your post and understood better.
>> The most important thing is that even with min_size=1 writes are
>> acknowledged after ceph wrote size=2 copies.
>> In the thread above there is:
>>
>> As David already said, when all OSDs are up and in for a PG Ceph will wait 
>> for ALL OSDs to Ack the write. Writes in RADOS are always synchronous.
>>
>> Only when OSDs go down you need at least min_size OSDs up before writes or 
>> reads are accepted.
>>
>> So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to 
>> take place.
>>
>>
>> You then show me a sequence of events that may happen in some use cases.
>> I tell you my use case which is quite different. We use ceph under
>> proxmox. The servers have disks on raid 5 (I agree that it is better to
>> expose single disks to Ceph but it is late).
>> So it is unlikely that a ceph disk fails because of raid. If a disk fails
>> it is probably because the entire server has failed (and we need to provide
>> business availability in this case) and so it will never come up again so
>> in my situation your sequence of events will never happen.
>> What shocked me is that I did not expect to see so many inconsistencies.
>> Thanks,
>> Mario
>>
>>
>> 2017-11-01 16:45 GMT+01:00 David Turner :
>>
>>> It looks like you're running with a size = 2 and min_size = 1 (the
>>> min_size is a guess, the size is based on how many osds belong to your
>>> problem PGs).  Here's some good reading for you.
>>> https://www.spinics.net/lists/ceph-users/msg32895.html
>>>
>>> Basically the gist is that when running with size = 2 you should assume
>>> that data loss is an eventuality and choose that it is ok for your use
>>> case.  This can be mitigated by using min_size = 2, but then your pool will
>>> block while an OSD is down and you'll have to manually go in and change the
>>> min_size temporarily to perform maintenance.
>>>
>>> All it takes for data loss is that an osd on server 1 is marked down and
>>> a write happens to an osd on server 2.  Now the osd on server 2 goes down
>>> before the osd on server 1 has finished backfilling and the first osd
>>> receives a request to modify data in the object that it doesn't know the
>>> current state of.  Tada, you have data loss.
>>>
>>> How likely is this to happen... eventually it will.  PG subfolder
>>> splitting (if you're using filestore) will occasionally take long enough to
>>> perform the task that the osd is marked down while it's still running, and
>>> this usually happens for some time all over the cluster when it does.
>>> Another option is something that causes segfaults in the osds; another is
>>> restarting a node before all pgs are done backfilling/recovering; OOM
>>> killer; power outages; etc; etc.
>>>
>>> Why does min_size = 2 prevent this?  Because for a write to be
>>> acknowledged by the cluster, it has to be written to every OSD that is up
>>> as long as there are at least min_size available.  This means that every
>>> write is acknowledged by at least 2 osds every time.  If you're running
>>> with size = 2, then both copies of the data need to be online for a write
>>> to happen and thus can never have a write that the other does not.  If
>>> you're running with size = 3, then you always have a majority of the OSDs
>>> 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread Gregory Farnum
On Wed, Nov 1, 2017 at 11:27 AM Denes Dolhay  wrote:

> Hello,
> I have a trick question for Mr. Turner's scenario:
> Let's assume size=2, min_size=1
> -We are looking at pg "A" acting [1, 2]
> -osd 1 goes down, OK
> -osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
> -osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is incomplete
> and stopped) not OK, but this is the case...
> --> In this event, why does osd 1 accept IO to pg "A" knowing full well
> that its data is outdated and will cause an inconsistent state?
> Wouldn't it be prudent to deny io to pg "A" until either
> -osd 2 comes back (therefore we have a clean osd in the acting group) ...
> backfill would continue to osd 1 of course
> -or data in pg "A" is manually marked as lost, and then continues
> operation from osd 1 's (outdated) copy?
>

It does deny IO in that case. I think David was pointing out that if OSD 2
is actually dead and gone, you've got data loss despite having only lost
one OSD.
-Greg


>
> Thanks in advance, I'm really curious!
>
> Denes.
>
>
>
> On 11/01/2017 06:33 PM, Mario Giammarco wrote:
>
> I have read your post then read the thread you suggested, very
> interesting.
> Then I read again your post and understood better.
> The most important thing is that even with min_size=1 writes are
> acknowledged after ceph wrote size=2 copies.
> In the thread above there is:
>
> As David already said, when all OSDs are up and in for a PG Ceph will wait 
> for ALL OSDs to Ack the write. Writes in RADOS are always synchronous.
>
> Only when OSDs go down you need at least min_size OSDs up before writes or 
> reads are accepted.
>
> So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to 
> take place.
>
>
> You then show me a sequence of events that may happen in some use cases.
> I tell you my use case which is quite different. We use ceph under
> proxmox. The servers have disks on raid 5 (I agree that it is better to
> expose single disks to Ceph but it is late).
> So it is unlikely that a ceph disk fails because of raid. If a disk fails
> it is probably because the entire server has failed (and we need to provide
> business availability in this case) and so it will never come up again so
> in my situation your sequence of events will never happen.
> What shocked me is that I did not expect to see so many inconsistencies.
> Thanks,
> Mario
>
>
> 2017-11-01 16:45 GMT+01:00 David Turner :
>
>> It looks like you're running with a size = 2 and min_size = 1 (the
>> min_size is a guess, the size is based on how many osds belong to your
>> problem PGs).  Here's some good reading for you.
>> https://www.spinics.net/lists/ceph-users/msg32895.html
>>
>> Basically the gist is that when running with size = 2 you should assume
>> that data loss is an eventuality and choose that it is ok for your use
>> case.  This can be mitigated by using min_size = 2, but then your pool will
>> block while an OSD is down and you'll have to manually go in and change the
>> min_size temporarily to perform maintenance.
>>
>> All it takes for data loss is that an osd on server 1 is marked down and
>> a write happens to an osd on server 2.  Now the osd on server 2 goes down
>> before the osd on server 1 has finished backfilling and the first osd
>> receives a request to modify data in the object that it doesn't know the
>> current state of.  Tada, you have data loss.
>>
>> How likely is this to happen... eventually it will.  PG subfolder
>> splitting (if you're using filestore) will occasionally take long enough to
>> perform the task that the osd is marked down while it's still running, and
>> this usually happens for some time all over the cluster when it does.
>> Another option is something that causes segfaults in the osds; another is
>> restarting a node before all pgs are done backfilling/recovering; OOM
>> killer; power outages; etc; etc.
>>
>> Why does min_size = 2 prevent this?  Because for a write to be
>> acknowledged by the cluster, it has to be written to every OSD that is up
>> as long as there are at least min_size available.  This means that every
>> write is acknowledged by at least 2 osds every time.  If you're running
>> with size = 2, then both copies of the data need to be online for a write
>> to happen and thus can never have a write that the other does not.  If
>> you're running with size = 3, then you always have a majority of the OSDs
>> online receiving a write and they can both agree on the correct data to
>> give to the third when it comes back up.
>>
>> On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco 
>> wrote:
>>
>>> Sure here it is ceph -s:
>>>
>>> cluster:
>>>id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
>>>health: HEALTH_ERR
>>>100 scrub errors
>>>Possible data damage: 56 pgs inconsistent
>>>
>>>  services:
>>>mon: 3 daemons, quorum 0,1,pve3
>>>mgr: pve3(active)
>>>osd: 3 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread Denes Dolhay

Hello,

I have a trick question for Mr. Turner's scenario:
Let's assume size=2, min_size=1
-We are looking at pg "A" acting [1, 2]
-osd 1 goes down, OK
-osd 1 comes back up, backfill of pg "A" commences from osd 2 to osd 1, OK
-osd 2 goes down (and therefore pg "A" 's backfill to osd 1 is 
incomplete and stopped) not OK, but this is the case...
--> In this event, why does osd 1 accept IO to pg "A" knowing full well 
that its data is outdated and will cause an inconsistent state?

Wouldn't it be prudent to deny io to pg "A" until either
-osd 2 comes back (therefore we have a clean osd in the acting group) 
... backfill would continue to osd 1 of course
-or data in pg "A" is manually marked as lost, and then continues 
operation from osd 1 's (outdated) copy?


Thanks in advance, I'm really curious!
Denes.


On 11/01/2017 06:33 PM, Mario Giammarco wrote:
I have read your post then read the thread you suggested, very 
interesting.

Then I read again your post and understood better.
The most important thing is that even with min_size=1 writes are 
acknowledged after ceph wrote size=2 copies.

In the thread above there is:
As David already said, when all OSDs are up and in for a PG Ceph will wait for 
ALL OSDs to Ack the write. Writes in RADOS are always synchronous.

Only when OSDs go down you need at least min_size OSDs up before writes or 
reads are accepted.

So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to take 
place.

You then show me a sequence of events that may happen in some use cases.
I tell you my use case which is quite different. We use ceph under 
proxmox. The servers have disks on raid 5 (I agree that it is better 
to expose single disks to Ceph but it is late).
So it is unlikely that a ceph disk fails because of raid. If a disk 
fails, it is probably because the entire server has failed (and we need 
to provide business availability in this case) and so it will never 
come up again so in my situation your sequence of events will never 
happen.

What shocked me is that I did not expect to see so many inconsistencies.
Thanks,
Mario


2017-11-01 16:45 GMT+01:00 David Turner:


It looks like you're running with a size = 2 and min_size = 1 (the
min_size is a guess, the size is based on how many osds belong to
your problem PGs). Here's some good reading for you.
https://www.spinics.net/lists/ceph-users/msg32895.html


Basically the gist is that when running with size = 2 you should
assume that data loss is an eventuality and choose that it is ok
for your use case.  This can be mitigated by using min_size = 2,
but then your pool will block while an OSD is down and you'll have
to manually go in and change the min_size temporarily to perform
maintenance.

All it takes for data loss is that an osd on server 1 is marked
down and a write happens to an osd on server 2.  Now the osd on
server 2 goes down before the osd on server 1 has finished
backfilling and the first osd receives a request to modify data in
the object that it doesn't know the current state of.  Tada, you
have data loss.

How likely is this to happen... eventually it will. PG subfolder
splitting (if you're using filestore) will occasionally take long
enough to perform the task that the osd is marked down while it's
still running, and this usually happens for some time all over the
cluster when it does.  Another option is something that causes
segfaults in the osds; another is restarting a node before all pgs
are done backfilling/recovering; OOM killer; power outages; etc; etc.

Why does min_size = 2 prevent this?  Because for a write to be
acknowledged by the cluster, it has to be written to every OSD
that is up as long as there are at least min_size available.  This
means that every write is acknowledged by at least 2 osds every
time.  If you're running with size = 2, then both copies of the
data need to be online for a write to happen and thus can never
have a write that the other does not.  If you're running with size
= 3, then you always have a majority of the OSDs online receiving
a write and they can both agree on the correct data to give to the
third when it comes back up.

On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco wrote:

Sure here it is ceph -s:

cluster:
   id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
   health: HEALTH_ERR
   100 scrub errors
   Possible data damage: 56 pgs inconsistent

 services:
   mon: 3 daemons, quorum 0,1,pve3
   mgr: pve3(active)
   osd: 3 osds: 3 up, 3 in

 data:
   pools:   1 pools, 256 pgs
   objects: 269k objects, 1007 GB
   usage:   2050 GB 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread David Turner
RAID may make it unlikely that disk failures will be the cause of your data
loss, but none of my examples referred to hardware failure.  The daemon and
the code can have issues that cause OSDs to restart, or to stop responding
long enough to be marked down.  Data loss in this case isn't about losing
the disks, but about not being able to trust the data you have on the disk.

On Wed, Nov 1, 2017 at 1:33 PM Mario Giammarco  wrote:

> I have read your post then read the thread you suggested, very interesting.
> Then I read again your post and understood better.
> The most important thing is that even with min_size=1 writes are
> acknowledged after ceph wrote size=2 copies.
> In the thread above there is:
>
> As David already said, when all OSDs are up and in for a PG Ceph will wait 
> for ALL OSDs to Ack the write. Writes in RADOS are always synchronous.
>
> Only when OSDs go down you need at least min_size OSDs up before writes or 
> reads are accepted.
>
> So if min_size = 2 and size = 3 you need at least 2 OSDs online for I/O to 
> take place.
>
>
> You then show me a sequence of events that may happen in some use cases.
> I tell you my use case which is quite different. We use ceph under
> proxmox. The servers have disks on raid 5 (I agree that it is better to
> expose single disks to Ceph but it is late).
>> So it is unlikely that a ceph disk fails because of raid. If a disk fails
>> it is probably because the entire server has failed (and we need to provide
> business availability in this case) and so it will never come up again so
> in my situation your sequence of events will never happen.
> What shocked me is that I did not expect to see so many inconsistencies.
> Thanks,
> Mario
>
>
> 2017-11-01 16:45 GMT+01:00 David Turner :
>
>> It looks like you're running with a size = 2 and min_size = 1 (the
>> min_size is a guess, the size is based on how many osds belong to your
>> problem PGs).  Here's some good reading for you.
>> https://www.spinics.net/lists/ceph-users/msg32895.html
>>
>> Basically the gist is that when running with size = 2 you should assume
>> that data loss is an eventuality and choose that it is ok for your use
>> case.  This can be mitigated by using min_size = 2, but then your pool will
>> block while an OSD is down and you'll have to manually go in and change the
>> min_size temporarily to perform maintenance.
>>
>> All it takes for data loss is that an osd on server 1 is marked down and
>> a write happens to an osd on server 2.  Now the osd on server 2 goes down
>> before the osd on server 1 has finished backfilling and the first osd
>> receives a request to modify data in the object that it doesn't know the
>> current state of.  Tada, you have data loss.
>>
>> How likely is this to happen... eventually it will.  PG subfolder
>> splitting (if you're using filestore) will occasionally take long enough to
>> perform the task that the osd is marked down while it's still running, and
>> this usually happens for some time all over the cluster when it does.
>> Another option is something that causes segfaults in the osds; another is
>> restarting a node before all pgs are done backfilling/recovering; OOM
>> killer; power outages; etc; etc.
>>
>> Why does min_size = 2 prevent this?  Because for a write to be
>> acknowledged by the cluster, it has to be written to every OSD that is up
>> as long as there are at least min_size available.  This means that every
>> write is acknowledged by at least 2 osds every time.  If you're running
>> with size = 2, then both copies of the data need to be online for a write
>> to happen and thus can never have a write that the other does not.  If
>> you're running with size = 3, then you always have a majority of the OSDs
>> online receiving a write and they can both agree on the correct data to
>> give to the third when it comes back up.
>>
>> On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco 
>> wrote:
>>
>>> Sure here it is ceph -s:
>>>
>>> cluster:
>>>id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
>>>health: HEALTH_ERR
>>>100 scrub errors
>>>Possible data damage: 56 pgs inconsistent
>>>
>>>  services:
>>>mon: 3 daemons, quorum 0,1,pve3
>>>mgr: pve3(active)
>>>osd: 3 osds: 3 up, 3 in
>>>
>>>  data:
>>>pools:   1 pools, 256 pgs
>>>objects: 269k objects, 1007 GB
>>>usage:   2050 GB used, 1386 GB / 3436 GB avail
>>>pgs: 200 active+clean
>>> 56  active+clean+inconsistent
>>>
>>> ---
>>>
>>> ceph health detail :
>>>
>>> PG_DAMAGED Possible data damage: 56 pgs inconsistent
>>>pg 2.6 is active+clean+inconsistent, acting [1,0]
>>>pg 2.19 is active+clean+inconsistent, acting [1,2]
>>>pg 2.1e is active+clean+inconsistent, acting [1,2]
>>>pg 2.1f is active+clean+inconsistent, acting [1,2]
>>>pg 2.24 is active+clean+inconsistent, acting [0,2]
>>>pg 2.25 is active+clean+inconsistent, acting [2,0]
>>>

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread Mario Giammarco
I have read your post then read the thread you suggested, very interesting.
Then I read again your post and understood better.
The most important thing is that even with min_size=1 writes are
acknowledged after ceph wrote size=2 copies.
In the thread above there is:

As David already said, when all OSDs are up and in for a PG Ceph will
wait for ALL OSDs to Ack the write. Writes in RADOS are always
synchronous.

Only when OSDs go down you need at least min_size OSDs up before
writes or reads are accepted.

So if min_size = 2 and size = 3 you need at least 2 OSDs online for
I/O to take place.


You then show me a sequence of events that may happen in some use cases.
I tell you my use case which is quite different. We use ceph under proxmox.
The servers have disks on raid 5 (I agree that it is better to expose
single disks to Ceph but it is late).
So it is unlikely that a ceph disk fails because of raid. If a disk fails,
it is probably because the entire server has failed (and we need to provide
business availability in this case) and so it will never come up again so
in my situation your sequence of events will never happen.
What shocked me is that I did not expect to see so many inconsistencies.
Thanks,
Mario


2017-11-01 16:45 GMT+01:00 David Turner :

> It looks like you're running with a size = 2 and min_size = 1 (the
> min_size is a guess, the size is based on how many osds belong to your
> problem PGs).  Here's some good reading for you.  https://www.spinics.net/
> lists/ceph-users/msg32895.html
>
> Basically the gist is that when running with size = 2 you should assume
> that data loss is an eventuality and choose that it is ok for your use
> case.  This can be mitigated by using min_size = 2, but then your pool will
> block while an OSD is down and you'll have to manually go in and change the
> min_size temporarily to perform maintenance.
>
> All it takes for data loss is that an osd on server 1 is marked down and a
> write happens to an osd on server 2.  Now the osd on server 2 goes down
> before the osd on server 1 has finished backfilling and the first osd
> receives a request to modify data in the object that it doesn't know the
> current state of.  Tada, you have data loss.
>
> How likely is this to happen... eventually it will.  PG subfolder
> splitting (if you're using filestore) will occasionally take long enough to
> perform the task that the osd is marked down while it's still running, and
> this usually happens for some time all over the cluster when it does.
> Another option is something that causes segfaults in the osds; another is
> restarting a node before all pgs are done backfilling/recovering; OOM
> killer; power outages; etc; etc.
>
> Why does min_size = 2 prevent this?  Because for a write to be
> acknowledged by the cluster, it has to be written to every OSD that is up
> as long as there are at least min_size available.  This means that every
> write is acknowledged by at least 2 osds every time.  If you're running
> with size = 2, then both copies of the data need to be online for a write
> to happen and thus can never have a write that the other does not.  If
> you're running with size = 3, then you always have a majority of the OSDs
> online receiving a write and they can both agree on the correct data to
> give to the third when it comes back up.
>
> On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco 
> wrote:
>
>> Sure here it is ceph -s:
>>
>> cluster:
>>id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
>>health: HEALTH_ERR
>>100 scrub errors
>>Possible data damage: 56 pgs inconsistent
>>
>>  services:
>>mon: 3 daemons, quorum 0,1,pve3
>>mgr: pve3(active)
>>osd: 3 osds: 3 up, 3 in
>>
>>  data:
>>pools:   1 pools, 256 pgs
>>objects: 269k objects, 1007 GB
>>usage:   2050 GB used, 1386 GB / 3436 GB avail
>>pgs: 200 active+clean
>> 56  active+clean+inconsistent
>>
>> ---
>>
>> ceph health detail :
>>
>> PG_DAMAGED Possible data damage: 56 pgs inconsistent
>>pg 2.6 is active+clean+inconsistent, acting [1,0]
>>pg 2.19 is active+clean+inconsistent, acting [1,2]
>>pg 2.1e is active+clean+inconsistent, acting [1,2]
>>pg 2.1f is active+clean+inconsistent, acting [1,2]
>>pg 2.24 is active+clean+inconsistent, acting [0,2]
>>pg 2.25 is active+clean+inconsistent, acting [2,0]
>>pg 2.36 is active+clean+inconsistent, acting [1,0]
>>pg 2.3d is active+clean+inconsistent, acting [1,2]
>>pg 2.4b is active+clean+inconsistent, acting [1,0]
>>pg 2.4c is active+clean+inconsistent, acting [0,2]
>>pg 2.4d is active+clean+inconsistent, acting [1,2]
>>pg 2.4f is active+clean+inconsistent, acting [1,2]
>>pg 2.50 is active+clean+inconsistent, acting [1,2]
>>pg 2.52 is active+clean+inconsistent, acting [1,2]
>>pg 2.56 is active+clean+inconsistent, acting [1,0]
>>pg 2.5b is active+clean+inconsistent, acting [1,2]
>>pg 2.5c is 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread Gregory Farnum
Okay, so just to be clear you *haven't* run pg repair yet?

These PG copies look wildly different, but maybe I'm misunderstanding
something about the output.

I would run the repair first and see if that makes things happy. If you're
running on Bluestore, it will *not* break anything or "repair" with the
wrong data. :)
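
For the PGs listed in your health detail output, that would be, for example:

   ceph pg repair 2.6
   ceph pg repair 2.19
   # ...and so on for each inconsistent pg id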
-Greg

On Wed, Nov 1, 2017 at 12:31 AM Mario Giammarco 
wrote:

> Sure here it is ceph -s:
>
> cluster:
>id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
>health: HEALTH_ERR
>100 scrub errors
>Possible data damage: 56 pgs inconsistent
>
>  services:
>mon: 3 daemons, quorum 0,1,pve3
>mgr: pve3(active)
>osd: 3 osds: 3 up, 3 in
>
>  data:
>pools:   1 pools, 256 pgs
>objects: 269k objects, 1007 GB
>usage:   2050 GB used, 1386 GB / 3436 GB avail
>pgs: 200 active+clean
> 56  active+clean+inconsistent
>
> ---
>
> ceph health detail :
>
> PG_DAMAGED Possible data damage: 56 pgs inconsistent
>pg 2.6 is active+clean+inconsistent, acting [1,0]
>pg 2.19 is active+clean+inconsistent, acting [1,2]
>pg 2.1e is active+clean+inconsistent, acting [1,2]
>pg 2.1f is active+clean+inconsistent, acting [1,2]
>pg 2.24 is active+clean+inconsistent, acting [0,2]
>pg 2.25 is active+clean+inconsistent, acting [2,0]
>pg 2.36 is active+clean+inconsistent, acting [1,0]
>pg 2.3d is active+clean+inconsistent, acting [1,2]
>pg 2.4b is active+clean+inconsistent, acting [1,0]
>pg 2.4c is active+clean+inconsistent, acting [0,2]
>pg 2.4d is active+clean+inconsistent, acting [1,2]
>pg 2.4f is active+clean+inconsistent, acting [1,2]
>pg 2.50 is active+clean+inconsistent, acting [1,2]
>pg 2.52 is active+clean+inconsistent, acting [1,2]
>pg 2.56 is active+clean+inconsistent, acting [1,0]
>pg 2.5b is active+clean+inconsistent, acting [1,2]
>pg 2.5c is active+clean+inconsistent, acting [1,2]
>pg 2.5d is active+clean+inconsistent, acting [1,0]
>pg 2.5f is active+clean+inconsistent, acting [1,2]
>pg 2.71 is active+clean+inconsistent, acting [0,2]
>pg 2.75 is active+clean+inconsistent, acting [1,2]
>pg 2.77 is active+clean+inconsistent, acting [1,2]
>pg 2.79 is active+clean+inconsistent, acting [1,2]
>pg 2.7e is active+clean+inconsistent, acting [1,2]
>pg 2.83 is active+clean+inconsistent, acting [1,0]
>pg 2.8a is active+clean+inconsistent, acting [1,0]
>pg 2.92 is active+clean+inconsistent, acting [1,2]
>pg 2.98 is active+clean+inconsistent, acting [1,0]
>pg 2.9a is active+clean+inconsistent, acting [1,0]
>pg 2.9e is active+clean+inconsistent, acting [1,0]
>pg 2.9f is active+clean+inconsistent, acting [1,2]
>pg 2.c6 is active+clean+inconsistent, acting [0,2]
>pg 2.c7 is active+clean+inconsistent, acting [1,0]
>pg 2.c8 is active+clean+inconsistent, acting [1,2]
>pg 2.cb is active+clean+inconsistent, acting [1,2]
>pg 2.cd is active+clean+inconsistent, acting [1,2]
>pg 2.ce is active+clean+inconsistent, acting [1,2]
>pg 2.d2 is active+clean+inconsistent, acting [2,1]
>pg 2.da is active+clean+inconsistent, acting [1,0]
>pg 2.de is active+clean+inconsistent, acting [1,2]
>pg 2.e1 is active+clean+inconsistent, acting [1,2]
>pg 2.e4 is active+clean+inconsistent, acting [1,0]
>pg 2.e6 is active+clean+inconsistent, acting [0,2]
>pg 2.e8 is active+clean+inconsistent, acting [1,2]
>pg 2.ee is active+clean+inconsistent, acting [1,0]
>pg 2.f9 is active+clean+inconsistent, acting [1,2]
>pg 2.fa is active+clean+inconsistent, acting [1,0]
>pg 2.fb is active+clean+inconsistent, acting [1,2]
>pg 2.fc is active+clean+inconsistent, acting [1,2]
>pg 2.fe is active+clean+inconsistent, acting [1,0]
>pg 2.ff is active+clean+inconsistent, acting [1,0]
>
>
> and ceph pg 2.6 query:
>
> {
>"state": "active+clean+inconsistent",
>"snap_trimq": "[]",
>"epoch": 1513,
>"up": [
>1,
>0
>],
>"acting": [
>1,
>0
>],
>"actingbackfill": [
>"0",
>"1"
>],
>"info": {
>"pgid": "2.6",
>"last_update": "1513'89145",
>"last_complete": "1513'89145",
>"log_tail": "1503'87586",
>"last_user_version": 330583,
>"last_backfill": "MAX",
>"last_backfill_bitwise": 0,
>"purged_snaps": [
>{
>"start": "1",
>"length": "178"
>},
>{
>"start": "17a",
>"length": "3d"
>},
>{
>"start": "1b8",
>"length": "1"
>},
>{
>"start": "1ba",
>"length": "1"
>},
>{
>"start": "1bc",
>"length": "1"
>},
>{
>"start": "1be",
>"length": "44"
>},
>

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread David Turner
It looks like you're running with a size = 2 and min_size = 1 (the min_size
is a guess, the size is based on how many osds belong to your problem
PGs).  Here's some good reading for you.
https://www.spinics.net/lists/ceph-users/msg32895.html
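
To confirm that guess on your own cluster, something like this (the pool name
is a placeholder):

   ceph osd pool get <poolname> size
   ceph osd pool get <poolname> min_size
   # or, for all pools at once:
   ceph osd dump | grep 'replicated size'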

Basically the gist is that when running with size = 2 you should assume
that data loss is an eventuality and choose that it is ok for your use
case.  This can be mitigated by using min_size = 2, but then your pool will
block while an OSD is down and you'll have to manually go in and change the
min_size temporarily to perform maintenance.

All it takes for data loss is that an osd on server 1 is marked down and a
write happens to an osd on server 2.  Now the osd on server 2 goes down
before the osd on server 1 has finished backfilling and the first osd
receives a request to modify data in the object that it doesn't know the
current state of.  Tada, you have data loss.

How likely is this to happen... eventually it will.  PG subfolder splitting
(if you're using filestore) will occasionally take long enough to perform
the task that the osd is marked down while it's still running, and this
usually happens for some time all over the cluster when it does.  Another
option is something that causes segfaults in the osds; another is
restarting a node before all pgs are done backfilling/recovering; OOM
killer; power outages; etc; etc.

Why does min_size = 2 prevent this?  Because for a write to be acknowledged
by the cluster, it has to be written to every OSD that is up as long as
there are at least min_size available.  This means that every write is
acknowledged by at least 2 osds every time.  If you're running with size =
2, then both copies of the data need to be online for a write to happen and
thus can never have a write that the other does not.  If you're running
with size = 3, then you always have a majority of the OSDs online receiving
a write and they can both agree on the correct data to give to the third
when it comes back up.

On Wed, Nov 1, 2017 at 3:31 AM Mario Giammarco  wrote:

> Sure here it is ceph -s:
>
> cluster:
>id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
>health: HEALTH_ERR
>100 scrub errors
>Possible data damage: 56 pgs inconsistent
>
>  services:
>mon: 3 daemons, quorum 0,1,pve3
>mgr: pve3(active)
>osd: 3 osds: 3 up, 3 in
>
>  data:
>pools:   1 pools, 256 pgs
>objects: 269k objects, 1007 GB
>usage:   2050 GB used, 1386 GB / 3436 GB avail
>pgs: 200 active+clean
> 56  active+clean+inconsistent
>
> ---
>
> ceph health detail :
>
> PG_DAMAGED Possible data damage: 56 pgs inconsistent
>pg 2.6 is active+clean+inconsistent, acting [1,0]
>pg 2.19 is active+clean+inconsistent, acting [1,2]
>pg 2.1e is active+clean+inconsistent, acting [1,2]
>pg 2.1f is active+clean+inconsistent, acting [1,2]
>pg 2.24 is active+clean+inconsistent, acting [0,2]
>pg 2.25 is active+clean+inconsistent, acting [2,0]
>pg 2.36 is active+clean+inconsistent, acting [1,0]
>pg 2.3d is active+clean+inconsistent, acting [1,2]
>pg 2.4b is active+clean+inconsistent, acting [1,0]
>pg 2.4c is active+clean+inconsistent, acting [0,2]
>pg 2.4d is active+clean+inconsistent, acting [1,2]
>pg 2.4f is active+clean+inconsistent, acting [1,2]
>pg 2.50 is active+clean+inconsistent, acting [1,2]
>pg 2.52 is active+clean+inconsistent, acting [1,2]
>pg 2.56 is active+clean+inconsistent, acting [1,0]
>pg 2.5b is active+clean+inconsistent, acting [1,2]
>pg 2.5c is active+clean+inconsistent, acting [1,2]
>pg 2.5d is active+clean+inconsistent, acting [1,0]
>pg 2.5f is active+clean+inconsistent, acting [1,2]
>pg 2.71 is active+clean+inconsistent, acting [0,2]
>pg 2.75 is active+clean+inconsistent, acting [1,2]
>pg 2.77 is active+clean+inconsistent, acting [1,2]
>pg 2.79 is active+clean+inconsistent, acting [1,2]
>pg 2.7e is active+clean+inconsistent, acting [1,2]
>pg 2.83 is active+clean+inconsistent, acting [1,0]
>pg 2.8a is active+clean+inconsistent, acting [1,0]
>pg 2.92 is active+clean+inconsistent, acting [1,2]
>pg 2.98 is active+clean+inconsistent, acting [1,0]
>pg 2.9a is active+clean+inconsistent, acting [1,0]
>pg 2.9e is active+clean+inconsistent, acting [1,0]
>pg 2.9f is active+clean+inconsistent, acting [1,2]
>pg 2.c6 is active+clean+inconsistent, acting [0,2]
>pg 2.c7 is active+clean+inconsistent, acting [1,0]
>pg 2.c8 is active+clean+inconsistent, acting [1,2]
>pg 2.cb is active+clean+inconsistent, acting [1,2]
>pg 2.cd is active+clean+inconsistent, acting [1,2]
>pg 2.ce is active+clean+inconsistent, acting [1,2]
>pg 2.d2 is active+clean+inconsistent, acting [2,1]
>pg 2.da is active+clean+inconsistent, acting [1,0]
>pg 2.de is active+clean+inconsistent, acting [1,2]
>pg 2.e1 is active+clean+inconsistent, acting [1,2]
>pg 2.e4 is 

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-11-01 Thread Mario Giammarco
Sure here it is ceph -s:

cluster:
   id: 8bc45d9a-ef50-4038-8e1b-1f25ac46c945
   health: HEALTH_ERR
   100 scrub errors
   Possible data damage: 56 pgs inconsistent

 services:
   mon: 3 daemons, quorum 0,1,pve3
   mgr: pve3(active)
   osd: 3 osds: 3 up, 3 in

 data:
   pools:   1 pools, 256 pgs
   objects: 269k objects, 1007 GB
   usage:   2050 GB used, 1386 GB / 3436 GB avail
   pgs: 200 active+clean
56  active+clean+inconsistent

---

ceph health detail :

PG_DAMAGED Possible data damage: 56 pgs inconsistent
   pg 2.6 is active+clean+inconsistent, acting [1,0]
   pg 2.19 is active+clean+inconsistent, acting [1,2]
   pg 2.1e is active+clean+inconsistent, acting [1,2]
   pg 2.1f is active+clean+inconsistent, acting [1,2]
   pg 2.24 is active+clean+inconsistent, acting [0,2]
   pg 2.25 is active+clean+inconsistent, acting [2,0]
   pg 2.36 is active+clean+inconsistent, acting [1,0]
   pg 2.3d is active+clean+inconsistent, acting [1,2]
   pg 2.4b is active+clean+inconsistent, acting [1,0]
   pg 2.4c is active+clean+inconsistent, acting [0,2]
   pg 2.4d is active+clean+inconsistent, acting [1,2]
   pg 2.4f is active+clean+inconsistent, acting [1,2]
   pg 2.50 is active+clean+inconsistent, acting [1,2]
   pg 2.52 is active+clean+inconsistent, acting [1,2]
   pg 2.56 is active+clean+inconsistent, acting [1,0]
   pg 2.5b is active+clean+inconsistent, acting [1,2]
   pg 2.5c is active+clean+inconsistent, acting [1,2]
   pg 2.5d is active+clean+inconsistent, acting [1,0]
   pg 2.5f is active+clean+inconsistent, acting [1,2]
   pg 2.71 is active+clean+inconsistent, acting [0,2]
   pg 2.75 is active+clean+inconsistent, acting [1,2]
   pg 2.77 is active+clean+inconsistent, acting [1,2]
   pg 2.79 is active+clean+inconsistent, acting [1,2]
   pg 2.7e is active+clean+inconsistent, acting [1,2]
   pg 2.83 is active+clean+inconsistent, acting [1,0]
   pg 2.8a is active+clean+inconsistent, acting [1,0]
   pg 2.92 is active+clean+inconsistent, acting [1,2]
   pg 2.98 is active+clean+inconsistent, acting [1,0]
   pg 2.9a is active+clean+inconsistent, acting [1,0]
   pg 2.9e is active+clean+inconsistent, acting [1,0]
   pg 2.9f is active+clean+inconsistent, acting [1,2]
   pg 2.c6 is active+clean+inconsistent, acting [0,2]
   pg 2.c7 is active+clean+inconsistent, acting [1,0]
   pg 2.c8 is active+clean+inconsistent, acting [1,2]
   pg 2.cb is active+clean+inconsistent, acting [1,2]
   pg 2.cd is active+clean+inconsistent, acting [1,2]
   pg 2.ce is active+clean+inconsistent, acting [1,2]
   pg 2.d2 is active+clean+inconsistent, acting [2,1]
   pg 2.da is active+clean+inconsistent, acting [1,0]
   pg 2.de is active+clean+inconsistent, acting [1,2]
   pg 2.e1 is active+clean+inconsistent, acting [1,2]
   pg 2.e4 is active+clean+inconsistent, acting [1,0]
   pg 2.e6 is active+clean+inconsistent, acting [0,2]
   pg 2.e8 is active+clean+inconsistent, acting [1,2]
   pg 2.ee is active+clean+inconsistent, acting [1,0]
   pg 2.f9 is active+clean+inconsistent, acting [1,2]
   pg 2.fa is active+clean+inconsistent, acting [1,0]
   pg 2.fb is active+clean+inconsistent, acting [1,2]
   pg 2.fc is active+clean+inconsistent, acting [1,2]
   pg 2.fe is active+clean+inconsistent, acting [1,0]
   pg 2.ff is active+clean+inconsistent, acting [1,0]


and ceph pg 2.6 query:

{
    "state": "active+clean+inconsistent",
    "snap_trimq": "[]",
    "epoch": 1513,
    "up": [
        1,
        0
    ],
    "acting": [
        1,
        0
    ],
    "actingbackfill": [
        "0",
        "1"
    ],
    "info": {
        "pgid": "2.6",
        "last_update": "1513'89145",
        "last_complete": "1513'89145",
        "log_tail": "1503'87586",
        "last_user_version": 330583,
        "last_backfill": "MAX",
        "last_backfill_bitwise": 0,
        "purged_snaps": [
            {
                "start": "1",
                "length": "178"
            },
            {
                "start": "17a",
                "length": "3d"
            },
            {
                "start": "1b8",
                "length": "1"
            },
            {
                "start": "1ba",
                "length": "1"
            },
            {
                "start": "1bc",
                "length": "1"
            },
            {
                "start": "1be",
                "length": "44"
            },
            {
                "start": "205",
                "length": "12c"
            },
            {
                "start": "332",
                "length": "1"
            },
            {
                "start": "334",
                "length": "1"
            },
            {
                "start": "336",
                "length": "1"
            },
            {
                "start": "338",
                "length": "1"
            },
            {
                "start": "33a",
                "length": "1"
            }
        ],
        "history": {
            "epoch_created": 90,
            "epoch_pool_created": 90,
            "last_epoch_started": 1339,

Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-10-30 Thread Gregory Farnum
You'll need to tell us exactly what error messages you're seeing, what the
output of ceph -s is, and the output of pg query for the relevant PGs.
There's not a lot of documentation because much of this tooling is new,
it's changing quickly, and most people don't have the kinds of problems
that turn out to be unrepairable. We should do better about that, though.
-Greg
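
On Luminous there is also a command that lists the per-object scrub errors for
a single PG, which can be easier to read than the full pg query output. A
minimal sketch, using pg 2.6 from the health detail output in this thread as an
example (it only returns data for a PG that has already been scrubbed or
deep-scrubbed):

# ceph health detail
# rados list-inconsistent-obj 2.6 --format=json-pretty
# ceph pg 2.6 query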

On Mon, Oct 30, 2017, 11:40 AM Mario Giammarco  wrote:

>  >[Questions to the list]
>  >How is it possible that the cluster cannot repair itself with ceph pg
> repair?
>  >No good copies are remaining?
>  >Cannot decide which copy is valid or up to date?
>  >If so, why not, when there is checksum, mtime for everything?
>  >In this inconsistent state which object does the cluster serve when it
> doesn't know which one is valid?
>
>
> I am asking the same questions too; it seems strange to me that in a
> fault-tolerant clustered storage system like Ceph there is no
> documentation about this.
>
> I know that I am being pedantic, but please note that saying "to be sure,
> use three copies" is not enough, because I am not sure what Ceph really
> does when the three copies do not match.
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-10-30 Thread Mario Giammarco

>[Questions to the list]
>How is it possible that the cluster cannot repair itself with ceph pg 
repair?

>No good copies are remaining?
>Cannot decide which copy is valid or up to date?
>If so, why not, when there is checksum, mtime for everything?
>In this inconsistent state which object does the cluster serve when it
doesn't know which one is valid?



I am asking the same questions too; it seems strange to me that in a
fault-tolerant clustered storage system like Ceph there is no
documentation about this.


I know that I am being pedantic, but please note that saying "to be sure,
use three copies" is not enough, because I am not sure what Ceph really
does when the three copies do not match.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-10-30 Thread Mario Giammarco

>In general you should find that clusters running bluestore are much more
>effective about doing a repair automatically (because bluestore has
>checksums on all data, it knows which object is correct!), but there are
>still some situations where they won't. If that happens to you, I would not
>follow directions to resolve it unless they have the *exact* same symptoms
>you do, or you've corresponded with the list about it. :)

Thanks, but it is happening to me, so what can I do?

BTW: I suppose that in my case the problem is due to bitrot, because in my
test cluster I had two disks with unreadable sectors and bluestore
completely discarded them and put them out of the cluster.
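
One hedged way to check that suspicion is to count which OSD keeps showing up
in the acting set of the inconsistent PGs, and then look at that OSD's disk and
log. The device name, OSD id and log pattern below are placeholders and will
differ on a real cluster:

# ceph health detail | awk '/inconsistent, acting/ {print $NF}' | sort | uniq -c
# smartctl -a /dev/sdX
# dmesg -T | grep -i 'sdX\|i/o error'
# grep -ic 'csum' /var/log/ceph/ceph-osd.N.log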


So how does bluestore repair a pg? Does it move the data to another place on the hdd?


Thanks,

Mario

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-10-30 Thread Gregory Farnum
On Sat, Oct 28, 2017 at 5:38 AM Denes Dolhay  wrote:

> Hello,
>
> First of all, I would recommend that you use ceph pg repair wherever you
> can.
>
>
> When you have size=3 the cluster can compare 3 instances, therefore it is
> easier for it to spot which two are good, and which one is bad.
>
> When you use size=2 the case is harder in oh-so-many ways:
>
> -According to the documentation it is harder to determine which object is
> the faulty one.
>
> -If an osd dies the increased load (caused by the missing osd) and the
> extra io from the recovery process hit the other osds much harder,
> increasing the chance that another osd dies (because of disk failure caused
> by the sudden spike of extra load), and then you lose your data
>
> -If there is bitrot in the one remaining replica, then you do not have
> any valid copy of your data
>
> So, to summarize, the experts say that it is MUCH safer to have size=3
> min_size=2 (I'm far from an expert, I'm just quoting :))
>
>
> So, back to the task at hand:
>
> If you repaired all the pgs that you could by ceph pg repair, there is a manual
> recovery process, (written for filestore unfortunately):
>
> http://ceph.com/geen-categorie/ceph-manually-repair-object/
>
> The good news is that there is a fuse client for bluestore too, so you
> can mount it by hand and repair it as per the linked document,
>

Do not do this with bluestore. In general, if you need to edit stuff, it's
probably better to use the ceph-objectstore-tool, as it leaves the store in
a consistent state.
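
As a rough illustration only (the osd id, paths and object name are
placeholders; the OSD has to be stopped first, and exporting the PG before
removing anything gives you a way back):

# systemctl stop ceph-osd@1
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op list --pgid 2.6
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op export --pgid 2.6 --file /root/pg2.6.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 '<object-json-from-the-list-step>' remove
# systemctl start ceph-osd@1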

In general you should find that clusters running bluestore are much more
effective about doing a repair automatically (because bluestore has
checksums on all data, it knows which object is correct!), but there are
still some situations where they won't. If that happens to you, I would not
follow directions to resolve it unless they have the *exact* same symptoms
you do, or you've corresponded with the list about it. :)
-Greg


>
> I think that you could run ceph osd pool set [pool] size 3 to increase the
> copy count, but before that you should be certain that you have enough free
> space, and you'll not hit the osd pg count limits.
>
>
> [DISCLAIMER]:
> I have never done this, and I too have questions about this topic:
>
> [Questions to the list]
> How is it possible that the cluster cannot repair itself with ceph pg
> repair?
> No good copies are remaining?
> Cannot decide which copy is valid or up to date?
> If so, why not, when there is checksum, mtime for everything?
> In this inconsistent state which object does the cluster serve when it
> doesn't know which one is valid?
>
>
> Isn't there a way to do a more "online" repair?
>
> A way to examine and remove objects while the osd is running?
>
> Or better yet, to tell the cluster which copy should be used when
> repairing?
>
> There is a command, ceph pg force-recovery, but I cannot find
> documentation for it.
>
>
> Kind regards,
>
> Denes Dolhay.
>
>
>
> On 10/28/2017 01:05 PM, Mario Giammarco wrote:
>
> Hello,
> we recently upgraded two clusters to Ceph luminous with bluestore and we
> discovered that we have many more pgs in state active+clean+inconsistent.
> (Possible data damage, xx pgs inconsistent)
>
> This is probably due to checksums in bluestore that discover more errors.
>
> We have some pools with replica 2 and some with replica 3.
>
> I have read past forum threads and I have seen that Ceph does not
> automatically repair inconsistent pgs.
>
> Even manual repair sometimes fails.
>
> I would like to understand if I am losing my data:
>
> - with replica 2 I hope that ceph chooses the right replica by looking at
> checksums
> - with replica 3 I hope that there are no problems at all
>
> How can I tell ceph to simply create the second replica in another place?
>
> Because I suppose that with replica 2 and inconsistent pgs I have only one
> copy of data.
>
> Thank you in advance for any help.
>
> Mario
>
>
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs inconsistent, do I fear data loss?

2017-10-28 Thread Denes Dolhay

Hello,

First of all, I would recommend that you use ceph pg repair wherever
you can.
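
If the hardware has already been checked, a small shell loop can issue that
repair for every PG that health detail reports as inconsistent; this is only a
sketch, and it assumes the per-pg lines keep the format shown earlier in the
thread:

# ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}' | \
      while read pg; do ceph pg repair "$pg"; done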



When you have size=3 the cluster can compare 3 instances, therefore it
is easier for it to spot which two are good, and which one is bad.


When you use size=2 the case is harder in oh-so-many ways:

-According to the documentation it is harder to determine which object 
is the faulty one.


-If an osd dies the increased load (caused by the missing osd) and the
extra io from the recovery process hit the other osds much harder,
increasing the chance that another osd dies (because of disk failure
caused by the sudden spike of extra load), and then you lose your data


-If there is bitrot in the one remaining replica, then you do not have
any valid copy of your data


So, to summarize, the experts say that it is MUCH safer to have
size=3 min_size=2 (I'm far from an expert, I'm just quoting :))



So, back to the task at hand:

If you have repaired all the pgs that you could with ceph pg repair, there is
a manual recovery process (written for filestore, unfortunately):


http://ceph.com/geen-categorie/ceph-manually-repair-object/

The good news is that there is a fuse client for bluestore too, so you
can mount it by hand and repair it as per the linked document.



I think that you could run ceph osd pool set [pool] size 3 to increase the
copy count, but before that you should be certain that you have enough
free space, and that you will not hit the osd pg count limits.
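
A hedged sketch of what that might look like ([pool] is a placeholder; check
the current values and the free capacity first, and note that raising size
starts backfill of the extra copy):

# ceph osd pool get [pool] size
# ceph osd pool get [pool] min_size
# ceph df
# ceph osd pool set [pool] min_size 2
# ceph osd pool set [pool] size 3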



[DISCLAIMER]:
I have never done this, and I too have questions about this topic:

[Questions to the list]
How is it possible that the cluster cannot repair itself with ceph pg 
repair?

No good copies are remaining?
Cannot decide which copy is valid or up to date?
If so, why not, when there is checksum, mtime for everything?
In this inconsistent state which object does the cluster serve when it
doesn't know which one is valid?



Isn't there a way to do a more "online" repair?

A way to examine and remove objects while the osd is running?

Or better yet, to tell the cluster which copy should be used when
repairing?


There is a command, ceph pg force-recovery, but I cannot find 
documentation for it.
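
For what it is worth, on Luminous the command seems to take one or more pg ids
and only raises the recovery/backfill priority of those PGs; it does not repair
inconsistencies. A usage sketch, with pg ids taken from this thread as
examples:

# ceph pg force-recovery 2.6 2.19
# ceph pg cancel-force-recovery 2.6 2.19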



Kind regards,

Denes Dolhay.



On 10/28/2017 01:05 PM, Mario Giammarco wrote:

Hello,
we recently upgraded two clusters to Ceph luminous with bluestore and 
we discovered that we have many more pgs in state 
active+clean+inconsistent. (Possible data damage, xx pgs inconsistent)

This is probably due to checksums in bluestore that discover more errors.

We have some pools with replica 2 and some with replica 3.

I have read past forum threads and I have seen that Ceph does not
automatically repair inconsistent pgs.


Even manual repair sometimes fails.

I would like to understand if I am losing my data:

- with replica 2 I hope that ceph chooses the right replica by looking at
checksums

- with replica 3 I hope that there are no problems at all

How can I tell ceph to simply create the second replica in another place?

Because I suppose that with replica 2 and inconsistent pgs I have only 
one copy of data.


Thank you in advance for any help.

Mario







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs inconsistent, do I fear data loss?

2017-10-28 Thread Mario Giammarco
Hello,
we recently upgraded two clusters to Ceph luminous with bluestore and we
discovered that we have many more pgs in state active+clean+inconsistent.
(Possible data damage, xx pgs inconsistent)

This is probably due to checksums in bluestore that discover more errors.

We have some pools with replica 2 and some with replica 3.

I have read past forum threads and I have seen that Ceph does not
automatically repair inconsistent pgs.

Even manual repair sometimes fails.
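
When a repair fails, a hedged way to dig into a single PG is to deep-scrub it
again, look at the per-object errors, and only then re-issue the repair; pg 2.6
below is just an example id:

# ceph pg deep-scrub 2.6
# ceph -w
# rados list-inconsistent-obj 2.6 --format=json-pretty
# ceph pg repair 2.6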

I would like to understand if I am losing my data:

- with replica 2 I hope that ceph chooses the right replica by looking at checksums
- with replica 3 I hope that there are no problems at all

How can I tell ceph to simply create the second replica in another place?

Because I suppose that with replica 2 and inconsistent pgs I have only one
copy of data.

Thank you in advance for any help.

Mario
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com