Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-16 Thread Massimo Sgaravatto
And I confirm that a repair is not useful. As far as I can see it simply
"cleans" the error (without modifying the big object), but the error of
course reappears when the next deep scrub runs on that PG.
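
For reference, the cycle can be reproduced roughly like this (pg 9.20e is the
example PG from further down this thread; substitute your own):

ceph pg repair 9.20e          # clears the inconsistent state for the moment
ceph pg deep-scrub 9.20e      # force the next deep scrub rather than waiting
# once the deep scrub has finished:
ceph health detail
rados list-inconsistent-obj 9.20e --format=json-pretty   # size_too_large is back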

Cheers, Massimo

On Thu, Jan 16, 2020 at 9:35 AM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> In my cluster I saw that the problematic objects have been uploaded by a
> specific application (onedata), which I think used to upload the files
> doing something like:
>
> rados --pool  put  
>
> Now (since Luminous ?) the default object size is 128MB but if I am not
> wrong it was 100GB before.
> This would explain why I have such big objects around (which indeed have
> an old timestamp)
>
> Cheers, Massimo
>
> On Wed, Jan 15, 2020 at 7:06 PM Liam Monahan  wrote:
>
>> I just changed my max object size to 256MB and scrubbed and the errors
>> went away.  I’m not sure what can be done to reduce the size of these
>> objects, though, if it really is a problem.  Our cluster has dynamic bucket
>> index resharding turned on, but that sharding process shouldn’t help it if
>> non-index objects are what is over the limit.
>>
>> I don’t think a pg repair would do anything unless the config tunables
>> are adjusted.
>>
>> On Jan 15, 2020, at 10:56 AM, Massimo Sgaravatto <
>> massimo.sgarava...@gmail.com> wrote:
>>
>> I never changed the default value for that attribute
>>
>> I am missing why I have such big objects around
>>
>> I am also wondering what a pg repair would do in such case
>>
>> On Wed, Jan 15, 2020 at 4:18 PM Liam Monahan wrote:
>>
>>> Thanks for that link.
>>>
>>> Do you have a default osd max object size of 128M?  I’m thinking about
>>> doubling that limit to 256MB on our cluster.  Our largest object is only
>>> about 10% over that limit.
>>>
>>> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto <
>>> massimo.sgarava...@gmail.com> wrote:
>>>
>>> I guess this is coming from:
>>>
>>> https://github.com/ceph/ceph/pull/30783
>>>
>>> introduced in Nautilus 14.2.5
>>>
>>> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <
>>> massimo.sgarava...@gmail.com> wrote:
>>>
 As I wrote here:


 http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html

 I saw the same after an update from Luminous to Nautilus 14.2.6

 Cheers, Massimo

 On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan 
 wrote:

> Hi,
>
> I am getting one inconsistent object on our cluster with an
> inconsistency error that I haven’t seen before.  This started happening
> during a rolling upgrade of the cluster from 14.2.3 -> 14.2.6, but I am 
> not
> sure that’s related.
>
> I was hoping to know what the error means before trying a repair.
>
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1
> pg inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
> "epoch": 759019,
> "inconsistents": [
> {
> "object": {
> "name":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 692875
> },
> "errors": [
> "size_too_large"
> ],
> "union_shard_errors": [],
> "selected_object_info": {
> "oid": {
> "oid":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "key": "",
> "snapid": -2,
> "hash": 3321413134,
> "max": 0,
> "pool": 9,
> "namespace": ""
> },
> "version": "281183'692875",
> "prior_version": "281183'692874",
> "last_reqid": "client.34042469.0:206759091",
> "user_version": 692875,
> "size": 146097278,
> "mtime": "2017-07-03 12:43:35.569986",
> "local_mtime": "2017-07-03 12:43:35.571196",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 0,
> "truncate_size": 0,
> "data_digest": "0xf19c8035",
> "omap_digest": "0x",
> "expected_object_size": 0,
>   

Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-16 Thread Massimo Sgaravatto
In my cluster I saw that the problematic objects were uploaded by a
specific application (onedata), which I think used to upload the files
doing something like:

rados --pool <pool> put <object-name> <file>

The default maximum object size (osd_max_object_size) is now 128 MB (since
Luminous?), but if I am not wrong it was 100 GB before.
This would explain why I have such big objects around (which indeed have an
old timestamp)
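
For reference, a rough way to check the limit currently in effect and the
overshoot (assuming a Nautilus cluster with the centralized config database;
otherwise look for the option in ceph.conf):

ceph config get osd osd_max_object_size      # 134217728 (128 MiB) by default
# or ask a running OSD directly, on its host, e.g. one from the acting set below:
ceph daemon osd.509 config get osd_max_object_size
# the object reported in this thread is 146097278 bytes, about 139 MiB,
# which is what trips the 128 MiB default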

Cheers, Massimo

On Wed, Jan 15, 2020 at 7:06 PM Liam Monahan  wrote:

> I just changed my max object size to 256MB and scrubbed and the errors
> went away.  I’m not sure what can be done to reduce the size of these
> objects, though, if it really is a problem.  Our cluster has dynamic bucket
> index resharding turned on, but that sharding process shouldn’t help it if
> non-index objects are what is over the limit.
>
> I don’t think a pg repair would do anything unless the config tunables are
> adjusted.
>
> On Jan 15, 2020, at 10:56 AM, Massimo Sgaravatto <
> massimo.sgarava...@gmail.com> wrote:
>
> I never changed the default value for that attribute
>
> I am missing why I have such big objects around
>
> I am also wondering what a pg repair would do in such case
>
> On Wed, Jan 15, 2020 at 4:18 PM Liam Monahan wrote:
>
>> Thanks for that link.
>>
>> Do you have a default osd max object size of 128M?  I’m thinking about
>> doubling that limit to 256MB on our cluster.  Our largest object is only
>> about 10% over that limit.
>>
>> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto <
>> massimo.sgarava...@gmail.com> wrote:
>>
>> I guess this is coming from:
>>
>> https://github.com/ceph/ceph/pull/30783
>>
>> introduced in Nautilus 14.2.5
>>
>> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <
>> massimo.sgarava...@gmail.com> wrote:
>>
>>> As I wrote here:
>>>
>>>
>>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>>>
>>> I saw the same after an update from Luminous to Nautilus 14.2.6
>>>
>>> Cheers, Massimo
>>>
>>> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan 
>>> wrote:
>>>
 Hi,

 I am getting one inconsistent object on our cluster with an
 inconsistency error that I haven’t seen before.  This started happening
 during a rolling upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not
 sure that’s related.

 I was hoping to know what the error means before trying a repair.

 [root@objmon04 ~]# ceph health detail
 HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1
 pg inconsistent
 OSDMAP_FLAGS noout flag(s) set
 OSD_SCRUB_ERRORS 1 scrub errors
 PG_DAMAGED Possible data damage: 1 pg inconsistent
 pg 9.20e is active+clean+inconsistent, acting [509,674,659]

 rados list-inconsistent-obj 9.20e --format=json-pretty
 {
 "epoch": 759019,
 "inconsistents": [
 {
 "object": {
 "name":
 "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
 "nspace": "",
 "locator": "",
 "snap": "head",
 "version": 692875
 },
 "errors": [
 "size_too_large"
 ],
 "union_shard_errors": [],
 "selected_object_info": {
 "oid": {
 "oid":
 "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
 "key": "",
 "snapid": -2,
 "hash": 3321413134,
 "max": 0,
 "pool": 9,
 "namespace": ""
 },
 "version": "281183'692875",
 "prior_version": "281183'692874",
 "last_reqid": "client.34042469.0:206759091",
 "user_version": 692875,
 "size": 146097278,
 "mtime": "2017-07-03 12:43:35.569986",
 "local_mtime": "2017-07-03 12:43:35.571196",
 "lost": 0,
 "flags": [
 "dirty",
 "data_digest",
 "omap_digest"
 ],
 "truncate_seq": 0,
 "truncate_size": 0,
 "data_digest": "0xf19c8035",
 "omap_digest": "0x",
 "expected_object_size": 0,
 "expected_write_size": 0,
 "alloc_hint_flags": 0,
 "manifest": {
 "type": 0
 },
 "watchers": {}
 },
 "shards": [
 {
 "osd": 509,
 "primary": true,
 "errors": [],
 "size": 146097278
   

Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Liam Monahan
I just changed my max object size to 256 MB and scrubbed, and the errors went 
away.  I’m not sure what can be done to reduce the size of these objects, 
though, if it really is a problem.  Our cluster has dynamic bucket index 
resharding turned on, but that resharding process shouldn’t help if non-index 
objects are the ones over the limit.

I don’t think a pg repair would do anything unless the config tunables are 
adjusted.
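
For what it's worth, the change itself is a one-liner on a Nautilus cluster
that uses the centralized config database; raising the limit only silences the
scrub error, it does not shrink the objects:

ceph config set osd osd_max_object_size 268435456           # 256 MiB
# depending on the release, running OSDs may need a nudge (or a restart) to pick it up:
ceph tell 'osd.*' injectargs '--osd_max_object_size=268435456'
ceph pg deep-scrub 9.20e                                     # re-check the PG that was flagged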

> On Jan 15, 2020, at 10:56 AM, Massimo Sgaravatto 
>  wrote:
> 
> I never changed the default value for that attribute
> 
> I am missing why I have such big objects around 
> 
> I am also wondering what a pg repair would do in such case
> 
> On Wed, Jan 15, 2020 at 4:18 PM Liam Monahan wrote:
> Thanks for that link.
> 
> Do you have a default osd max object size of 128M?  I’m thinking about 
> doubling that limit to 256MB on our cluster.  Our largest object is only 
> about 10% over that limit.
> 
>> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto 
>> <massimo.sgarava...@gmail.com> wrote:
>> 
>> I guess this is coming from:
>> 
>> https://github.com/ceph/ceph/pull/30783 
>> 
>> 
>> introduced in Nautilus 14.2.5
>> 
>> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto 
>> <massimo.sgarava...@gmail.com> wrote:
>> As I wrote here:
>> 
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html 
>> 
>> 
>> I saw the same after an update from Luminous to Nautilus 14.2.6
>> 
>> Cheers, Massimo
>> 
>> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan wrote:
>> Hi,
>> 
>> I am getting one inconsistent object on our cluster with an inconsistency 
>> error that I haven’t seen before.  This started happening during a rolling 
>> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s 
>> related.
>> 
>> I was hoping to know what the error means before trying a repair.
>> 
>> [root@objmon04 ~]# ceph health detail
>> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg 
>> inconsistent
>> OSDMAP_FLAGS noout flag(s) set
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>> 
>> rados list-inconsistent-obj 9.20e --format=json-pretty
>> {
>> "epoch": 759019,
>> "inconsistents": [
>> {
>> "object": {
>> "name": 
>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>> "nspace": "",
>> "locator": "",
>> "snap": "head",
>> "version": 692875
>> },
>> "errors": [
>> "size_too_large"
>> ],
>> "union_shard_errors": [],
>> "selected_object_info": {
>> "oid": {
>> "oid": 
>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>> "key": "",
>> "snapid": -2,
>> "hash": 3321413134,
>> "max": 0,
>> "pool": 9,
>> "namespace": ""
>> },
>> "version": "281183'692875",
>> "prior_version": "281183'692874",
>> "last_reqid": "client.34042469.0:206759091",
>> "user_version": 692875,
>> "size": 146097278,
>> "mtime": "2017-07-03 12:43:35.569986",
>> "local_mtime": "2017-07-03 12:43:35.571196",
>> "lost": 0,
>> "flags": [
>> "dirty",
>> "data_digest",
>> "omap_digest"
>> ],
>> "truncate_seq": 0,
>> "truncate_size": 0,
>> "data_digest": "0xf19c8035",
>> "omap_digest": "0x",
>> "expected_object_size": 0,
>> "expected_write_size": 0,
>> "alloc_hint_flags": 0,
>> "manifest": {
>> "type": 0
>> },
>> "watchers": {}
>> },
>> "shards": [
>> {
>> "osd": 509,
>> "primary": true,
>> "errors": [],
>> "size": 146097278
>> },
>> {
>> "osd": 659,
>> "primary": false,
>> "errors": [],
>> "size": 146097278
>> },
>> {
>> "osd": 674,
>> "primary": false,
>> "errors": [],
>> "size": 146097278
>> }
>> ]
>> }
>> ]
>> }

Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Massimo Sgaravatto
I never changed the default value for that attribute

I am missing why I have such big objects around

I am also wondering what a pg repair would do in such case

On Wed, Jan 15, 2020 at 4:18 PM Liam Monahan wrote:

> Thanks for that link.
>
> Do you have a default osd max object size of 128M?  I’m thinking about
> doubling that limit to 256MB on our cluster.  Our largest object is only
> about 10% over that limit.
>
> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto <
> massimo.sgarava...@gmail.com> wrote:
>
> I guess this is coming from:
>
> https://github.com/ceph/ceph/pull/30783
>
> introduced in Nautilus 14.2.5
>
> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <
> massimo.sgarava...@gmail.com> wrote:
>
>> As I wrote here:
>>
>>
>> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>>
>> I saw the same after an update from Luminous to Nautilus 14.2.6
>>
>> Cheers, Massimo
>>
>> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan  wrote:
>>
>>> Hi,
>>>
>>> I am getting one inconsistent object on our cluster with an
>>> inconsistency error that I haven’t seen before.  This started happening
>>> during a rolling upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not
>>> sure that’s related.
>>>
>>> I was hoping to know what the error means before trying a repair.
>>>
>>> [root@objmon04 ~]# ceph health detail
>>> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg
>>> inconsistent
>>> OSDMAP_FLAGS noout flag(s) set
>>> OSD_SCRUB_ERRORS 1 scrub errors
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>>>
>>> rados list-inconsistent-obj 9.20e --format=json-pretty
>>> {
>>> "epoch": 759019,
>>> "inconsistents": [
>>> {
>>> "object": {
>>> "name":
>>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>> "nspace": "",
>>> "locator": "",
>>> "snap": "head",
>>> "version": 692875
>>> },
>>> "errors": [
>>> "size_too_large"
>>> ],
>>> "union_shard_errors": [],
>>> "selected_object_info": {
>>> "oid": {
>>> "oid":
>>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>>> "key": "",
>>> "snapid": -2,
>>> "hash": 3321413134,
>>> "max": 0,
>>> "pool": 9,
>>> "namespace": ""
>>> },
>>> "version": "281183'692875",
>>> "prior_version": "281183'692874",
>>> "last_reqid": "client.34042469.0:206759091",
>>> "user_version": 692875,
>>> "size": 146097278,
>>> "mtime": "2017-07-03 12:43:35.569986",
>>> "local_mtime": "2017-07-03 12:43:35.571196",
>>> "lost": 0,
>>> "flags": [
>>> "dirty",
>>> "data_digest",
>>> "omap_digest"
>>> ],
>>> "truncate_seq": 0,
>>> "truncate_size": 0,
>>> "data_digest": "0xf19c8035",
>>> "omap_digest": "0x",
>>> "expected_object_size": 0,
>>> "expected_write_size": 0,
>>> "alloc_hint_flags": 0,
>>> "manifest": {
>>> "type": 0
>>> },
>>> "watchers": {}
>>> },
>>> "shards": [
>>> {
>>> "osd": 509,
>>> "primary": true,
>>> "errors": [],
>>> "size": 146097278
>>> },
>>> {
>>> "osd": 659,
>>> "primary": false,
>>> "errors": [],
>>> "size": 146097278
>>> },
>>> {
>>> "osd": 674,
>>> "primary": false,
>>> "errors": [],
>>> "size": 146097278
>>> }
>>> ]
>>> }
>>> ]
>>> }
>>>
>>> Thanks,
>>> Liam
>>> —
>>> Senior Developer
>>> Institute for Advanced Computer Studies
>>> University of Maryland
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Liam Monahan
Thanks for that link.

Do you have a default osd max object size of 128M?  I’m thinking about doubling 
that limit to 256MB on our cluster.  Our largest object is only about 10% over 
that limit.
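
In case it is useful, a crude way to find the largest objects in a pool (the
pool name is only an example, and this stats every object, so it is slow on a
big pool):

POOL=default.rgw.buckets.data      # example; use the pool behind the inconsistent PG
rados -p "$POOL" ls | while read -r obj; do
    rados -p "$POOL" stat "$obj"
done | awk '{print $NF, $1}' | sort -rn | head     # size in bytes, largest first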

> On Jan 15, 2020, at 3:51 AM, Massimo Sgaravatto 
>  wrote:
> 
> I guess this is coming from:
> 
> https://github.com/ceph/ceph/pull/30783 
> 
> 
> introduced in Nautilus 14.2.5
> 
> On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto 
> <massimo.sgarava...@gmail.com> wrote:
> As I wrote here:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html 
> 
> 
> I saw the same after an update from Luminous to Nautilus 14.2.6
> 
> Cheers, Massimo
> 
> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan wrote:
> Hi,
> 
> I am getting one inconsistent object on our cluster with an inconsistency 
> error that I haven’t seen before.  This started happening during a rolling 
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s 
> related.
> 
> I was hoping to know what the error means before trying a repair.
> 
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg 
> inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
> 
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
> "epoch": 759019,
> "inconsistents": [
> {
> "object": {
> "name": 
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 692875
> },
> "errors": [
> "size_too_large"
> ],
> "union_shard_errors": [],
> "selected_object_info": {
> "oid": {
> "oid": 
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "key": "",
> "snapid": -2,
> "hash": 3321413134,
> "max": 0,
> "pool": 9,
> "namespace": ""
> },
> "version": "281183'692875",
> "prior_version": "281183'692874",
> "last_reqid": "client.34042469.0:206759091",
> "user_version": 692875,
> "size": 146097278,
> "mtime": "2017-07-03 12:43:35.569986",
> "local_mtime": "2017-07-03 12:43:35.571196",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 0,
> "truncate_size": 0,
> "data_digest": "0xf19c8035",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> },
> "shards": [
> {
> "osd": 509,
> "primary": true,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 659,
> "primary": false,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 674,
> "primary": false,
> "errors": [],
> "size": 146097278
> }
> ]
> }
> ]
> }
> 
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-15 Thread Massimo Sgaravatto
I guess this is coming from:

https://github.com/ceph/ceph/pull/30783

introduced in Nautilus 14.2.5
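
As far as I understand it, that change makes scrub compare each object's size
against osd_max_object_size and flag anything bigger, so the object reported in
this thread is only modestly over the default limit:

python3 -c 'print(146097278 - 128*1024*1024)'   # 11879550 bytes, about 11 MiB over 128 MiB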

On Wed, Jan 15, 2020 at 8:10 AM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> As I wrote here:
>
>
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html
>
> I saw the same after an update from Luminous to Nautilus 14.2.6
>
> Cheers, Massimo
>
> On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan  wrote:
>
>> Hi,
>>
>> I am getting one inconsistent object on our cluster with an inconsistency
>> error that I haven’t seen before.  This started happening during a rolling
>> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s
>> related.
>>
>> I was hoping to know what the error means before trying a repair.
>>
>> [root@objmon04 ~]# ceph health detail
>> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg
>> inconsistent
>> OSDMAP_FLAGS noout flag(s) set
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>>
>> rados list-inconsistent-obj 9.20e --format=json-pretty
>> {
>> "epoch": 759019,
>> "inconsistents": [
>> {
>> "object": {
>> "name":
>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>> "nspace": "",
>> "locator": "",
>> "snap": "head",
>> "version": 692875
>> },
>> "errors": [
>> "size_too_large"
>> ],
>> "union_shard_errors": [],
>> "selected_object_info": {
>> "oid": {
>> "oid":
>> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
>> "key": "",
>> "snapid": -2,
>> "hash": 3321413134,
>> "max": 0,
>> "pool": 9,
>> "namespace": ""
>> },
>> "version": "281183'692875",
>> "prior_version": "281183'692874",
>> "last_reqid": "client.34042469.0:206759091",
>> "user_version": 692875,
>> "size": 146097278,
>> "mtime": "2017-07-03 12:43:35.569986",
>> "local_mtime": "2017-07-03 12:43:35.571196",
>> "lost": 0,
>> "flags": [
>> "dirty",
>> "data_digest",
>> "omap_digest"
>> ],
>> "truncate_seq": 0,
>> "truncate_size": 0,
>> "data_digest": "0xf19c8035",
>> "omap_digest": "0x",
>> "expected_object_size": 0,
>> "expected_write_size": 0,
>> "alloc_hint_flags": 0,
>> "manifest": {
>> "type": 0
>> },
>> "watchers": {}
>> },
>> "shards": [
>> {
>> "osd": 509,
>> "primary": true,
>> "errors": [],
>> "size": 146097278
>> },
>> {
>> "osd": 659,
>> "primary": false,
>> "errors": [],
>> "size": 146097278
>> },
>> {
>> "osd": 674,
>> "primary": false,
>> "errors": [],
>> "size": 146097278
>> }
>> ]
>> }
>> ]
>> }
>>
>> Thanks,
>> Liam
>> —
>> Senior Developer
>> Institute for Advanced Computer Studies
>> University of Maryland
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent with error "size_too_large"

2020-01-14 Thread Massimo Sgaravatto
As I wrote here:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2020-January/037909.html

I saw the same after an update from Luminous to Nautilus 14.2.6

Cheers, Massimo

On Tue, Jan 14, 2020 at 7:45 PM Liam Monahan  wrote:

> Hi,
>
> I am getting one inconsistent object on our cluster with an inconsistency
> error that I haven’t seen before.  This started happening during a rolling
> upgrade of the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s
> related.
>
> I was hoping to know what the error means before trying a repair.
>
> [root@objmon04 ~]# ceph health detail
> HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg
> inconsistent
> OSDMAP_FLAGS noout flag(s) set
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 9.20e is active+clean+inconsistent, acting [509,674,659]
>
> rados list-inconsistent-obj 9.20e --format=json-pretty
> {
> "epoch": 759019,
> "inconsistents": [
> {
> "object": {
> "name":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 692875
> },
> "errors": [
> "size_too_large"
> ],
> "union_shard_errors": [],
> "selected_object_info": {
> "oid": {
> "oid":
> "2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
> "key": "",
> "snapid": -2,
> "hash": 3321413134,
> "max": 0,
> "pool": 9,
> "namespace": ""
> },
> "version": "281183'692875",
> "prior_version": "281183'692874",
> "last_reqid": "client.34042469.0:206759091",
> "user_version": 692875,
> "size": 146097278,
> "mtime": "2017-07-03 12:43:35.569986",
> "local_mtime": "2017-07-03 12:43:35.571196",
> "lost": 0,
> "flags": [
> "dirty",
> "data_digest",
> "omap_digest"
> ],
> "truncate_seq": 0,
> "truncate_size": 0,
> "data_digest": "0xf19c8035",
> "omap_digest": "0x",
> "expected_object_size": 0,
> "expected_write_size": 0,
> "alloc_hint_flags": 0,
> "manifest": {
> "type": 0
> },
> "watchers": {}
> },
> "shards": [
> {
> "osd": 509,
> "primary": true,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 659,
> "primary": false,
> "errors": [],
> "size": 146097278
> },
> {
> "osd": 674,
> "primary": false,
> "errors": [],
> "size": 146097278
> }
> ]
> }
> ]
> }
>
> Thanks,
> Liam
> —
> Senior Developer
> Institute for Advanced Computer Studies
> University of Maryland
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG inconsistent with error "size_too_large"

2020-01-14 Thread Liam Monahan
Hi,

I am getting one inconsistent object on our cluster with an inconsistency error 
that I haven’t seen before.  This started happening during a rolling upgrade of 
the cluster from 14.2.3 -> 14.2.6, but I am not sure that’s related.

I was hoping to know what the error means before trying a repair.

[root@objmon04 ~]# ceph health detail
HEALTH_ERR noout flag(s) set; 1 scrub errors; Possible data damage: 1 pg 
inconsistent
OSDMAP_FLAGS noout flag(s) set
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 9.20e is active+clean+inconsistent, acting [509,674,659]

rados list-inconsistent-obj 9.20e --format=json-pretty
{
"epoch": 759019,
"inconsistents": [
{
"object": {
"name": 
"2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
"nspace": "",
"locator": "",
"snap": "head",
"version": 692875
},
"errors": [
"size_too_large"
],
"union_shard_errors": [],
"selected_object_info": {
"oid": {
"oid": 
"2017-07-03-12-8b980d5b-23de-41f9-8b14-84a5bbc3f1c9.31293422.4-activedns-diff",
"key": "",
"snapid": -2,
"hash": 3321413134,
"max": 0,
"pool": 9,
"namespace": ""
},
"version": "281183'692875",
"prior_version": "281183'692874",
"last_reqid": "client.34042469.0:206759091",
"user_version": 692875,
"size": 146097278,
"mtime": "2017-07-03 12:43:35.569986",
"local_mtime": "2017-07-03 12:43:35.571196",
"lost": 0,
"flags": [
"dirty",
"data_digest",
"omap_digest"
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0xf19c8035",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0
},
"watchers": {}
},
"shards": [
{
"osd": 509,
"primary": true,
"errors": [],
"size": 146097278
},
{
"osd": 659,
"primary": false,
"errors": [],
"size": 146097278
},
{
"osd": 674,
"primary": false,
"errors": [],
"size": 146097278
}
]
}
]
}

Thanks,
Liam
—
Senior Developer
Institute for Advanced Computer Studies
University of Maryland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread Brad Hubbard
On Tue, Sep 25, 2018 at 7:50 PM Sergey Malinin  wrote:
>
> # rados list-inconsistent-obj 1.92
> {"epoch":519,"inconsistents":[]}

It's likely the epoch has changed since the last scrub and you'll need
to run another scrub to repopulate this data.
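
Concretely (PG id from this thread):

ceph pg deep-scrub 1.92        # or "ceph pg scrub 1.92" for a plain scrub
# wait for it to finish (watch "ceph -w" or the osd log), then:
rados list-inconsistent-obj 1.92 --format=json-pretty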

>
> September 25, 2018 4:58 AM, "Brad Hubbard"  wrote:
>
> > What does the output of the following command look like?
> >
> > $ rados list-inconsistent-obj 1.92
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread Sergey Malinin
# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}

September 25, 2018 4:58 AM, "Brad Hubbard"  wrote:

> What does the output of the following command look like?
> 
> $ rados list-inconsistent-obj 1.92
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread Marc Roos
 
And where is the manual for bluestore?



-Original Message-
From: mj [mailto:li...@merit.unu.edu] 
Sent: dinsdag 25 september 2018 9:56
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] PG inconsistent, "pg repair" not working

Hi,

I was able to solve a similar issue on our cluster using this blog:

https://ceph.com/geen-categorie/ceph-manually-repair-object/

It does help if you are running a 3/2 config.

Perhaps it helps you as well.

MJ

On 09/25/2018 02:37 AM, Sergey Malinin wrote:
> Hello,
> During normal operation our cluster suddenly thrown an error and since 
> then we have had 1 inconsistent PG, and one of clients sharing cephfs 
> mount has started to occasionally log "ceph: Failed to find inode X".
> "ceph pg repair" deep scrubs the PG and fails with the same error in log.
> Can anyone advise how to fix this?
> 
> 
> log entry:
> 2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
> 1.92 soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick 
> suitable object info
> 2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
> scrub 1.92 1:496296a8:::1000f44d0f4.0018:head on disk size 
> (3751936) does not match object info size (0) adjusted for ondisk to (0)
> 2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
> 1.92 scrub 3 errors
> 
> # ceph -v
> ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic
> (stable)
> 
> 
> # ceph health detail
> HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent 
> OSD_SCRUB_ERRORS 3 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 1.92 is active+clean+inconsistent, acting [4,9]
> 
> 
> # rados list-inconsistent-obj 1.92
> {"epoch":519,"inconsistents":[]}
> 
> 
> # ceph pg 1.92 query
> {
> "state": "active+clean+inconsistent",
> "snap_trimq": "[]",
> "snap_trimq_len": 0,
> "epoch": 520,
> "up": [
> 4,
> 9
> ],
> "acting": [
> 4,
> 9
> ],
> "acting_recovery_backfill": [
> "4",
> "9"
> ],
> "info": {
> "pgid": "1.92",
> "last_update": "520'2456340",
> "last_complete": "520'2456340",
> "log_tail": "520'2453330",
> "last_user_version": 7914566,
> "last_backfill": "MAX",
> "last_backfill_bitwise": 0,
> "purged_snaps": [],
> "history": {
> "epoch_created": 63,
> "epoch_pool_created": 63,
> "last_epoch_started": 520,
> "last_interval_started": 519,
> "last_epoch_clean": 520,
> "last_interval_clean": 519,
> "last_epoch_split": 0,
> "last_epoch_marked_full": 0,
> "same_up_since": 519,
> "same_interval_since": 519,
> "same_primary_since": 514,
> "last_scrub": "520'2456105",
> "last_scrub_stamp": "2018-09-25 02:17:35.631365",
> "last_deep_scrub": "520'2456105",
> "last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
> "last_clean_scrub_stamp": "2018-09-19 02:27:22.656268"
> },
> "stats": {
> "version": "520'2456340",
> "reported_seq": "6115579",
> "reported_epoch": "520",
> "state": "active+clean+inconsistent",
> "last_fresh": "2018-09-25 03:02:34.338256",
> "last_change": "2018-09-25 02:17:35.631476",
> "last_active": "2018-09-25 03:02:34.338256",
> "last_peered": "2018-09-25 03:02:34.338256",
> "last_clean": "2018-09-25 03:02:34.338256",
> "last_became_active": "2018-09-24 15:25:30.238044",
> "last_became_peered": "2018-09-24 15:25:30.238044",
> "last_unstale": "2018-09-25 03:02:34.338256",
> "last_undegraded": "2018-09-25 03:02:34.338256",
> "last_fullsized": "2018-09-25 03:02:34.338256",
> "mapping_epoch": 519,
> "log_start": "520'2453330",
> "ondisk_log_start": "520'2453330",
> "created": 63,
> "last_epoch_clean": 520,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "520'2456105",
> "last_scrub_stamp": "2018-09-25 02:17:35.631365",
> "last_deep_scrub": "520'24561

Re: [ceph-users] PG inconsistent, "pg repair" not working

2018-09-25 Thread mj

Hi,

I was able to solve a similar issue on our cluster using this blog:

https://ceph.com/geen-categorie/ceph-manually-repair-object/

It does help if you are running a 3/2 config.

Perhaps it helps you as well.
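
For anyone who cannot reach that post, the gist is roughly the following (a
sketch for Filestore OSDs; the OSD id, PG and object path are placeholders
patterned on this thread, and be sure which replica is the bad one before
touching anything):

ceph health detail                              # find the inconsistent PG and its acting set
grep -Hn 'ERR' /var/log/ceph/ceph-osd.4.log     # identify the offending object and shard
systemctl stop ceph-osd@4
ceph-osd -i 4 --flush-journal                   # Filestore only
# move the bad copy of the object file out of the PG directory, e.g.:
mv /var/lib/ceph/osd/ceph-4/current/1.92_head/DIR_*/<object>__head_* /root/
systemctl start ceph-osd@4
ceph pg repair 1.92                             # repair then restores the good copy
# on BlueStore there is no plain file to move; ceph-objectstore-tool would be
# needed instead, which is a different (and riskier) procedure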

MJ

On 09/25/2018 02:37 AM, Sergey Malinin wrote:

Hello,
During normal operation our cluster suddenly thrown an error and since 
then we have had 1 inconsistent PG, and one of clients sharing cephfs 
mount has started to occasionally log "ceph: Failed to find inode X".

"ceph pg repair" deep scrubs the PG and fails with the same error in log.
Can anyone advise how to fix this?


log entry:
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
1.92 soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick 
suitable object info
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
scrub 1.92 1:496296a8:::1000f44d0f4.0018:head on disk size (3751936) 
does not match object info size (0) adjusted for ondisk to (0)
2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 
1.92 scrub 3 errors


# ceph -v
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic 
(stable)



# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.92 is active+clean+inconsistent, acting [4,9]


# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}


# ceph pg 1.92 query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 520,
"up": [
4,
9
],
"acting": [
4,
9
],
"acting_recovery_backfill": [
"4",
"9"
],
"info": {
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "520'2456340",
"log_tail": "520'2453330",
"last_user_version": 7914566,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 63,
"epoch_pool_created": 63,
"last_epoch_started": 520,
"last_interval_started": 519,
"last_epoch_clean": 520,
"last_interval_clean": 519,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 519,
"same_interval_since": 519,
"same_primary_since": 514,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268"
},
"stats": {
"version": "520'2456340",
"reported_seq": "6115579",
"reported_epoch": "520",
"state": "active+clean+inconsistent",
"last_fresh": "2018-09-25 03:02:34.338256",
"last_change": "2018-09-25 02:17:35.631476",
"last_active": "2018-09-25 03:02:34.338256",
"last_peered": "2018-09-25 03:02:34.338256",
"last_clean": "2018-09-25 03:02:34.338256",
"last_became_active": "2018-09-24 15:25:30.238044",
"last_became_peered": "2018-09-24 15:25:30.238044",
"last_unstale": "2018-09-25 03:02:34.338256",
"last_undegraded": "2018-09-25 03:02:34.338256",
"last_fullsized": "2018-09-25 03:02:34.338256",
"mapping_epoch": 519,
"log_start": "520'2453330",
"ondisk_log_start": "520'2453330",
"created": 63,
"last_epoch_clean": 520,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268",
"log_size": 3010,
"ondisk_log_size": 3010,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 23138366490,
"num_objects": 479532,
"num_object_clones": 0,
"num_object_copies": 959064,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 479532,
"num_whiteouts": 0,
"num_read": 3295720,
"num_read_kb": 63508374,
"num_write": 2495519,
"num_write_kb": 81795199,
"num_scrub_errors": 3,
"num_shallow_scrub_errors": 3,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 550,
"num_bytes_recovered": 15760916,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0
},
"up": [
4,
9
],
"acting": [
4,
9
],
"blocked_by": [],
"up_primary": 4,
"acting_primary": 4,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 520,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "9",
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "515'2438936",
"log_tail": 

[ceph-users] PG inconsistent, "pg repair" not working

2018-09-24 Thread Sergey Malinin
Hello,
During normal operation our cluster suddenly threw an error, and since then we 
have had 1 inconsistent PG; one of the clients sharing the cephfs mount has also 
started to occasionally log "ceph: Failed to find inode X".
"ceph pg repair" deep scrubs the PG and fails with the same error in the log.
Can anyone advise how to fix this?
log entry:
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 
soid 1:496296a8:::1000f44d0f4.0018:head: failed to pick suitable object info
2018-09-20 06:48:23.081 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : scrub 
1.92 1:496296a8:::1000f44d0f4.0018:head on disk size (3751936) does not 
match object info size (0) adjusted for ondisk to (0)
2018-09-20 06:50:36.925 7f0b2efd9700 -1 log_channel(cluster) log [ERR] : 1.92 
scrub 3 errors

# ceph -v
ceph version 13.2.1 (5533ecdc0fda920179d7ad84e0aa65a127b20d77) mimic (stable)
# ceph health detail
HEALTH_ERR 3 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 3 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 1.92 is active+clean+inconsistent, acting [4,9]
# rados list-inconsistent-obj 1.92
{"epoch":519,"inconsistents":[]}
# ceph pg 1.92 query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"snap_trimq_len": 0,
"epoch": 520,
"up": [
4,
9
],
"acting": [
4,
9
],
"acting_recovery_backfill": [
"4",
"9"
],
"info": {
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "520'2456340",
"log_tail": "520'2453330",
"last_user_version": 7914566,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 63,
"epoch_pool_created": 63,
"last_epoch_started": 520,
"last_interval_started": 519,
"last_epoch_clean": 520,
"last_interval_clean": 519,
"last_epoch_split": 0,
"last_epoch_marked_full": 0,
"same_up_since": 519,
"same_interval_since": 519,
"same_primary_since": 514,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268"
},
"stats": {
"version": "520'2456340",
"reported_seq": "6115579",
"reported_epoch": "520",
"state": "active+clean+inconsistent",
"last_fresh": "2018-09-25 03:02:34.338256",
"last_change": "2018-09-25 02:17:35.631476",
"last_active": "2018-09-25 03:02:34.338256",
"last_peered": "2018-09-25 03:02:34.338256",
"last_clean": "2018-09-25 03:02:34.338256",
"last_became_active": "2018-09-24 15:25:30.238044",
"last_became_peered": "2018-09-24 15:25:30.238044",
"last_unstale": "2018-09-25 03:02:34.338256",
"last_undegraded": "2018-09-25 03:02:34.338256",
"last_fullsized": "2018-09-25 03:02:34.338256",
"mapping_epoch": 519,
"log_start": "520'2453330",
"ondisk_log_start": "520'2453330",
"created": 63,
"last_epoch_clean": 520,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "520'2456105",
"last_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_deep_scrub": "520'2456105",
"last_deep_scrub_stamp": "2018-09-25 02:17:35.631365",
"last_clean_scrub_stamp": "2018-09-19 02:27:22.656268",
"log_size": 3010,
"ondisk_log_size": 3010,
"stats_invalid": false,
"dirty_stats_invalid": false,
"omap_stats_invalid": false,
"hitset_stats_invalid": false,
"hitset_bytes_stats_invalid": false,
"pin_stats_invalid": false,
"manifest_stats_invalid": false,
"snaptrimq_len": 0,
"stat_sum": {
"num_bytes": 23138366490,
"num_objects": 479532,
"num_object_clones": 0,
"num_object_copies": 959064,
"num_objects_missing_on_primary": 0,
"num_objects_missing": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 479532,
"num_whiteouts": 0,
"num_read": 3295720,
"num_read_kb": 63508374,
"num_write": 2495519,
"num_write_kb": 81795199,
"num_scrub_errors": 3,
"num_shallow_scrub_errors": 3,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 550,
"num_bytes_recovered": 15760916,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0,
"num_flush": 0,
"num_flush_kb": 0,
"num_evict": 0,
"num_evict_kb": 0,
"num_promote": 0,
"num_flush_mode_high": 0,
"num_flush_mode_low": 0,
"num_evict_mode_some": 0,
"num_evict_mode_full": 0,
"num_objects_pinned": 0,
"num_legacy_snapsets": 0,
"num_large_omap_objects": 0,
"num_objects_manifest": 0
},
"up": [
4,
9
],
"acting": [
4,
9
],
"blocked_by": [],
"up_primary": 4,
"acting_primary": 4,
"purged_snaps": []
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 520,
"hit_set_history": {
"current_last_update": "0'0",
"history": []
}
},
"peer_info": [
{
"peer": "9",
"pgid": "1.92",
"last_update": "520'2456340",
"last_complete": "515'2438936",
"log_tail": "511'2435926",
"last_user_version": 7902301,
"last_backfill": "MAX",
"last_backfill_bitwise": 0,
"purged_snaps": [],
"history": {
"epoch_created": 63,
"epoch_pool_created": 63,
"last_epoch_started": 520,
"last_interval_started": 519,
"last_epoch_clean": 520,
"last_interval_clean": 

Re: [ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-20 Thread David Turner
As a part of the repair operation it runs a deep-scrub on the PG.  If it
showed active+clean after the repair and deep-scrub finished, then the next
run of a scrub on the PG shouldn't change the PG status at all.
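
Put differently, after the repair the state can simply be checked again (PG id
from this thread; jq only pulls the field out):

ceph pg repair 6.20
# once the repair-driven deep scrub has finished:
ceph pg 6.20 query | jq -r .state        # expect "active+clean"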

On Wed, Jun 6, 2018 at 8:57 PM Adrian  wrote:

> Update to this.
>
> The affected pg didn't seem inconsistent:
>
> [root@admin-ceph1-qh2 ~]# ceph health detail
>
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
>pg 6.20 is active+clean+inconsistent, acting [114,26,44]
> [root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20
> --format=json-pretty
> {
>"epoch": 210034,
>"inconsistents": []
> }
>
> Although pg query showed the primary info.stats.stat_sum.num_bytes
> differed from the peers
>
> A pg repair on 6.20 seems to have resolved the issue for now but the
> info.stats.stat_sum.num_bytes still differs so presumably will become
> inconsistent again next time it scrubs.
>
> Adrian.
>
> On Tue, Jun 5, 2018 at 12:09 PM, Adrian  wrote:
>
>> Hi Cephers,
>>
>> We recently upgraded one of our clusters from hammer to jewel and then to
>> luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
>> deep-scubs we have an inconsistent pg with a log message we've not seen
>> before:
>>
>> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
>> OSD_SCRUB_ERRORS 1 scrub errors
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>> pg 6.20 is active+clean+inconsistent, acting [114,26,44]
>>
>>
>> Ceph log shows
>>
>> 2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : 
>> cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 
>> 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
>> 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
>> 2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : 
>> cluster [ERR] 6.20 scrub 1 errors
>> 2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 
>> : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
>> 2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 
>> : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent 
>> (PG_DAMAGED)
>> 2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 
>> : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 
>> pg inconsistent
>>
>> There are no EC pools - looks like it may be the same as
>> https://tracker.ceph.com/issues/22656 although as in #7 this is not a
>> cache pool.
>>
>> Wondering if this is ok to issue a pg repair on 6.20 or if there's
>> something else we should be looking at first ?
>>
>> Thanks in advance,
>> Adrian.
>>
>> ---
>> Adrian : aussie...@gmail.com
>> If violence doesn't solve your problem, you're not using enough of it.
>>
>
>
>
> --
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-06 Thread Adrian
Update to this.

The affected pg didn't seem inconsistent:

[root@admin-ceph1-qh2 ~]# ceph health detail
HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
   pg 6.20 is active+clean+inconsistent, acting [114,26,44]
[root@admin-ceph1-qh2 ~]# rados list-inconsistent-obj 6.20
--format=json-pretty
{
   "epoch": 210034,
   "inconsistents": []
}

Although pg query showed the primary info.stats.stat_sum.num_bytes differed
from the peers

A pg repair on 6.20 seems to have resolved the issue for now but the
info.stats.stat_sum.num_bytes still differs so presumably will become
inconsistent again next time it scrubs.
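
Since that figure comes from the pg query output, one way to keep an eye on it
is to compare the primary's accounting with the peers' (assuming jq is
available):

ceph pg 6.20 query | jq '.info.stats.stat_sum.num_bytes, (.peer_info[].stats.stat_sum.num_bytes)'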

Adrian.

On Tue, Jun 5, 2018 at 12:09 PM, Adrian  wrote:

> Hi Cephers,
>
> We recently upgraded one of our clusters from hammer to jewel and then to
> luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
> deep-scubs we have an inconsistent pg with a log message we've not seen
> before:
>
> HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
> OSD_SCRUB_ERRORS 1 scrub errors
> PG_DAMAGED Possible data damage: 1 pg inconsistent
> pg 6.20 is active+clean+inconsistent, acting [114,26,44]
>
>
> Ceph log shows
>
> 2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395 : 
> cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87 clones, 
> 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
> 25952454144/25952462336 bytes, 0/0 hit_set_archive bytes.
> 2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396 : 
> cluster [ERR] 6.20 scrub 1 errors
> 2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41298 
> : cluster [ERR] Health check failed: 1 scrub errors (OSD_SCRUB_ERRORS)
> 2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41299 
> : cluster [ERR] Health check failed: Possible data damage: 1 pg inconsistent 
> (PG_DAMAGED)
> 2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0 41345 
> : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg 
> inconsistent
>
> There are no EC pools - looks like it may be the same as
> https://tracker.ceph.com/issues/22656 although as in #7 this is not a
> cache pool.
>
> Wondering if this is ok to issue a pg repair on 6.20 or if there's
> something else we should be looking at first ?
>
> Thanks in advance,
> Adrian.
>
> ---
> Adrian : aussie...@gmail.com
> If violence doesn't solve your problem, you're not using enough of it.
>



-- 
---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg inconsistent, scrub stat mismatch on bytes

2018-06-04 Thread Adrian
Hi Cephers,

We recently upgraded one of our clusters from hammer to jewel and then to
luminous (12.2.5, 5 mons/mgr, 21 storage nodes * 9 osd's). After some
deep-scubs we have an inconsistent pg with a log message we've not seen
before:

HEALTH_ERR 1 scrub errors; Possible data damage: 1 pg inconsistent
OSD_SCRUB_ERRORS 1 scrub errors
PG_DAMAGED Possible data damage: 1 pg inconsistent
pg 6.20 is active+clean+inconsistent, acting [114,26,44]


Ceph log shows

2018-06-03 06:53:35.467791 osd.114 osd.114 172.26.28.25:6825/40819 395
: cluster [ERR] 6.20 scrub stat mismatch, got 6526/6526 objects, 87/87
clones, 6526/6526 dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive,
0/0 whiteouts, 25952454144/25952462336 bytes, 0/0 hit_set_archive
bytes.
2018-06-03 06:53:35.467799 osd.114 osd.114 172.26.28.25:6825/40819 396
: cluster [ERR] 6.20 scrub 1 errors
2018-06-03 06:53:40.701632 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41298 : cluster [ERR] Health check failed: 1 scrub errors
(OSD_SCRUB_ERRORS)
2018-06-03 06:53:40.701668 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41299 : cluster [ERR] Health check failed: Possible data damage: 1 pg
inconsistent (PG_DAMAGED)
2018-06-03 07:00:00.000137 mon.mon1-ceph1-qh2 mon.0 172.26.28.8:6789/0
41345 : cluster [ERR] overall HEALTH_ERR 1 scrub errors; Possible data
damage: 1 pg inconsistent

There are no EC pools - looks like it may be the same as
https://tracker.ceph.com/issues/22656 although as in #7 this is not a cache
pool.

Wondering if this is ok to issue a pg repair on 6.20 or if there's
something else we should be looking at first ?

Thanks in advance,
Adrian.

---
Adrian : aussie...@gmail.com
If violence doesn't solve your problem, you're not using enough of it.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent

2018-03-08 Thread Harald Staub

Hi Brad

Thank you very much for your attention.

On 07.03.2018 23:46, Brad Hubbard wrote:

On Thu, Mar 8, 2018 at 1:22 AM, Harald Staub  wrote:

"ceph pg repair" leads to:
5.7bd repair 2 errors, 0 fixed

Only an empty list from:
rados list-inconsistent-obj 5.7bd --format=json-pretty

Inspired by http://tracker.ceph.com/issues/12577 , I tried again with more
verbose logging and searched the osd logs e.g. for "!=", "mismatch", could
not find anything interesting. Oh well, these are several millions of lines
...

Any hint what I could look for?


Try searching for "scrub_compare_maps" and looking for "5.7bd" in that context.


These lines (from the primary OSD) may be interesting:

2018-03-07 14:20:31.405120 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: error(s) osd 442 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7, 
object_info_inconsistency
2018-03-07 14:20:31.405134 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: selecting osd 340 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7 with 
oi 
5:bde7a84d:::rbd_data.d393823accce24.00010214:336d7(505072'35716889 
osd.340.0:258067 dirty|data_digest|omap_digest s 4194304 uv 35452964 dd 
68383c60 od  alloc_hint [0 0 0])
2018-03-07 14:20:31.405172 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] be_select_auth_object: error(s) osd 442 
for obj 5:bde7a84d:::rbd_data.d393823accce24.00010214:head, 
snapset_inconsistency object_info_inconsistency
2018-03-07 14:20:31.405404 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] scrub_snapshot_metadata (repair) finish
2018-03-07 14:20:31.405413 7f42497c4700 10 osd.340 pg_epoch: 505959 
pg[5.7bd( v 505959'35722945 (505688'35721366,505959'35722945] 
local-lis/les=505083/505086 n=16133 ec=859/859 lis/c 505083/505083 
les/c/f 505086/505086/0 505083/505083/505083) [340,491,442] r=0 
lpr=505083 crt=505959'35722945 lcod 505959'35722944 mlcod 
505959'35722944 active+clean+scrubbing+deep+inconsistent+repair 
snaptrimq=[3565b~18,35674~2]] scrub_compare_maps: discarding scrub results


Then I had another idea. The inconsistency errors were triggered by a 
scrub, not a deep scrub. So I triggered another scrub:


ceph pg scrub 5.7bd

And the problem got fixed.

Cheers
 Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent

2018-03-07 Thread Brad Hubbard
On Thu, Mar 8, 2018 at 1:22 AM, Harald Staub  wrote:
> "ceph pg repair" leads to:
> 5.7bd repair 2 errors, 0 fixed
>
> Only an empty list from:
> rados list-inconsistent-obj 5.7bd --format=json-pretty
>
> Inspired by http://tracker.ceph.com/issues/12577 , I tried again with more
> verbose logging and searched the osd logs e.g. for "!=", "mismatch", could
> not find anything interesting. Oh well, these are several millions of lines
> ...
>
> Any hint what I could look for?

Try searching for "scrub_compare_maps" and looking for "5.7bd" in that context.
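
Concretely, something like this on the primary OSD's host (osd.340 in this
thread, default log path):

grep 'scrub_compare_maps' /var/log/ceph/ceph-osd.340.log | grep '5.7bd'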

>
> The 3 OSDs involved are running on 12.2.4, one of them is on BlueStore.
>
> Cheers
>  Harry
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] pg inconsistent

2018-03-07 Thread Harald Staub

"ceph pg repair" leads to:
5.7bd repair 2 errors, 0 fixed

Only an empty list from:
rados list-inconsistent-obj 5.7bd --format=json-pretty

Inspired by http://tracker.ceph.com/issues/12577 , I tried again with 
more verbose logging and searched the osd logs e.g. for "!=", 
"mismatch", could not find anything interesting. Oh well, these are 
several millions of lines ...


Any hint what I could look for?

The 3 OSDs involved are running on 12.2.4, one of them is on BlueStore.

Cheers
 Harry
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] pg inconsistent and repair doesn't work

2017-10-25 Thread Wei Jin
I found it is similar to this bug: http://tracker.ceph.com/issues/21388,
and fixed it with a rados command.
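
For this kind of size_mismatch_oi, where every replica is 0 bytes but the
object info claims 3461120, one workaround that has been used (a sketch only,
not necessarily the exact command used here) is to rewrite the object through
rados so the object info and the on-disk size agree again, then let scrub
re-verify. The pool name is a placeholder, the data itself is still gone, and
the affected file would need to be restored at the CephFS level afterwards:

P=cephfs_data                                   # placeholder: the cephfs data pool
NS=fsvolumens_87c46348-9869-11e7-8525-3497f65a8415
rados -p "$P" -N "$NS" get '103528d.0058' /tmp/obj   # may fail or come back empty here
rados -p "$P" -N "$NS" put '103528d.0058' /tmp/obj   # rewrite: object info now matches the data
ceph pg deep-scrub 1.fcd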

The pg inconsistent info looks like the following; I wish this could be fixed in a future release.

root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# rados
list-inconsistent-obj 1.fcd --format=json-pretty
{
"epoch": 2373,
"inconsistents": [
{
"object": {
"name": "103528d.0058",
"nspace": "fsvolumens_87c46348-9869-11e7-8525-3497f65a8415",
"locator": "",
"snap": "head",
"version": 147490
},
"errors": [],
"union_shard_errors": [
"size_mismatch_oi"
],
"selected_object_info":
"1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])",
"shards": [
{
"osd": 27,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
},
{
"osd": 62,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
},
{
"osd": 133,
"errors": [
"size_mismatch_oi"
],
"size": 0,
"omap_digest": "0x",
"data_digest": "0x"
}
]
}
]
}

On Wed, Oct 25, 2017 at 12:05 PM, Wei Jin  wrote:
> Hi, list,
>
> We ran into pg deep scrub error. And we tried to repair it by `ceph pg
> repair pgid`. But it didn't work. We also verified object files,  and
> found both 3 replicas were zero size. What's the problem, whether it
> is a bug? And how to fix the inconsistent? I haven't restarted the
> osds so far as I am not sure whether it works.
>
> ceph version: 10.2.9
> user case: cephfs
> kernel client: 4.4/4.9
>
> Error info from primary osd:
>
> root@n10-075-019:~# grep -Hn 'ERR' /var/log/ceph/ceph-osd.27.log.1
> /var/log/ceph/ceph-osd.27.log.1:3038:2017-10-25 04:47:34.460536
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 27: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3039:2017-10-25 04:47:34.460722
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 62: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3040:2017-10-25 04:47:34.460725
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 133: soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> size 0 != size 3461120 from auth oi
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
> client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
>  alloc_hint [0 0])
> /var/log/ceph/ceph-osd.27.log.1:3041:2017-10-25 04:47:34.460800
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd soid
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head:
> failed to pick suitable auth object
> /var/log/ceph/ceph-osd.27.log.1:3042:2017-10-25 04:47:34.461458
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : deep-scrub 1.fcd
> 1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
> on disk size (0) does not match object info size (3461120) adjusted
> for ondisk to (3461120)
> /var/log/ceph/ceph-osd.27.log.1:3043:2017-10-25 04:47:44.645934
> 7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd deep-scrub 4
> errors
>
>
> Object file info:
>
> root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# find .
> -name "103528d.0058__head_12086FCD*"
> ./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
> root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# ls -al
> ./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
> -rw-r--r-- 1 

[ceph-users] pg inconsistent and repair doesn't work

2017-10-24 Thread Wei Jin
Hi, list,

We ran into a pg deep scrub error and tried to repair it with `ceph pg
repair <pgid>`, but it didn't work. We also verified the object files and
found that all 3 replicas are zero size. What is the problem, and is it a
bug? How can we fix the inconsistency? I haven't restarted the OSDs so far
as I am not sure whether that would help.

ceph version: 10.2.9
use case: cephfs
kernel client: 4.4/4.9

Error info from primary osd:

root@n10-075-019:~# grep -Hn 'ERR' /var/log/ceph/ceph-osd.27.log.1
/var/log/ceph/ceph-osd.27.log.1:3038:2017-10-25 04:47:34.460536
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 27: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3039:2017-10-25 04:47:34.460722
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 62: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3040:2017-10-25 04:47:34.460725
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd shard 133: soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
size 0 != size 3461120 from auth oi
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head(2401'147490
client.901549.1:33749 dirty|omap_digest s 3461120 uv 147490 od
 alloc_hint [0 0])
/var/log/ceph/ceph-osd.27.log.1:3041:2017-10-25 04:47:34.460800
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd soid
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head:
failed to pick suitable auth object
/var/log/ceph/ceph-osd.27.log.1:3042:2017-10-25 04:47:34.461458
7f39c4829700 -1 log_channel(cluster) log [ERR] : deep-scrub 1.fcd
1:b3f61048:fsvolumens_87c46348-9869-11e7-8525-3497f65a8415::103528d.0058:head
on disk size (0) does not match object info size (3461120) adjusted
for ondisk to (3461120)
/var/log/ceph/ceph-osd.27.log.1:3043:2017-10-25 04:47:44.645934
7f39c4829700 -1 log_channel(cluster) log [ERR] : 1.fcd deep-scrub 4
errors


Object file info:

root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-019:/var/lib/ceph/osd/ceph-27/current/1.fcd_head#


root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-028:/var/lib/ceph/osd/ceph-62/current/1.fcd_head#


root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# find .
-name "103528d.0058__head_12086FCD*"
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head# ls -al
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD*
-rw-r--r-- 1 ceph ceph 0 Oct 24 22:04
./DIR_D/DIR_C/DIR_F/DIR_6/DIR_8/103528d.0058__head_12086FCD_fsvolumens\u87c46348-9869-11e7-8525-3497f65a8415_1
root@n10-075-040:/var/lib/ceph/osd/ceph-133/current/1.fcd_head#
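
One additional cross-check that may be worth doing before changing
anything (the pool name is a placeholder, the namespace and object name
are taken from the scrub errors above):

rados -p <cephfs-data-pool> -N fsvolumens_87c46348-9869-11e7-8525-3497f65a8415 stat 103528d.0058

rados stat reports the size recorded in the object info, so it should show
the 3461120 bytes the scrub complains about, while all three on-disk files
above are 0 bytes.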


Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-14 Thread Marc Roos
/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] 
Shutdown: canceling all background work
2017-08-09 11:41:25.471514 7f26db8ae100  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown 
complete
2017-08-09 11:41:25.686088 7f26db8ae100  1 bluefs umount
2017-08-09 11:41:25.705389 7f26db8ae100  1 bdev(0x7f26de472e00 
/var/lib/ceph/osd/ceph-0/block) close
2017-08-09 11:41:25.944548 7f26db8ae100  1 bdev(0x7f26de2b3a00 
/var/lib/ceph/osd/ceph-0/block) close


-Original Message-
From: Sage Weil [mailto:s...@newdream.net]
Sent: Wednesday 9 August 2017 4:44
To: Brad Hubbard
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Pg inconsistent / export_files error -5

On Wed, 9 Aug 2017, Brad Hubbard wrote:
> Wee
> 
> On Wed, Aug 9, 2017 at 12:41 AM, Marc Roos <m.r...@f1-outsourcing.eu> 
wrote:
> >
> >
> >
> > The --debug indeed comes up with something
> > bluestore(/var/lib/ceph/osd/ceph-12) _verify_csum bad crc32c/0x1000 
> > checksum at blob offset 0x0, got 0x100ac314, expected 0x90407f75, 
> > device location [0x15a017~1000], logical extent 0x0~1000,
> >  bluestore(/var/lib/ceph/osd/ceph-9) _verify_csum bad crc32c/0x1000 
> > checksum at blob offset 0x0, got 0xb40b26a7, expected 0x90407f75, 
> > device location [0x2daea~1000], logical extent 0x0~1000,

What about the 3rd OSD?

It would be interesting to capture the fsck output for one of these.  
Stop the OSD, and then run

 ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12 --log-file 
out \
--debug-bluestore 30 --no-log-to-stderr

That'll generate a pretty huge log, but should include dumps of onode 
metadata and will hopefully include something else with the checksum of
0x100ac314 so we can get some clue as to where the bad data came from.
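
Once that fsck run finishes, a search like

 grep -n -B 5 -A 5 "100ac314" out | less

("out" being the file passed to --log-file above) should pull the relevant 
onode and extent dump lines out of it, assuming the checksum is printed in 
the same hex form there.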

Thanks!
sage


> >
> > I don't know how to interpret this, but am I correct to understand 
> > that data has been written across the cluster to these 3 OSDs and 
> > all 3 have somehow received something different?
> 
> Did you run this command on OSD 0? What was the output in that case?
> 
> Possibly, all we currently know for sure is that the crc32c checksum 
> for the object on OSDs 12 and 9 do not match the expected checksum 
> according to the code when we attempt to read the object 
> #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#. There 
> seems to be some history behind this based on your previous emails 
> regarding these OSDs (12,9,0, and possibly 13) could you give us as 
> much detail as possible about how this issue came about and what you 
> have done in the interim to try to resolve it?
> 
> When was the first indication there was a problem with pg 17.36? Did 
> this correspond with any significant event?
> 
> Are these OSDs all on separate hosts?
> 
> It's possible ceph-bluestore-tool may help here but I would hold off 
> on that option until we understand the issue better.
> 
> 
> >
> >
> > size=4194304 object_info:
> > 17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head(5387'351
> > 57
> > client.2096993.0:78941 dirty|data_digest|omap_digest s 4194304 uv
> > 35356 dd f53dff2e od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data section offset=3145728 len=1048576 
> > attrs size
> > 2 omap map size 0 Read
> > #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
> > size=4194304
> > object_info:
> > 17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head(5163'713
> > 6
> > client.2074638.1:483264 dirty|data_digest|omap_digest s 4194304 uv
> > 7418 dd 43d61c5d od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data section offset=3145728 len=1048576 
> > attrs size
> > 2 omap map size 0 Read
> > #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
> > size=4194304
> > object_info:
> > 17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head(5236'764
> > 0
> > client.2074638.1:704364 dirty|data_digest|omap_digest s 4194304 uv
> > 7922 dd 3bcff64d od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data

Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-09 Thread Marc Roos
c:217] 
Shutdown: canceling all background work
2017-08-09 11:41:25.471514 7f26db8ae100  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown 
complete
2017-08-09 11:41:25.686088 7f26db8ae100  1 bluefs umount
2017-08-09 11:41:25.705389 7f26db8ae100  1 bdev(0x7f26de472e00 
/var/lib/ceph/osd/ceph-0/block) close
2017-08-09 11:41:25.944548 7f26db8ae100  1 bdev(0x7f26de2b3a00 
/var/lib/ceph/osd/ceph-0/block) close

-Original Message-
From: Sage Weil [mailto:s...@newdream.net] 
Sent: Wednesday 9 August 2017 4:44
To: Brad Hubbard
Cc: Marc Roos; ceph-users
Subject: Re: [ceph-users] Pg inconsistent / export_files error -5

On Wed, 9 Aug 2017, Brad Hubbard wrote:
> Wee
> 
> On Wed, Aug 9, 2017 at 12:41 AM, Marc Roos <m.r...@f1-outsourcing.eu> 
wrote:
> >
> >
> >
> > The --debug indeed comes up with something
> > bluestore(/var/lib/ceph/osd/ceph-12) _verify_csum bad crc32c/0x1000 
> > checksum at blob offset 0x0, got 0x100ac314, expected 0x90407f75, 
> > device location [0x15a017~1000], logical extent 0x0~1000,
> >  bluestore(/var/lib/ceph/osd/ceph-9) _verify_csum bad crc32c/0x1000 
> > checksum at blob offset 0x0, got 0xb40b26a7, expected 0x90407f75, 
> > device location [0x2daea~1000], logical extent 0x0~1000,

What about the 3rd OSD?

It would be interesting to capture the fsck output for one of these.  
Stop the OSD, and then run

 ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12 --log-file 
out \
--debug-bluestore 30 --no-log-to-stderr

That'll generate a pretty huge log, but should include dumps of onode 
metadata and will hopefully include something else with the checksum of
0x100ac314 so we can get some clue as to where the bad data came from.

Thanks!
sage


> >
> > I don't know how to interpret this, but am I correct to understand 
> > that data has been written across the cluster to these 3 OSDs and 
> > all 3 have somehow received something different?
> 
> Did you run this command on OSD 0? What was the output in that case?
> 
> Possibly, all we currently know for sure is that the crc32c checksum 
> for the object on OSDs 12 and 9 do not match the expected checksum 
> according to the code when we attempt to read the object 
> #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#. There 
> seems to be some history behind this based on your previous emails 
> regarding these OSDs (12,9,0, and possibly 13) could you give us as 
> much detail as possible about how this issue came about and what you 
> have done in the interim to try to resolve it?
> 
> When was the first indication there was a problem with pg 17.36? Did 
> this correspond with any significant event?
> 
> Are these OSDs all on separate hosts?
> 
> It's possible ceph-bluestore-tool may help here but I would hold off 
> on that option until we understand the issue better.
> 
> 
> >
> >
> > size=4194304 object_info:
> > 17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head(5387'351
> > 57
> > client.2096993.0:78941 dirty|data_digest|omap_digest s 4194304 uv 
> > 35356 dd f53dff2e od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data section offset=3145728 len=1048576 
> > attrs size
> > 2 omap map size 0 Read
> > #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head# 
> > size=4194304
> > object_info:
> > 17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head(5163'713
> > 6
> > client.2074638.1:483264 dirty|data_digest|omap_digest s 4194304 uv 
> > 7418 dd 43d61c5d od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data section offset=3145728 len=1048576 
> > attrs size
> > 2 omap map size 0 Read
> > #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head# 
> > size=4194304
> > object_info:
> > 17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head(5236'764
> > 0
> > client.2074638.1:704364 dirty|data_digest|omap_digest s 4194304 uv 
> > 7922 dd 3bcff64d od  alloc_hint [4194304 4194304 0]) data 
> > section offset=0
> > len=1048576 data section offset=1048576 len=1048576 data section
> > offset=2097152 len=1048576 data section offset=3145728 len=1048576 
> > attrs size
> > 2 omap map size 0 Read
> > #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head# 
> > si

Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-08 Thread Sage Weil
e10f5100 -1
> > bluestore(/var/lib/ceph/osd/ceph-9) _verify_csum bad crc32c/0x1000 checksum 
> > at
> > blob offset 0x0, got 0xb40b26a7, expected 0x90407f75, device location
> > [0x2daea~1000], logical extent 0x0~1000, object
> > #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4# export_files 
> > error
> > -5 2017-08-08 16:22:00.895439 7f94e10f5100  1
> > bluestore(/var/lib/ceph/osd/ceph-9) umount 2017-08-08 16:22:00.963774
> > 7f94e10f5100  1 freelist shutdown 2017-08-08 16:22:00.963861 7f94e10f5100  4
> > rocksdb:
> > [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> > CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> > 12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/rocksdb/db/db_impl.cc:217] Shutdown:
> > canceling all background work 2017-08-08 16:22:00.968438 7f94e10f5100  4
> > rocksdb:
> > [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_AR
> > CH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/
> > 12.1.1/rpm/el7/BUILD/ceph-12.1.1/src/rocksdb/db/db_impl.cc:343] Shutdown
> > complete 2017-08-08 16:22:00.984583 7f94e10f5100  1 bluefs umount 2017-08-08
> > 16:22:01.026784 7f94e10f5100  1 bdev(0x7f94e3670e00
> > /var/lib/ceph/osd/ceph-9/block) close 2017-08-08 16:22:01.243361 
> > 7f94e10f5100
> > 1 bdev(0x7f94e34b5a00 /var/lib/ceph/osd/ceph-9/block) close
> >
> >
> > 23555 16:26:31.336061 io_getevents(139955679129600, 1, 16,  
> > 23552 16:26:31.336081 futex(0x7ffe7e4c9210, FUTEX_WAKE_PRIVATE, 1) = 0
> > <0.000155> 23552 16:26:31.336452 futex(0x7f49fb4d20bc, 
> > FUTEX_WAKE_OP_PRIVATE,
> > 1, 1, 0x7f49fb4d20b8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000129>
> > 23553 16:26:31.336637 <... futex resumed> ) = 0 <16.434259> 23553
> > 16:26:31.336758 futex(0x7f49fb4d2038, FUTEX_WAKE_PRIVATE, 1 
> > 23552 16:26:31.336801 madvise(0x7f4a0cafa000, 2555904, MADV_DONTNEED
> >  23553 16:26:31.336915 <... futex resumed> ) = 0 <0.000113>
> > 23552 16:26:31.336959 <... madvise resumed> ) = 0 <0.000148> 23553
> > 16:26:31.337040 futex(0x7f49fb4d20bc, FUTEX_WAIT_PRIVATE, 55, NULL 
> >  > ...> 23552 16:26:31.337070 madvise(0x7f4a0ca7a000, 3080192, MADV_DONTNEED) 
> > = 0
> > <0.000180> 23552 16:26:31.337424 write(2, "export_files error ",
> > 19) = 19 <0.000104> 23552 16:26:31.337615 write(2, "-5", 2) = 2
> > <0.17> 23552 16:26:31.337674 write(2, "\n", 1) = 1 
> > <0.37>
> > 23552 16:26:31.338270 madvise(0x7f4a01ae4000, 16384, MADV_DONTNEED) = 0
> > <0.20> 23552 16:26:31.338320 madvise(0x7f4a018cc000, 49152, 
> > MADV_DONTNEED)
> > = 0 <0.14> 23552 16:26:31.338561 madvise(0x7f4a0770a000, 24576,
> > MADV_DONTNEED) = 0 <0.15> 23552 16:26:31.339161 madvise(0x7f4a02102000,
> > 16384, MADV_DONTNEED) = 0 <0.15> 23552 16:26:31.339201
> > madvise(0x7f4a02132000, 16384, MADV_DONTNEED) = 0 <0.13> 23552
> > 16:26:31.339235 madvise(0x7f4a02102000, 32768, MADV_DONTNEED) = 0 <0.14>
> > 23552 16:26:31.339331 madvise(0x7f4a01df8000, 16384, MADV_DONTNEED) = 0
> > <0.19> 23552 16:26:31.339372 madvise(0x7f4a01df8000, 32768, 
> > MADV_DONTNEED)
> > = 0 <0.13>
> >
> >
> > -Original Message- From: Brad Hubbard [mailto:bhubb...@redhat.com]
> > Sent: 07 August 2017 02:34 To: Marc Roos Cc: ceph-users Subject: Re:
> > [ceph-users] Pg inconsistent / export_files error -5
> >
> >
> >
> > On Sat, Aug 5, 2017 at 1:21 AM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
> >>
> >> I have got a placement group inconsistency, and saw some manual where you 
> >> can
> >> export and import this on another osd. But I am getting an export error on
> >> every osd.
> >>
> >> What does this export_files error -5 actually mean? I thought 3 copies
> >
> > #define EIO  5  /* I/O error */
> >
> >> should be enough to secure your data.
> >>
> >>
> >>> PG_DAMAGED Possible data damage: 1 pg inconsistent pg 17.36 is
> >>> active+clean+inconsistent, acting [9,0,12]
> >>
> >>
> >>> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log
> >> [ERR] : 17.36 soid 
> >> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4:
> >> failed to pick suitable object info
> >>> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log
> >> [ERR] :

Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-08 Thread Brad Hubbard
vents(139955679129600, 1, 16,  
> 23552 16:26:31.336081 futex(0x7ffe7e4c9210, FUTEX_WAKE_PRIVATE, 1) = 0
> <0.000155> 23552 16:26:31.336452 futex(0x7f49fb4d20bc, FUTEX_WAKE_OP_PRIVATE,
> 1, 1, 0x7f49fb4d20b8, {FUTEX_OP_SET, 0, FUTEX_OP_CMP_GT, 1}) = 1 <0.000129>
> 23553 16:26:31.336637 <... futex resumed> ) = 0 <16.434259> 23553
> 16:26:31.336758 futex(0x7f49fb4d2038, FUTEX_WAKE_PRIVATE, 1 
> 23552 16:26:31.336801 madvise(0x7f4a0cafa000, 2555904, MADV_DONTNEED
>  23553 16:26:31.336915 <... futex resumed> ) = 0 <0.000113>
> 23552 16:26:31.336959 <... madvise resumed> ) = 0 <0.000148> 23553
> 16:26:31.337040 futex(0x7f49fb4d20bc, FUTEX_WAIT_PRIVATE, 55, NULL  ...> 23552 16:26:31.337070 madvise(0x7f4a0ca7a000, 3080192, MADV_DONTNEED) = 0
> <0.000180> 23552 16:26:31.337424 write(2, "export_files error ",
> 19) = 19 <0.000104> 23552 16:26:31.337615 write(2, "-5", 2) = 2
> <0.17> 23552 16:26:31.337674 write(2, "\n", 1) = 1 <0.37>
> 23552 16:26:31.338270 madvise(0x7f4a01ae4000, 16384, MADV_DONTNEED) = 0
> <0.20> 23552 16:26:31.338320 madvise(0x7f4a018cc000, 49152, MADV_DONTNEED)
> = 0 <0.14> 23552 16:26:31.338561 madvise(0x7f4a0770a000, 24576,
> MADV_DONTNEED) = 0 <0.15> 23552 16:26:31.339161 madvise(0x7f4a02102000,
> 16384, MADV_DONTNEED) = 0 <0.15> 23552 16:26:31.339201
> madvise(0x7f4a02132000, 16384, MADV_DONTNEED) = 0 <0.13> 23552
> 16:26:31.339235 madvise(0x7f4a02102000, 32768, MADV_DONTNEED) = 0 <0.14>
> 23552 16:26:31.339331 madvise(0x7f4a01df8000, 16384, MADV_DONTNEED) = 0
> <0.19> 23552 16:26:31.339372 madvise(0x7f4a01df8000, 32768, MADV_DONTNEED)
> = 0 <0.13>
>
>
> -Original Message- From: Brad Hubbard [mailto:bhubb...@redhat.com]
> Sent: 07 August 2017 02:34 To: Marc Roos Cc: ceph-users Subject: Re:
> [ceph-users] Pg inconsistent / export_files error -5
>
>
>
> On Sat, Aug 5, 2017 at 1:21 AM, Marc Roos <m.r...@f1-outsourcing.eu> wrote:
>>
>> I have got a placement group inconsistency, and saw some manual where you can
>> export and import this on another osd. But I am getting an export error on
>> every osd.
>>
>> What does this export_files error -5 actually mean? I thought 3 copies
>
> #define EIO  5  /* I/O error */
>
>> should be enough to secure your data.
>>
>>
>>> PG_DAMAGED Possible data damage: 1 pg inconsistent pg 17.36 is
>>> active+clean+inconsistent, acting [9,0,12]
>>
>>
>>> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log
>> [ERR] : 17.36 soid 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4:
>> failed to pick suitable object info
>>> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log
>> [ERR] : 17.36 deep-scrub 3 errors
>>> 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) log
>> [ERR] : 17.36 soid 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4:
>> failed to pick suitable object info
>>> 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) log
>> [ERR] : 17.36 repair 3 errors, 0 fixed
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 17.36 --op
>> export --file /tmp/recover.17.36
>
> Can you run this command under strace like so?
>
> # strace -fvttyyTo /tmp/strace.out -s 1024 ceph-objectstore-tool --data-path
> /var/lib/ceph/osd/ceph-12 --pgid 17.36 --op export --file /tmp/recover.17.36
>
> Then see if you can find which syscall is returning EIO.
>
> # grep "= \-5" /tmp/strace.out
>
>>
>> ... Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head# Read
>> #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head# Read
>> #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head# Read
>> #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head# Read
>> #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head# Read
>> #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head# Read
>> #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head# Read
>> #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head# Read
>> #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4# export_files error
>> -5
>
> Running the command with "--debug" appended will give more output which may
> shed more light as well.
>
>
>
>
> -- Cheers, Brad
>
>



-- 
Cheers,
Brad


Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-08 Thread Marc Roos
ise(0x7f4a01df8000, 16384, MADV_DONTNEED) = 0 
<0.19>
23552 16:26:31.339372 madvise(0x7f4a01df8000, 32768, MADV_DONTNEED) = 0 
<0.13>


-Original Message-
From: Brad Hubbard [mailto:bhubb...@redhat.com] 
Sent: 07 August 2017 02:34
To: Marc Roos
Cc: ceph-users
Subject: Re: [ceph-users] Pg inconsistent / export_files error -5



On Sat, Aug 5, 2017 at 1:21 AM, Marc Roos <m.r...@f1-outsourcing.eu> 
wrote:
>
> I have got a placement group inconsistency, and saw some manual where 
> you can export and import this on another osd. But I am getting an 
> export error on every osd.
>
> What does this export_files error -5 actually mean? I thought 3 copies

#define EIO  5  /* I/O error */

> should be enough to secure your data.
>
>
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>pg 17.36 is active+clean+inconsistent, acting [9,0,12]
>
>
>> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to 
> pick suitable object info
>> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 deep-scrub 3 errors
>> 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to 
> pick suitable object info
>> 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 repair 3 errors, 0 fixed
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 
> 17.36 --op export --file /tmp/recover.17.36

Can you run this command under strace like so?

# strace -fvttyyTo /tmp/strace.out -s 1024 ceph-objectstore-tool 
--data-path /var/lib/ceph/osd/ceph-12 --pgid 17.36 --op export --file 
/tmp/recover.17.36

Then see if you can find which syscall is returning EIO.

# grep "= \-5" /tmp/strace.out

>
> ...
> Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head#
> Read #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head#
> Read #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head#
> Read #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head#
> Read #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head#
> Read #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
> Read #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
> Read #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head#
> Read #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#
> export_files error -5

Running the command with "--debug" appended will give more output which 
may shed more light as well.




--
Cheers,
Brad




Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-06 Thread Brad Hubbard


On Sat, Aug 5, 2017 at 1:21 AM, Marc Roos  wrote:
>
> I have got a placement group inconsistency, and saw some manual where
> you can export and import this on another osd. But I am getting an
> export error on every osd.
>
> What does this export_files error -5 actually mean? I thought 3 copies

#define EIO  5  /* I/O error */
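
A quick way to translate such negative return codes, assuming Python 3 is
available:

python3 -c "import errno, os; print(errno.errorcode[5], os.strerror(5))"

which prints "EIO Input/output error".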

> should be enough to secure your data.
>
>
>> PG_DAMAGED Possible data damage: 1 pg inconsistent
>>pg 17.36 is active+clean+inconsistent, acting [9,0,12]
>
>
>> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick
> suitable object info
>> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 deep-scrub 3 errors
>> 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick
> suitable object info
>> 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 repair 3 errors, 0 fixed
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 17.36
> --op export --file /tmp/recover.17.36

Can you run this command under strace like so?

# strace -fvttyyTo /tmp/strace.out -s 1024 ceph-objectstore-tool --data-path 
/var/lib/ceph/osd/ceph-12 --pgid 17.36 --op export --file /tmp/recover.17.36

Then see if you can find which syscall is returning EIO.

# grep "= \-5" /tmp/strace.out
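
Because the strace flags above include -yy, the file descriptor in the
matching syscall is annotated with its path, so the offending line should
already name the file or block device that returned EIO. For a bit of
context around each hit, something like this also works:

# grep -n -B 5 "= -5" /tmp/strace.out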

>
> ...
> Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head#
> Read #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head#
> Read #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head#
> Read #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head#
> Read #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head#
> Read #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
> Read #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
> Read #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head#
> Read #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#
> export_files error -5

Running the command with "--debug" appended will give more output which may shed
more light as well.




-- 
Cheers,
Brad


Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-04 Thread Marc Roos
 
I am still on 12.1.1; it is still a 3-node test cluster, nothing much 
happening. The 2nd node had some issues a while ago: I had an osd.8 that 
didn't want to start, so I replaced it.



-Original Message-
From: David Turner [mailto:drakonst...@gmail.com] 
Sent: Friday 4 August 2017 17:52
To: Marc Roos; ceph-users
Subject: Re: [ceph-users] Pg inconsistent / export_files error -5

It _should_ be enough. What happened in your cluster recently? Power 
outage, OSD failures, upgrade, added new hardware, any changes at all? 
What is your Ceph version?

On Fri, Aug 4, 2017 at 11:22 AM Marc Roos <m.r...@f1-outsourcing.eu> 
wrote:



I have got a placement group inconsistency, and saw some manual 
where
you can export and import this on another osd. But I am getting an
export error on every osd.

What does this export_files error -5 actually mean? I thought 3 
copies
should be enough to secure your data.


> PG_DAMAGED Possible data damage: 1 pg inconsistent
>pg 17.36 is active+clean+inconsistent, acting [9,0,12]


> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) 
log
[ERR] : 17.36 soid
17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to 
pick
suitable object info
> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) 
log
[ERR] : 17.36 deep-scrub 3 errors
> 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) 
log
[ERR] : 17.36 soid
17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to 
pick
suitable object info
> 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) 
log
[ERR] : 17.36 repair 3 errors, 0 fixed

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 
17.36
--op export --file /tmp/recover.17.36

...
Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head#
Read #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head#
Read #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head#
Read #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head#
Read #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head#
Read #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
Read #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
Read #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head#
Read #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#
export_files error -5





Re: [ceph-users] Pg inconsistent / export_files error -5

2017-08-04 Thread David Turner
It _should_ be enough. What happened in your cluster recently? Power
outage, OSD failures, upgrade, added new hardware, any changes at all? What
is your Ceph version?

On Fri, Aug 4, 2017 at 11:22 AM Marc Roos  wrote:

>
> I have got a placement group inconsistency, and saw some manual where
> you can export and import this on another osd. But I am getting an
> export error on every osd.
>
> What does this export_files error -5 actually mean? I thought 3 copies
> should be enough to secure your data.
>
>
> > PG_DAMAGED Possible data damage: 1 pg inconsistent
> >pg 17.36 is active+clean+inconsistent, acting [9,0,12]
>
>
> > 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick
> suitable object info
> > 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 deep-scrub 3 errors
> > 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 soid
> 17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick
> suitable object info
> > 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) log
> [ERR] : 17.36 repair 3 errors, 0 fixed
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 17.36
> --op export --file /tmp/recover.17.36
>
> ...
> Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head#
> Read #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head#
> Read #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head#
> Read #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head#
> Read #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head#
> Read #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
> Read #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
> Read #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head#
> Read #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#
> export_files error -5
>


[ceph-users] Pg inconsistent / export_files error -5

2017-08-04 Thread Marc Roos

I have got a placement group inconsistency, and saw a manual describing 
how you can export this pg and import it on another osd. But I am getting 
an export error on every osd.

What does this export_files error -5 actually mean? I thought 3 copies 
should be enough to secure your data.


> PG_DAMAGED Possible data damage: 1 pg inconsistent
>pg 17.36 is active+clean+inconsistent, acting [9,0,12]


> 2017-08-04 05:39:51.534489 7f2f623d6700 -1 log_channel(cluster) log 
[ERR] : 17.36 soid 
17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick 
suitable object info
> 2017-08-04 05:41:12.715393 7f2f623d6700 -1 log_channel(cluster) log 
[ERR] : 17.36 deep-scrub 3 errors
> 2017-08-04 15:21:12.445799 7f2f623d6700 -1 log_channel(cluster) log 
[ERR] : 17.36 soid 
17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4: failed to pick 
suitable object info
> 2017-08-04 15:22:35.646635 7f2f623d6700 -1 log_channel(cluster) log 
[ERR] : 17.36 repair 3 errors, 0 fixed

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-12 --pgid 17.36 
--op export --file /tmp/recover.17.36

...
Read #17:6c9f811c:::rbd_data.1b42f52ae8944a.1a32:head#
Read #17:6ca035fc:::rbd_data.1fff61238e1f29.b31a:head#
Read #17:6ca0b4f8:::rbd_data.1fff61238e1f29.6fcc:head#
Read #17:6ca0ffbc:::rbd_data.1fff61238e1f29.a214:head#
Read #17:6ca10b29:::rbd_data.1fff61238e1f29.9923:head#
Read #17:6ca11ab9:::rbd_data.1fa8ef2ae8944a.11b4:head#
Read #17:6ca13bed:::rbd_data.1f114174b0dc51.02c6:head#
Read #17:6ca1a791:::rbd_data.1fff61238e1f29.f101:head#
Read #17:6ca1f70a:::rbd_data.1f114174b0dc51.0974:4#
export_files error -5
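
For reference, the import half of that export/import procedure (never
reached here because the export itself fails) would look roughly like this
on a destination OSD that does not already hold the PG; the OSD id is a
placeholder, the systemd unit name is an assumption, and the target OSD is
assumed to be stopped first:

systemctl stop ceph-osd@<id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> --op import --file /tmp/recover.17.36
systemctl start ceph-osd@<id>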


Re: [ceph-users] pg inconsistent : found clone without head

2013-11-26 Thread Laurent Barbe

Hello,

log [INF] : 3.136 repair ok, 0 fixed

Thank you Greg, I did it like that and it worked well.
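
Greg's exact instructions are further down the thread and are truncated in
this archive. The usual manual fix for a "found clone without head" is,
roughly: stop the OSD holding the orphaned clone file, move that file out
of the PG directory, restart the OSD and run the repair again. As a sketch
only, using the stray clone path from the directory listings later in the
thread (and not Greg's verbatim advice):

service ceph stop osd.1   (or however osd.1 is stopped on this system)
cd /var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1
mkdir -p /root/pg3.136-stray
mv rb.0.32a6.238e1f29.00034d6a__5ab_96AD1336__3 /root/pg3.136-stray/
service ceph start osd.1
ceph pg repair 3.136

Repeat the move for each stray clone the repair reported, and keep the
moved files as a backup.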


Laurent


Le 25/11/2013 19:10, Gregory Farnum a écrit :

On Mon, Nov 25, 2013 at 8:10 AM, Laurent Barbe laur...@ksperis.com wrote:

Hello,

Since yesterday, scrub has detected an inconsistent pg :( :

# ceph health detail(ceph version 0.61.9)
HEALTH_ERR 1 pgs inconsistent; 9 scrub errors
pg 3.136 is active+clean+inconsistent, acting [9,1]
9 scrub errors

# ceph pg map 3.136
osdmap e4363 pg 3.136 (3.136) - up [9,1] acting [9,1]

But when I try to repair, osd.9 daemon failed :

# ceph pg repair 3.136
instructing pg 3.136 on osd.9 to repair

2013-11-25 10:04:09.758845 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3
2013-11-25 10:04:09.759862 7fc2f0706700  0 log [ERR] : repair 3.136
96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3 found clone without head
2013-11-25 10:04:12.872908 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
e5822336/rb.0.32a6.238e1f29.00036552/5b3//3
2013-11-25 10:04:12.873064 7fc2f0706700  0 log [ERR] : repair 3.136
e5822336/rb.0.32a6.238e1f29.00036552/5b3//3 found clone without head
2013-11-25 10:04:14.497750 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
38372336/rb.0.32a6.238e1f29.00011379/5bb//3
2013-11-25 10:04:14.497796 7fc2f0706700  0 log [ERR] : repair 3.136
38372336/rb.0.32a6.238e1f29.00011379/5bb//3 found clone without head
2013-11-25 10:04:57.557894 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3
2013-11-25 10:04:57.558052 7fc2f0706700  0 log [ERR] : repair 3.136
109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3 found clone without head
2013-11-25 10:17:45.835145 7fc2f0706700  0 log [ERR] : 3.136 repair stat
mismatch, got 8289/8292 objects, 1981/1984 clones, 26293444608/26294251520
bytes.
2013-11-25 10:17:45.835248 7fc2f0706700  0 log [ERR] : 3.136 repair 4
missing, 0 inconsistent objects
2013-11-25 10:17:45.835320 7fc2f0706700  0 log [ERR] : 3.136 repair 9
errors, 5 fixed
2013-11-25 10:17:45.839963 7fc2f0f07700 -1 osd/ReplicatedPG.cc: In function
'int ReplicatedPG::recover_primary(int)' thread 7fc2f0f07700 time 2013-11-25
10:17:45.836790
osd/ReplicatedPG.cc: 6643: FAILED assert(latest-is_update())


The object (found clone without head) concern the rbd images below (which is
in use) :

# rbd info datashare/share3
rbd image 'share3':
 size 1024 GB in 262144 objects
 order 22 (4096 KB objects)
 block_name_prefix: rb.0.32a6.238e1f29
 format: 1


Directory contents :
In OSD.9 (Primary) :
/var/lib/ceph/osd/ceph-9/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.00034d6a*
-rw-r--r-- 1 root root 4194304 nov.   6 02:25
rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40
rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44
rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52
rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39
rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45
rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59
rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25
rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18
rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.00034d6a__head_96AD1336__3

In OSD.1 (Replica) :
/var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
rb.0.32a6.238e1f29.00034d6a*
-rw-r--r-- 1 root root 4194304 oct.  11 17:13
rb.0.32a6.238e1f29.00034d6a__5ab_96AD1336__3   --- 
-rw-r--r-- 1 root root 4194304 nov.   6 02:25
rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40
rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44
rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52
rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39
rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45
rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59
rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25
rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18
rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24
rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24

[ceph-users] pg inconsistent : found clone without head

2013-11-25 Thread Laurent Barbe

Hello,

Since yesterday, scrub has detected an inconsistent pg :( :

# ceph health detail(ceph version 0.61.9)
HEALTH_ERR 1 pgs inconsistent; 9 scrub errors
pg 3.136 is active+clean+inconsistent, acting [9,1]
9 scrub errors

# ceph pg map 3.136
osdmap e4363 pg 3.136 (3.136) -> up [9,1] acting [9,1]

But when I try to repair, osd.9 daemon failed :

# ceph pg repair 3.136
instructing pg 3.136 on osd.9 to repair

2013-11-25 10:04:09.758845 7fc2f0706700  0 log [ERR] : 3.136 osd.9 
missing 96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3
2013-11-25 10:04:09.759862 7fc2f0706700  0 log [ERR] : repair 3.136 
96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3 found clone without head
2013-11-25 10:04:12.872908 7fc2f0706700  0 log [ERR] : 3.136 osd.9 
missing e5822336/rb.0.32a6.238e1f29.00036552/5b3//3
2013-11-25 10:04:12.873064 7fc2f0706700  0 log [ERR] : repair 3.136 
e5822336/rb.0.32a6.238e1f29.00036552/5b3//3 found clone without head
2013-11-25 10:04:14.497750 7fc2f0706700  0 log [ERR] : 3.136 osd.9 
missing 38372336/rb.0.32a6.238e1f29.00011379/5bb//3
2013-11-25 10:04:14.497796 7fc2f0706700  0 log [ERR] : repair 3.136 
38372336/rb.0.32a6.238e1f29.00011379/5bb//3 found clone without head
2013-11-25 10:04:57.557894 7fc2f0706700  0 log [ERR] : 3.136 osd.9 
missing 109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3
2013-11-25 10:04:57.558052 7fc2f0706700  0 log [ERR] : repair 3.136 
109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3 found clone without head
2013-11-25 10:17:45.835145 7fc2f0706700  0 log [ERR] : 3.136 repair stat 
mismatch, got 8289/8292 objects, 1981/1984 clones, 
26293444608/26294251520 bytes.
2013-11-25 10:17:45.835248 7fc2f0706700  0 log [ERR] : 3.136 repair 4 
missing, 0 inconsistent objects
2013-11-25 10:17:45.835320 7fc2f0706700  0 log [ERR] : 3.136 repair 9 
errors, 5 fixed
2013-11-25 10:17:45.839963 7fc2f0f07700 -1 osd/ReplicatedPG.cc: In 
function 'int ReplicatedPG::recover_primary(int)' thread 7fc2f0f07700 
time 2013-11-25 10:17:45.836790

osd/ReplicatedPG.cc: 6643: FAILED assert(latest->is_update())


The object (found clone without head) concerns the rbd image below 
(which is in use):


# rbd info datashare/share3
rbd image 'share3':
size 1024 GB in 262144 objects
order 22 (4096 KB objects)
block_name_prefix: rb.0.32a6.238e1f29
format: 1


Directory contents :
In OSD.9 (Primary) :
/var/lib/ceph/osd/ceph-9/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls 
-l rb.0.32a6.238e1f29.00034d6a*
-rw-r--r-- 1 root root 4194304 nov.   6 02:25 
rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40 
rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44 
rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52 
rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39 
rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45 
rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59 
rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25 
rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18 
rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24 
rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24 
rb.0.32a6.238e1f29.00034d6a__head_96AD1336__3


In OSD.1 (Replica) :
/var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls 
-l rb.0.32a6.238e1f29.00034d6a*
-rw-r--r-- 1 root root 4194304 oct.  11 17:13 
rb.0.32a6.238e1f29.00034d6a__5ab_96AD1336__3   --- 
-rw-r--r-- 1 root root 4194304 nov.   6 02:25 
rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   8 02:40 
rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.   9 02:44 
rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  12 02:52 
rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  14 02:39 
rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  16 02:45 
rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  19 01:59 
rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  20 02:25 
rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  22 02:18 
rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24 
rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
-rw-r--r-- 1 root root 4194304 nov.  23 02:24 
rb.0.32a6.238e1f29.00034d6a__head_96AD1336__3



The file rb.0.32a6.238e1f29.00034d6a__5ab_96AD1336__3 is only 
present on the replica on osd.1. It seems that this snapshot (5ab) no 
longer exists.


Re: [ceph-users] pg inconsistent : found clone without head

2013-11-25 Thread Gregory Farnum
On Mon, Nov 25, 2013 at 8:10 AM, Laurent Barbe laur...@ksperis.com wrote:
 Hello,

 Since yesterday, scrub has detected an inconsistent pg :( :

 # ceph health detail(ceph version 0.61.9)
 HEALTH_ERR 1 pgs inconsistent; 9 scrub errors
 pg 3.136 is active+clean+inconsistent, acting [9,1]
 9 scrub errors

 # ceph pg map 3.136
 osdmap e4363 pg 3.136 (3.136) - up [9,1] acting [9,1]

 But when I try to repair, osd.9 daemon failed :

 # ceph pg repair 3.136
 instructing pg 3.136 on osd.9 to repair

 2013-11-25 10:04:09.758845 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
 96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3
 2013-11-25 10:04:09.759862 7fc2f0706700  0 log [ERR] : repair 3.136
 96ad1336/rb.0.32a6.238e1f29.00034d6a/5ab//3 found clone without head
 2013-11-25 10:04:12.872908 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
 e5822336/rb.0.32a6.238e1f29.00036552/5b3//3
 2013-11-25 10:04:12.873064 7fc2f0706700  0 log [ERR] : repair 3.136
 e5822336/rb.0.32a6.238e1f29.00036552/5b3//3 found clone without head
 2013-11-25 10:04:14.497750 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
 38372336/rb.0.32a6.238e1f29.00011379/5bb//3
 2013-11-25 10:04:14.497796 7fc2f0706700  0 log [ERR] : repair 3.136
 38372336/rb.0.32a6.238e1f29.00011379/5bb//3 found clone without head
 2013-11-25 10:04:57.557894 7fc2f0706700  0 log [ERR] : 3.136 osd.9 missing
 109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3
 2013-11-25 10:04:57.558052 7fc2f0706700  0 log [ERR] : repair 3.136
 109b8336/rb.0.32a6.238e1f29.0003ad6b/5ab//3 found clone without head
 2013-11-25 10:17:45.835145 7fc2f0706700  0 log [ERR] : 3.136 repair stat
 mismatch, got 8289/8292 objects, 1981/1984 clones, 26293444608/26294251520
 bytes.
 2013-11-25 10:17:45.835248 7fc2f0706700  0 log [ERR] : 3.136 repair 4
 missing, 0 inconsistent objects
 2013-11-25 10:17:45.835320 7fc2f0706700  0 log [ERR] : 3.136 repair 9
 errors, 5 fixed
 2013-11-25 10:17:45.839963 7fc2f0f07700 -1 osd/ReplicatedPG.cc: In function
 'int ReplicatedPG::recover_primary(int)' thread 7fc2f0f07700 time 2013-11-25
 10:17:45.836790
 osd/ReplicatedPG.cc: 6643: FAILED assert(latest-is_update())


 The object (found clone without head) concern the rbd images below (which is
 in use) :

 # rbd info datashare/share3
 rbd image 'share3':
 size 1024 GB in 262144 objects
 order 22 (4096 KB objects)
 block_name_prefix: rb.0.32a6.238e1f29
 format: 1


 Directory contents :
 In OSD.9 (Primary) :
 /var/lib/ceph/osd/ceph-9/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
 rb.0.32a6.238e1f29.00034d6a*
 -rw-r--r-- 1 root root 4194304 nov.   6 02:25
 rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.   8 02:40
 rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.   9 02:44
 rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  12 02:52
 rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  14 02:39
 rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  16 02:45
 rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  19 01:59
 rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  20 02:25
 rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  22 02:18
 rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  23 02:24
 rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  23 02:24
 rb.0.32a6.238e1f29.00034d6a__head_96AD1336__3

 In OSD.1 (Replica) :
 /var/lib/ceph/osd/ceph-1/current/3.136_head/DIR_6/DIR_3/DIR_3/DIR_1# ls -l
 rb.0.32a6.238e1f29.00034d6a*
 -rw-r--r-- 1 root root 4194304 oct.  11 17:13
 rb.0.32a6.238e1f29.00034d6a__5ab_96AD1336__3   --- 
 -rw-r--r-- 1 root root 4194304 nov.   6 02:25
 rb.0.32a6.238e1f29.00034d6a__7ed_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.   8 02:40
 rb.0.32a6.238e1f29.00034d6a__7f5_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.   9 02:44
 rb.0.32a6.238e1f29.00034d6a__7fd_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  12 02:52
 rb.0.32a6.238e1f29.00034d6a__815_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  14 02:39
 rb.0.32a6.238e1f29.00034d6a__825_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  16 02:45
 rb.0.32a6.238e1f29.00034d6a__835_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  19 01:59
 rb.0.32a6.238e1f29.00034d6a__84d_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  20 02:25
 rb.0.32a6.238e1f29.00034d6a__855_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  22 02:18
 rb.0.32a6.238e1f29.00034d6a__865_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  23 02:24
 rb.0.32a6.238e1f29.00034d6a__86d_96AD1336__3
 -rw-r--r-- 1 root root 4194304 nov.  23 02:24
 rb.0.32a6.238e1f29.00034d6a__head_96AD1336__3


 The file