Re: [ceph-users] scrub errors on rgw data pool

2019-11-29 Thread M Ranga Swami Reddy
The primary OSD crashes with the assert below:
12.2.11/src/osd/ReplicatedBackend.cc:1445 assert(peer_missing.count(fromshard))
==
Here I have 2 OSDs with the bluestore backend and 1 OSD with the filestore backend.
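(A quick way to confirm which backend each OSD runs, as a sketch; <osd-id> is a placeholder:)

# ceph osd metadata <osd-id> | grep osd_objectstore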

On Mon, Nov 25, 2019 at 3:34 PM M Ranga Swami Reddy 
wrote:

> Hello - We are using Ceph 12.2.11 (upgraded from Jewel 10.2.12 to
> 12.2.11). In this cluster, we have a mix of filestore and bluestore OSD
> backends.
> Recently we have been seeing scrub errors on the rgw buckets.data pool every
> day, after the scrub operation performed by Ceph. If we run a PG repair, the
> errors go away.
>
> Has anyone seen the above issue?
> Does the filestore backend have a bug/issue with the 12.2.11 version (i.e.
> Luminous)?
> Does the mix of filestore and bluestore OSDs cause this type of issue?
>
> Thanks
> Swami
>


Re: [ceph-users] scrub errors on rgw data pool

2019-11-25 Thread M Ranga Swami Reddy
Thanks for the reply.
Have you migrated all OSDs from the filestore backend to the bluestore
backend?
Or have you upgraded from Luminous 12.2.11 to 14.x?

What helped here?


On Tue, Nov 26, 2019 at 8:03 AM Fyodor Ustinov  wrote:

> Hi!
>
> I had similar errors in pools on SSD until I upgraded to nautilus (clean
> bluestore installation)
>
> - Original Message -
> > From: "M Ranga Swami Reddy" 
> > To: "ceph-users" , "ceph-devel" <
> ceph-de...@vger.kernel.org>
> > Sent: Monday, 25 November, 2019 12:04:46
> > Subject: [ceph-users] scrub errors on rgw data pool
>
> > Hello - We are using Ceph 12.2.11 (upgraded from Jewel 10.2.12 to
> > 12.2.11). In this cluster, we have a mix of filestore and bluestore
> > OSD backends.
> > Recently we have been seeing scrub errors on the rgw buckets.data pool
> > every day, after the scrub operation performed by Ceph. If we run a PG
> > repair, the errors go away.
> >
> > Has anyone seen the above issue?
> > Does the filestore backend have a bug/issue with the 12.2.11 version
> > (i.e. Luminous)?
> > Does the mix of filestore and bluestore OSDs cause this type of issue?
> >
> > Thanks
> > Swami
> >


Re: [ceph-users] scrub errors on rgw data pool

2019-11-25 Thread Fyodor Ustinov
Hi!

I had similar errors in pools on SSD until I upgraded to nautilus (clean 
bluestore installation)

- Original Message -
> From: "M Ranga Swami Reddy" 
> To: "ceph-users" , "ceph-devel" 
> 
> Sent: Monday, 25 November, 2019 12:04:46
> Subject: [ceph-users] scrub errors on rgw data pool

> Hello - We are using Ceph 12.2.11 (upgraded from Jewel 10.2.12 to
> 12.2.11). In this cluster, we have a mix of filestore and bluestore OSD
> backends.
> Recently we have been seeing scrub errors on the rgw buckets.data pool every
> day, after the scrub operation performed by Ceph. If we run a PG repair,
> the errors go away.
>
> Has anyone seen the above issue?
> Does the filestore backend have a bug/issue with the 12.2.11 version (i.e.
> Luminous)?
> Does the mix of filestore and bluestore OSDs cause this type of issue?
> 
> Thanks
> Swami
> 


[ceph-users] scrub errors on rgw data pool

2019-11-25 Thread M Ranga Swami Reddy
Hello - We are using Ceph 12.2.11 (upgraded from Jewel 10.2.12 to
12.2.11). In this cluster, we have a mix of filestore and bluestore OSD
backends.
Recently we have been seeing scrub errors on the rgw buckets.data pool every
day, after the scrub operation performed by Ceph. If we run a PG repair, the
errors go away.

Has anyone seen the above issue?
Does the filestore backend have a bug/issue with the 12.2.11 version (i.e.
Luminous)?
Does the mix of filestore and bluestore OSDs cause this type of issue?

Thanks
Swami


[ceph-users] scrub errors because of missing shards on luminous

2019-09-19 Thread Mattia Belluco
Dear ml,

we are currently trying to wrap our heads around a HEALTH_ERR problem on
our Luminous 12.2.12 cluster (upgraded from Jewel a couple of weeks
ago). Before attempting a 'ceph pg repair' we would like to have a
better understanding of what has happened.

ceph -s reports:

cluster:
id: 7705608d-cbef-477a-865d-f5ae4c03370a
health: HEALTH_ERR
14 scrub errors
Possible data damage: 5 pgs inconsistent

  services:
mon: 5 daemons, quorum mon-k3-38,mon-k5-41,mon-l2-40,mon-l7-40,mon-k4-32
mgr: mon-l2-40(active), standbys: mon-l7-40, mon-k3-38, mon-k5-41,
mon-k4-32
osd: 1332 osds: 1332 up, 1332 in

  data:
pools:   4 pools, 49160 pgs
objects: 342.85M objects, 1.23PiB
usage:   3.73PiB used, 1.39PiB / 5.11PiB avail
pgs: 49112 active+clean
     34    active+clean+scrubbing+deep
     9     active+clean+scrubbing
     5     active+clean+inconsistent

  io:
client:   450MiB/s rd, 411MiB/s wr, 5.10kop/s rd, 2.33kop/s wr


It all seems to have originated from the deletion of an old RBD image
snapshot (cinder volume snapshot) that caused:
- an unusual load spike on most OSD nodes.
- a number of OSDs being concurrently marked down by the mons "despite being
running".

We are now observing an increasing number of scrub errors as
deep-scrubbing progresses, with all inconsistent PGs belonging to the
same pool and all the missing objects belonging to the same RBD image
whose snapshot was deleted.

Attempting to troubleshoot the situation, we realized we don't have a
clear understanding of how an image with snapshots evolves over time:
an RBD image has a preset number of objects, but once a snapshot has been
taken and the image starts to diverge from the snapshot, more objects
will eventually be needed, correct?
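(As a rough sanity check, a back-of-the-envelope sketch only, assuming the default 4MB object size; the image prefix below is the one from the attached report and the pool name is a placeholder:)

echo $((5 * 1024 * 1024 / 4))   # ~1310720 base data objects for a 5TiB image in 4MiB chunks
rados -p <pool> ls | grep -c 'rbd_data.20c737083e5fdc6e'   # objects actually present for that prefix (slow on a big pool)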

I am attaching the output of 'rados list-inconsistent-obj' run against
the inconsistent PGs.

We can see two kinds of errors:
- "union_shard_errors": "missing"
- "errors": "object_info_inconsistency", "snapset_inconsistency"

The pool has size=3 and min_size=2; the image is 5TB with 4MB objects.

Has anyone experienced a similar issue? I could not find anything
relevant in the issue tracker but I'll be happy to open a case if this
turns out to be a bug.

Thanks in advance for any hints,

Kind regards,
Mattia Belluco
{
"epoch": 381704,
"inconsistents": [
{
"object": {
"name": "rbd_data.20c737083e5fdc6e.000dc733",
"nspace": "",
"locator": "",
"snap": 16970,
"version": 4532671
},
"errors": [],
"union_shard_errors": [
"missing"
],
"selected_object_info": {
"oid": {
"oid": "rbd_data.20c737083e5fdc6e.000dc733",
"key": "",
"snapid": 16970,
"hash": 3945368553,
"max": 0,
"pool": 36,
"namespace": ""
},
"version": "326149'4861369",
"prior_version": "262931'4535183",
"last_reqid": "osd.987.0:56883287",
"user_version": 4532671,
"size": 4194304,
"mtime": "2019-03-06 16:04:40.619294",
"local_mtime": "2019-03-06 16:04:40.630542",
"lost": 0,
"flags": [
"dirty",
"data_digest",
"omap_digest"
],
"legacy_snaps": [
16970
],
"truncate_seq": 0,
"truncate_size": 0,
"data_digest": "0x1b5e4a89",
"omap_digest": "0x",
"expected_object_size": 0,
"expected_write_size": 0,
"alloc_hint_flags": 0,
"manifest": {
"type": 0,
"redirect_target": {
"oid": "",
"key": "",
"snapid": 0,
"hash": 0,
"max": 0,
"pool": -9223372036854775808,
"namespace": ""
}
},
"watchers": {}
},
"shards": [
{
"osd": 190,
"primary": false,
"errors": [],
"size": 4194304
},
{
"osd": 1254,
"primary": true,
"errors": [
"missing"
]
},
{
"osd": 1317,
"primary": false,
"errors": [],
"size": 4194304

Re: [ceph-users] scrub errors

2019-03-28 Thread Brad Hubbard
On Fri, Mar 29, 2019 at 7:54 AM solarflow99  wrote:
>
> ok, I tried doing ceph osd out on each of the 4 OSDs 1 by 1.  I got it out of 
> backfill mode but still not sure if it'll fix anything.  pg 10.2a still shows 
> state active+clean+inconsistent.  Peer 8  is now 
> remapped+inconsistent+peering, and the other peer is active+clean+inconsistent

Per the document I linked previously, if a pg remains remapped you
likely have a problem with your configuration. Take a good look at
your crush map, pg distribution, pool configuration, etc.
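(The usual places to start, as a sketch; adjust for your release:)

# ceph osd tree
# ceph osd crush rule dump
# ceph osd pool ls detail
# ceph pg dump pgs_brief | grep remapped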

>
>
> On Wed, Mar 27, 2019 at 4:13 PM Brad Hubbard  wrote:
>>
>> On Thu, Mar 28, 2019 at 8:33 AM solarflow99  wrote:
>> >
>> > yes, but nothing seems to happen.  I don't understand why it lists OSDs 7 
>> > in the  "recovery_state": when i'm only using 3 replicas and it seems to 
>> > use 41,38,8
>>
>> Well, osd 8s state is listed as
>> "active+undersized+degraded+remapped+wait_backfill" so it seems to be
>> stuck waiting for backfill for some reason. One thing you could try is
>> restarting all of the osds including 7 and 17 to see if forcing them
>> to peer again has any positive effect. Don't restart them all at once,
>> just one at a time waiting until each has peered before moving on.
>>
>> >
>> > # ceph health detail
>> > HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
>> > pg 10.2a is active+clean+inconsistent, acting [41,38,8]
>> > 47 scrub errors
>> >
>> >
>> >
>> > As you can see all OSDs are up and in:
>> >
>> > # ceph osd stat
>> >  osdmap e23265: 49 osds: 49 up, 49 in
>> >
>> >
>> >
>> >
>> > And this just stays the same:
>> >
>> > "up": [
>> > 41,
>> > 38,
>> > 8
>> > ],
>> > "acting": [
>> > 41,
>> > 38,
>> > 8
>> >
>> >  "recovery_state": [
>> > {
>> > "name": "Started\/Primary\/Active",
>> > "enter_time": "2018-09-22 07:07:48.637248",
>> > "might_have_unfound": [
>> > {
>> > "osd": "7",
>> > "status": "not queried"
>> > },
>> > {
>> > "osd": "8",
>> > "status": "already probed"
>> > },
>> > {
>> > "osd": "17",
>> > "status": "not queried"
>> > },
>> > {
>> > "osd": "38",
>> > "status": "already probed"
>> > }
>> > ],
>> >
>> >
>> > On Tue, Mar 26, 2019 at 4:53 PM Brad Hubbard  wrote:
>> >>
>> >> http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
>> >>
>> >> Did you try repairing the pg?
>> >>
>> >>
>> >> On Tue, Mar 26, 2019 at 9:08 AM solarflow99  wrote:
>> >> >
>> >> > yes, I know its old.  I intend to have it replaced but thats a few 
>> >> > months away and was hoping to get past this.  the other OSDs appear to 
>> >> > be ok, I see them up and in, why do you see something wrong?
>> >> >
>> >> > On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard  
>> >> > wrote:
>> >> >>
>> >> >> Hammer is no longer supported.
>> >> >>
>> >> >> What's the status of osds 7 and 17?
>> >> >>
>> >> >> On Tue, Mar 26, 2019 at 8:56 AM solarflow99  
>> >> >> wrote:
>> >> >> >
>> >> >> > hi, thanks.  Its still using Hammer.  Here's the output from the pg 
>> >> >> > query, the last command you gave doesn't work at all but be too old.
>> >> >> >
>> >> >> >
>> >> >> > # ceph pg 10.2a query
>> >> >> > {
>> >> >> > "state": "active+clean+inconsistent",
>> >> >> > "snap_trimq": "[]",
>> >> >> > "epoch": 23265,
>> >> >> > "up": [
>> >> >> > 41,
>> >> >> > 38,
>> >> >> > 8
>> >> >> > ],
>> >> >> > "acting": [
>> >> >> > 41,
>> >> >> > 38,
>> >> >> > 8
>> >> >> > ],
>> >> >> > "actingbackfill": [
>> >> >> > "8",
>> >> >> > "38",
>> >> >> > "41"
>> >> >> > ],
>> >> >> > "info": {
>> >> >> > "pgid": "10.2a",
>> >> >> > "last_update": "23265'20886859",
>> >> >> > "last_complete": "23265'20886859",
>> >> >> > "log_tail": "23265'20883809",
>> >> >> > "last_user_version": 20886859,
>> >> >> > "last_backfill": "MAX",
>> >> >> > "purged_snaps": "[]",
>> >> >> > "history": {
>> >> >> > "epoch_created": 8200,
>> >> >> > "last_epoch_started": 21481,
>> >> >> > "last_epoch_clean": 21487,
>> >> >> > "last_epoch_split": 0,
>> >> >> > "same_up_since": 21472,
>> >> >> > "same_interval_since": 21474,
>> >> >> > "same_primary_since": 8244,
>> >> >> > "last_scrub": "23265'20864209",
>> >> >> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
>> >> >> > "last_deep_scrub": "23265'20864209",
>> >> >> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
>> >> >> > 

Re: [ceph-users] scrub errors

2019-03-28 Thread solarflow99
OK, I tried doing ceph osd out on each of the 4 OSDs one by one.  I got it out
of backfill mode, but I'm still not sure if it'll fix anything.  pg 10.2a still
shows state active+clean+inconsistent.  Peer 8 is now
remapped+inconsistent+peering, and the other peer is
active+clean+inconsistent.


On Wed, Mar 27, 2019 at 4:13 PM Brad Hubbard  wrote:

> On Thu, Mar 28, 2019 at 8:33 AM solarflow99  wrote:
> >
> > yes, but nothing seems to happen.  I don't understand why it lists OSDs
> 7 in the  "recovery_state": when i'm only using 3 replicas and it seems to
> use 41,38,8
>
> Well, osd 8s state is listed as
> "active+undersized+degraded+remapped+wait_backfill" so it seems to be
> stuck waiting for backfill for some reason. One thing you could try is
> restarting all of the osds including 7 and 17 to see if forcing them
> to peer again has any positive effect. Don't restart them all at once,
> just one at a time waiting until each has peered before moving on.
>
> >
> > # ceph health detail
> > HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
> > pg 10.2a is active+clean+inconsistent, acting [41,38,8]
> > 47 scrub errors
> >
> >
> >
> > As you can see all OSDs are up and in:
> >
> > # ceph osd stat
> >  osdmap e23265: 49 osds: 49 up, 49 in
> >
> >
> >
> >
> > And this just stays the same:
> >
> > "up": [
> > 41,
> > 38,
> > 8
> > ],
> > "acting": [
> > 41,
> > 38,
> > 8
> >
> >  "recovery_state": [
> > {
> > "name": "Started\/Primary\/Active",
> > "enter_time": "2018-09-22 07:07:48.637248",
> > "might_have_unfound": [
> > {
> > "osd": "7",
> > "status": "not queried"
> > },
> > {
> > "osd": "8",
> > "status": "already probed"
> > },
> > {
> > "osd": "17",
> > "status": "not queried"
> > },
> > {
> > "osd": "38",
> > "status": "already probed"
> > }
> > ],
> >
> >
> > On Tue, Mar 26, 2019 at 4:53 PM Brad Hubbard 
> wrote:
> >>
> >>
> http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
> >>
> >> Did you try repairing the pg?
> >>
> >>
> >> On Tue, Mar 26, 2019 at 9:08 AM solarflow99 
> wrote:
> >> >
> >> > yes, I know its old.  I intend to have it replaced but thats a few
> months away and was hoping to get past this.  the other OSDs appear to be
> ok, I see them up and in, why do you see something wrong?
> >> >
> >> > On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard 
> wrote:
> >> >>
> >> >> Hammer is no longer supported.
> >> >>
> >> >> What's the status of osds 7 and 17?
> >> >>
> >> >> On Tue, Mar 26, 2019 at 8:56 AM solarflow99 
> wrote:
> >> >> >
> >> >> > hi, thanks.  Its still using Hammer.  Here's the output from the
> pg query, the last command you gave doesn't work at all but be too old.
> >> >> >
> >> >> >
> >> >> > # ceph pg 10.2a query
> >> >> > {
> >> >> > "state": "active+clean+inconsistent",
> >> >> > "snap_trimq": "[]",
> >> >> > "epoch": 23265,
> >> >> > "up": [
> >> >> > 41,
> >> >> > 38,
> >> >> > 8
> >> >> > ],
> >> >> > "acting": [
> >> >> > 41,
> >> >> > 38,
> >> >> > 8
> >> >> > ],
> >> >> > "actingbackfill": [
> >> >> > "8",
> >> >> > "38",
> >> >> > "41"
> >> >> > ],
> >> >> > "info": {
> >> >> > "pgid": "10.2a",
> >> >> > "last_update": "23265'20886859",
> >> >> > "last_complete": "23265'20886859",
> >> >> > "log_tail": "23265'20883809",
> >> >> > "last_user_version": 20886859,
> >> >> > "last_backfill": "MAX",
> >> >> > "purged_snaps": "[]",
> >> >> > "history": {
> >> >> > "epoch_created": 8200,
> >> >> > "last_epoch_started": 21481,
> >> >> > "last_epoch_clean": 21487,
> >> >> > "last_epoch_split": 0,
> >> >> > "same_up_since": 21472,
> >> >> > "same_interval_since": 21474,
> >> >> > "same_primary_since": 8244,
> >> >> > "last_scrub": "23265'20864209",
> >> >> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> >> > "last_deep_scrub": "23265'20864209",
> >> >> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> >> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
> >> >> > },
> >> >> > "stats": {
> >> >> > "version": "23265'20886859",
> >> >> > "reported_seq": "10109937",
> >> >> > "reported_epoch": "23265",
> >> >> > "state": "active+clean+inconsistent",
> >> >> > "last_fresh": "2019-03-25 15:52:53.720768",
> >> >> > "last_change": "2019-03-22 

Re: [ceph-users] scrub errors

2019-03-27 Thread Brad Hubbard
On Thu, Mar 28, 2019 at 8:33 AM solarflow99  wrote:
>
> yes, but nothing seems to happen.  I don't understand why it lists OSDs 7 in 
> the  "recovery_state": when i'm only using 3 replicas and it seems to use 
> 41,38,8

Well, osd.8's state is listed as
"active+undersized+degraded+remapped+wait_backfill", so it seems to be
stuck waiting for backfill for some reason. One thing you could try is
restarting all of the OSDs, including 7 and 17, to see if forcing them
to peer again has any positive effect. Don't restart them all at once,
just one at a time, waiting until each has peered before moving on.
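Something along these lines, as a sketch only; it assumes systemd-managed OSDs (substitute your init script on Hammer) and checks for peering between restarts:

for id in 41 38 8 7 17; do
    systemctl restart ceph-osd@$id
    # wait until no PGs report peering before touching the next OSD
    while ceph pg stat | grep -q peering; do sleep 10; done
done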

>
> # ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
> pg 10.2a is active+clean+inconsistent, acting [41,38,8]
> 47 scrub errors
>
>
>
> As you can see all OSDs are up and in:
>
> # ceph osd stat
>  osdmap e23265: 49 osds: 49 up, 49 in
>
>
>
>
> And this just stays the same:
>
> "up": [
> 41,
> 38,
> 8
> ],
> "acting": [
> 41,
> 38,
> 8
>
>  "recovery_state": [
> {
> "name": "Started\/Primary\/Active",
> "enter_time": "2018-09-22 07:07:48.637248",
> "might_have_unfound": [
> {
> "osd": "7",
> "status": "not queried"
> },
> {
> "osd": "8",
> "status": "already probed"
> },
> {
> "osd": "17",
> "status": "not queried"
> },
> {
> "osd": "38",
> "status": "already probed"
> }
> ],
>
>
> On Tue, Mar 26, 2019 at 4:53 PM Brad Hubbard  wrote:
>>
>> http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
>>
>> Did you try repairing the pg?
>>
>>
>> On Tue, Mar 26, 2019 at 9:08 AM solarflow99  wrote:
>> >
>> > yes, I know its old.  I intend to have it replaced but thats a few months 
>> > away and was hoping to get past this.  the other OSDs appear to be ok, I 
>> > see them up and in, why do you see something wrong?
>> >
>> > On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard  wrote:
>> >>
>> >> Hammer is no longer supported.
>> >>
>> >> What's the status of osds 7 and 17?
>> >>
>> >> On Tue, Mar 26, 2019 at 8:56 AM solarflow99  wrote:
>> >> >
>> >> > hi, thanks.  Its still using Hammer.  Here's the output from the pg 
>> >> > query, the last command you gave doesn't work at all but be too old.
>> >> >
>> >> >
>> >> > # ceph pg 10.2a query
>> >> > {
>> >> > "state": "active+clean+inconsistent",
>> >> > "snap_trimq": "[]",
>> >> > "epoch": 23265,
>> >> > "up": [
>> >> > 41,
>> >> > 38,
>> >> > 8
>> >> > ],
>> >> > "acting": [
>> >> > 41,
>> >> > 38,
>> >> > 8
>> >> > ],
>> >> > "actingbackfill": [
>> >> > "8",
>> >> > "38",
>> >> > "41"
>> >> > ],
>> >> > "info": {
>> >> > "pgid": "10.2a",
>> >> > "last_update": "23265'20886859",
>> >> > "last_complete": "23265'20886859",
>> >> > "log_tail": "23265'20883809",
>> >> > "last_user_version": 20886859,
>> >> > "last_backfill": "MAX",
>> >> > "purged_snaps": "[]",
>> >> > "history": {
>> >> > "epoch_created": 8200,
>> >> > "last_epoch_started": 21481,
>> >> > "last_epoch_clean": 21487,
>> >> > "last_epoch_split": 0,
>> >> > "same_up_since": 21472,
>> >> > "same_interval_since": 21474,
>> >> > "same_primary_since": 8244,
>> >> > "last_scrub": "23265'20864209",
>> >> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
>> >> > "last_deep_scrub": "23265'20864209",
>> >> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
>> >> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
>> >> > },
>> >> > "stats": {
>> >> > "version": "23265'20886859",
>> >> > "reported_seq": "10109937",
>> >> > "reported_epoch": "23265",
>> >> > "state": "active+clean+inconsistent",
>> >> > "last_fresh": "2019-03-25 15:52:53.720768",
>> >> > "last_change": "2019-03-22 22:39:13.931038",
>> >> > "last_active": "2019-03-25 15:52:53.720768",
>> >> > "last_peered": "2019-03-25 15:52:53.720768",
>> >> > "last_clean": "2019-03-25 15:52:53.720768",
>> >> > "last_became_active": "0.00",
>> >> > "last_became_peered": "0.00",
>> >> > "last_unstale": "2019-03-25 15:52:53.720768",
>> >> > "last_undegraded": "2019-03-25 15:52:53.720768",
>> >> > "last_fullsized": "2019-03-25 15:52:53.720768",
>> >> > "mapping_epoch": 21472,
>> >> > "log_start": 

Re: [ceph-users] scrub errors

2019-03-27 Thread solarflow99
Yes, but nothing seems to happen.  I don't understand why it lists OSD 7
in the "recovery_state" when I'm only using 3 replicas and it seems to
use 41,38,8.

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
pg 10.2a is active+clean+inconsistent, acting [41,38,8]
47 scrub errors



As you can see all OSDs are up and in:

# ceph osd stat
 osdmap e23265: 49 osds: 49 up, 49 in




And this just stays the same:

"up": [
41,
38,
8
],
"acting": [
41,
38,
8

 "recovery_state": [
{
"name": "Started\/Primary\/Active",
"enter_time": "2018-09-22 07:07:48.637248",
"might_have_unfound": [
{
"osd": "7",
"status": "not queried"
},
{
"osd": "8",
"status": "already probed"
},
{
"osd": "17",
"status": "not queried"
},
{
"osd": "38",
"status": "already probed"
}
],


On Tue, Mar 26, 2019 at 4:53 PM Brad Hubbard  wrote:

> http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/
>
> Did you try repairing the pg?
>
>
> On Tue, Mar 26, 2019 at 9:08 AM solarflow99  wrote:
> >
> > yes, I know its old.  I intend to have it replaced but thats a few
> months away and was hoping to get past this.  the other OSDs appear to be
> ok, I see them up and in, why do you see something wrong?
> >
> > On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard 
> wrote:
> >>
> >> Hammer is no longer supported.
> >>
> >> What's the status of osds 7 and 17?
> >>
> >> On Tue, Mar 26, 2019 at 8:56 AM solarflow99 
> wrote:
> >> >
> >> > hi, thanks.  Its still using Hammer.  Here's the output from the pg
> query, the last command you gave doesn't work at all but be too old.
> >> >
> >> >
> >> > # ceph pg 10.2a query
> >> > {
> >> > "state": "active+clean+inconsistent",
> >> > "snap_trimq": "[]",
> >> > "epoch": 23265,
> >> > "up": [
> >> > 41,
> >> > 38,
> >> > 8
> >> > ],
> >> > "acting": [
> >> > 41,
> >> > 38,
> >> > 8
> >> > ],
> >> > "actingbackfill": [
> >> > "8",
> >> > "38",
> >> > "41"
> >> > ],
> >> > "info": {
> >> > "pgid": "10.2a",
> >> > "last_update": "23265'20886859",
> >> > "last_complete": "23265'20886859",
> >> > "log_tail": "23265'20883809",
> >> > "last_user_version": 20886859,
> >> > "last_backfill": "MAX",
> >> > "purged_snaps": "[]",
> >> > "history": {
> >> > "epoch_created": 8200,
> >> > "last_epoch_started": 21481,
> >> > "last_epoch_clean": 21487,
> >> > "last_epoch_split": 0,
> >> > "same_up_since": 21472,
> >> > "same_interval_since": 21474,
> >> > "same_primary_since": 8244,
> >> > "last_scrub": "23265'20864209",
> >> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> > "last_deep_scrub": "23265'20864209",
> >> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
> >> > },
> >> > "stats": {
> >> > "version": "23265'20886859",
> >> > "reported_seq": "10109937",
> >> > "reported_epoch": "23265",
> >> > "state": "active+clean+inconsistent",
> >> > "last_fresh": "2019-03-25 15:52:53.720768",
> >> > "last_change": "2019-03-22 22:39:13.931038",
> >> > "last_active": "2019-03-25 15:52:53.720768",
> >> > "last_peered": "2019-03-25 15:52:53.720768",
> >> > "last_clean": "2019-03-25 15:52:53.720768",
> >> > "last_became_active": "0.00",
> >> > "last_became_peered": "0.00",
> >> > "last_unstale": "2019-03-25 15:52:53.720768",
> >> > "last_undegraded": "2019-03-25 15:52:53.720768",
> >> > "last_fullsized": "2019-03-25 15:52:53.720768",
> >> > "mapping_epoch": 21472,
> >> > "log_start": "23265'20883809",
> >> > "ondisk_log_start": "23265'20883809",
> >> > "created": 8200,
> >> > "last_epoch_clean": 21487,
> >> > "parent": "0.0",
> >> > "parent_split_bits": 0,
> >> > "last_scrub": "23265'20864209",
> >> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> > "last_deep_scrub": "23265'20864209",
> >> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> >> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438",
> >> > "log_size": 3050,
> >> > "ondisk_log_size": 

Re: [ceph-users] scrub errors

2019-03-26 Thread Brad Hubbard
http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/

Did you try repairing the pg?


On Tue, Mar 26, 2019 at 9:08 AM solarflow99  wrote:
>
> yes, I know its old.  I intend to have it replaced but thats a few months 
> away and was hoping to get past this.  the other OSDs appear to be ok, I see 
> them up and in, why do you see something wrong?
>
> On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard  wrote:
>>
>> Hammer is no longer supported.
>>
>> What's the status of osds 7 and 17?
>>
>> On Tue, Mar 26, 2019 at 8:56 AM solarflow99  wrote:
>> >
>> > hi, thanks.  Its still using Hammer.  Here's the output from the pg query, 
>> > the last command you gave doesn't work at all but be too old.
>> >
>> >
>> > # ceph pg 10.2a query
>> > {
>> > "state": "active+clean+inconsistent",
>> > "snap_trimq": "[]",
>> > "epoch": 23265,
>> > "up": [
>> > 41,
>> > 38,
>> > 8
>> > ],
>> > "acting": [
>> > 41,
>> > 38,
>> > 8
>> > ],
>> > "actingbackfill": [
>> > "8",
>> > "38",
>> > "41"
>> > ],
>> > "info": {
>> > "pgid": "10.2a",
>> > "last_update": "23265'20886859",
>> > "last_complete": "23265'20886859",
>> > "log_tail": "23265'20883809",
>> > "last_user_version": 20886859,
>> > "last_backfill": "MAX",
>> > "purged_snaps": "[]",
>> > "history": {
>> > "epoch_created": 8200,
>> > "last_epoch_started": 21481,
>> > "last_epoch_clean": 21487,
>> > "last_epoch_split": 0,
>> > "same_up_since": 21472,
>> > "same_interval_since": 21474,
>> > "same_primary_since": 8244,
>> > "last_scrub": "23265'20864209",
>> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
>> > "last_deep_scrub": "23265'20864209",
>> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
>> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
>> > },
>> > "stats": {
>> > "version": "23265'20886859",
>> > "reported_seq": "10109937",
>> > "reported_epoch": "23265",
>> > "state": "active+clean+inconsistent",
>> > "last_fresh": "2019-03-25 15:52:53.720768",
>> > "last_change": "2019-03-22 22:39:13.931038",
>> > "last_active": "2019-03-25 15:52:53.720768",
>> > "last_peered": "2019-03-25 15:52:53.720768",
>> > "last_clean": "2019-03-25 15:52:53.720768",
>> > "last_became_active": "0.00",
>> > "last_became_peered": "0.00",
>> > "last_unstale": "2019-03-25 15:52:53.720768",
>> > "last_undegraded": "2019-03-25 15:52:53.720768",
>> > "last_fullsized": "2019-03-25 15:52:53.720768",
>> > "mapping_epoch": 21472,
>> > "log_start": "23265'20883809",
>> > "ondisk_log_start": "23265'20883809",
>> > "created": 8200,
>> > "last_epoch_clean": 21487,
>> > "parent": "0.0",
>> > "parent_split_bits": 0,
>> > "last_scrub": "23265'20864209",
>> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
>> > "last_deep_scrub": "23265'20864209",
>> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
>> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438",
>> > "log_size": 3050,
>> > "ondisk_log_size": 3050,
>> > "stats_invalid": "0",
>> > "stat_sum": {
>> > "num_bytes": 8220278746,
>> > "num_objects": 345034,
>> > "num_object_clones": 0,
>> > "num_object_copies": 1035102,
>> > "num_objects_missing_on_primary": 0,
>> > "num_objects_degraded": 0,
>> > "num_objects_misplaced": 0,
>> > "num_objects_unfound": 0,
>> > "num_objects_dirty": 345034,
>> > "num_whiteouts": 0,
>> > "num_read": 7904350,
>> > "num_read_kb": 58116568,
>> > "num_write": 8753504,
>> > "num_write_kb": 85104263,
>> > "num_scrub_errors": 47,
>> > "num_shallow_scrub_errors": 47,
>> > "num_deep_scrub_errors": 0,
>> > "num_objects_recovered": 167138,
>> > "num_bytes_recovered": 5193543924,
>> > "num_keys_recovered": 0,
>> > "num_objects_omap": 0,
>> > "num_objects_hit_set_archive": 0,
>> > "num_bytes_hit_set_archive": 0
>> > },
>> > "up": [
>> > 41,
>> > 38,
>> > 8
>> > ],
>> > "acting": [
>> > 41,
>> > 38,
>> >

Re: [ceph-users] scrub errors

2019-03-25 Thread solarflow99
Yes, I know it's old.  I intend to have it replaced, but that's a few months
away and I was hoping to get past this.  The other OSDs appear to be OK; I
see them up and in. Why, do you see something wrong?

On Mon, Mar 25, 2019 at 4:00 PM Brad Hubbard  wrote:

> Hammer is no longer supported.
>
> What's the status of osds 7 and 17?
>
> On Tue, Mar 26, 2019 at 8:56 AM solarflow99  wrote:
> >
> > hi, thanks.  Its still using Hammer.  Here's the output from the pg
> query, the last command you gave doesn't work at all but be too old.
> >
> >
> > # ceph pg 10.2a query
> > {
> > "state": "active+clean+inconsistent",
> > "snap_trimq": "[]",
> > "epoch": 23265,
> > "up": [
> > 41,
> > 38,
> > 8
> > ],
> > "acting": [
> > 41,
> > 38,
> > 8
> > ],
> > "actingbackfill": [
> > "8",
> > "38",
> > "41"
> > ],
> > "info": {
> > "pgid": "10.2a",
> > "last_update": "23265'20886859",
> > "last_complete": "23265'20886859",
> > "log_tail": "23265'20883809",
> > "last_user_version": 20886859,
> > "last_backfill": "MAX",
> > "purged_snaps": "[]",
> > "history": {
> > "epoch_created": 8200,
> > "last_epoch_started": 21481,
> > "last_epoch_clean": 21487,
> > "last_epoch_split": 0,
> > "same_up_since": 21472,
> > "same_interval_since": 21474,
> > "same_primary_since": 8244,
> > "last_scrub": "23265'20864209",
> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> > "last_deep_scrub": "23265'20864209",
> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
> > },
> > "stats": {
> > "version": "23265'20886859",
> > "reported_seq": "10109937",
> > "reported_epoch": "23265",
> > "state": "active+clean+inconsistent",
> > "last_fresh": "2019-03-25 15:52:53.720768",
> > "last_change": "2019-03-22 22:39:13.931038",
> > "last_active": "2019-03-25 15:52:53.720768",
> > "last_peered": "2019-03-25 15:52:53.720768",
> > "last_clean": "2019-03-25 15:52:53.720768",
> > "last_became_active": "0.00",
> > "last_became_peered": "0.00",
> > "last_unstale": "2019-03-25 15:52:53.720768",
> > "last_undegraded": "2019-03-25 15:52:53.720768",
> > "last_fullsized": "2019-03-25 15:52:53.720768",
> > "mapping_epoch": 21472,
> > "log_start": "23265'20883809",
> > "ondisk_log_start": "23265'20883809",
> > "created": 8200,
> > "last_epoch_clean": 21487,
> > "parent": "0.0",
> > "parent_split_bits": 0,
> > "last_scrub": "23265'20864209",
> > "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> > "last_deep_scrub": "23265'20864209",
> > "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> > "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438",
> > "log_size": 3050,
> > "ondisk_log_size": 3050,
> > "stats_invalid": "0",
> > "stat_sum": {
> > "num_bytes": 8220278746,
> > "num_objects": 345034,
> > "num_object_clones": 0,
> > "num_object_copies": 1035102,
> > "num_objects_missing_on_primary": 0,
> > "num_objects_degraded": 0,
> > "num_objects_misplaced": 0,
> > "num_objects_unfound": 0,
> > "num_objects_dirty": 345034,
> > "num_whiteouts": 0,
> > "num_read": 7904350,
> > "num_read_kb": 58116568,
> > "num_write": 8753504,
> > "num_write_kb": 85104263,
> > "num_scrub_errors": 47,
> > "num_shallow_scrub_errors": 47,
> > "num_deep_scrub_errors": 0,
> > "num_objects_recovered": 167138,
> > "num_bytes_recovered": 5193543924,
> > "num_keys_recovered": 0,
> > "num_objects_omap": 0,
> > "num_objects_hit_set_archive": 0,
> > "num_bytes_hit_set_archive": 0
> > },
> > "up": [
> > 41,
> > 38,
> > 8
> > ],
> > "acting": [
> > 41,
> > 38,
> > 8
> > ],
> > "blocked_by": [],
> > "up_primary": 41,
> > "acting_primary": 41
> > },
> > "empty": 0,
> > "dne": 0,
> > "incomplete": 0,
> > "last_epoch_started": 21481,
> > "hit_set_history": 

Re: [ceph-users] scrub errors

2019-03-25 Thread Brad Hubbard
Hammer is no longer supported.

What's the status of osds 7 and 17?
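A quick way to check, as a sketch:

# ceph osd tree
# ceph osd dump | grep -E '^osd\.(7|17) '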

On Tue, Mar 26, 2019 at 8:56 AM solarflow99  wrote:
>
> hi, thanks.  Its still using Hammer.  Here's the output from the pg query, 
> the last command you gave doesn't work at all but be too old.
>
>
> # ceph pg 10.2a query
> {
> "state": "active+clean+inconsistent",
> "snap_trimq": "[]",
> "epoch": 23265,
> "up": [
> 41,
> 38,
> 8
> ],
> "acting": [
> 41,
> 38,
> 8
> ],
> "actingbackfill": [
> "8",
> "38",
> "41"
> ],
> "info": {
> "pgid": "10.2a",
> "last_update": "23265'20886859",
> "last_complete": "23265'20886859",
> "log_tail": "23265'20883809",
> "last_user_version": 20886859,
> "last_backfill": "MAX",
> "purged_snaps": "[]",
> "history": {
> "epoch_created": 8200,
> "last_epoch_started": 21481,
> "last_epoch_clean": 21487,
> "last_epoch_split": 0,
> "same_up_since": 21472,
> "same_interval_since": 21474,
> "same_primary_since": 8244,
> "last_scrub": "23265'20864209",
> "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> "last_deep_scrub": "23265'20864209",
> "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
> },
> "stats": {
> "version": "23265'20886859",
> "reported_seq": "10109937",
> "reported_epoch": "23265",
> "state": "active+clean+inconsistent",
> "last_fresh": "2019-03-25 15:52:53.720768",
> "last_change": "2019-03-22 22:39:13.931038",
> "last_active": "2019-03-25 15:52:53.720768",
> "last_peered": "2019-03-25 15:52:53.720768",
> "last_clean": "2019-03-25 15:52:53.720768",
> "last_became_active": "0.00",
> "last_became_peered": "0.00",
> "last_unstale": "2019-03-25 15:52:53.720768",
> "last_undegraded": "2019-03-25 15:52:53.720768",
> "last_fullsized": "2019-03-25 15:52:53.720768",
> "mapping_epoch": 21472,
> "log_start": "23265'20883809",
> "ondisk_log_start": "23265'20883809",
> "created": 8200,
> "last_epoch_clean": 21487,
> "parent": "0.0",
> "parent_split_bits": 0,
> "last_scrub": "23265'20864209",
> "last_scrub_stamp": "2019-03-22 22:39:13.930673",
> "last_deep_scrub": "23265'20864209",
> "last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
> "last_clean_scrub_stamp": "2019-03-15 01:33:21.447438",
> "log_size": 3050,
> "ondisk_log_size": 3050,
> "stats_invalid": "0",
> "stat_sum": {
> "num_bytes": 8220278746,
> "num_objects": 345034,
> "num_object_clones": 0,
> "num_object_copies": 1035102,
> "num_objects_missing_on_primary": 0,
> "num_objects_degraded": 0,
> "num_objects_misplaced": 0,
> "num_objects_unfound": 0,
> "num_objects_dirty": 345034,
> "num_whiteouts": 0,
> "num_read": 7904350,
> "num_read_kb": 58116568,
> "num_write": 8753504,
> "num_write_kb": 85104263,
> "num_scrub_errors": 47,
> "num_shallow_scrub_errors": 47,
> "num_deep_scrub_errors": 0,
> "num_objects_recovered": 167138,
> "num_bytes_recovered": 5193543924,
> "num_keys_recovered": 0,
> "num_objects_omap": 0,
> "num_objects_hit_set_archive": 0,
> "num_bytes_hit_set_archive": 0
> },
> "up": [
> 41,
> 38,
> 8
> ],
> "acting": [
> 41,
> 38,
> 8
> ],
> "blocked_by": [],
> "up_primary": 41,
> "acting_primary": 41
> },
> "empty": 0,
> "dne": 0,
> "incomplete": 0,
> "last_epoch_started": 21481,
> "hit_set_history": {
> "current_last_update": "0'0",
> "current_last_stamp": "0.00",
> "current_info": {
> "begin": "0.00",
> "end": "0.00",
> "version": "0'0",
> "using_gmt": "0"
> },
> "history": []
> }
> },
> "peer_info": [
> {
> "peer": "8",
> "pgid": "10.2a",
> "last_update": "23265'20886859",
> 

Re: [ceph-users] scrub errors

2019-03-25 Thread solarflow99
Hi, thanks.  It's still using Hammer.  Here's the output from the pg query;
the last command you gave doesn't work at all, probably because this version is too old.


# ceph pg 10.2a query
{
"state": "active+clean+inconsistent",
"snap_trimq": "[]",
"epoch": 23265,
"up": [
41,
38,
8
],
"acting": [
41,
38,
8
],
"actingbackfill": [
"8",
"38",
"41"
],
"info": {
"pgid": "10.2a",
"last_update": "23265'20886859",
"last_complete": "23265'20886859",
"log_tail": "23265'20883809",
"last_user_version": 20886859,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 8200,
"last_epoch_started": 21481,
"last_epoch_clean": 21487,
"last_epoch_split": 0,
"same_up_since": 21472,
"same_interval_since": 21474,
"same_primary_since": 8244,
"last_scrub": "23265'20864209",
"last_scrub_stamp": "2019-03-22 22:39:13.930673",
"last_deep_scrub": "23265'20864209",
"last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
"last_clean_scrub_stamp": "2019-03-15 01:33:21.447438"
},
"stats": {
"version": "23265'20886859",
"reported_seq": "10109937",
"reported_epoch": "23265",
"state": "active+clean+inconsistent",
"last_fresh": "2019-03-25 15:52:53.720768",
"last_change": "2019-03-22 22:39:13.931038",
"last_active": "2019-03-25 15:52:53.720768",
"last_peered": "2019-03-25 15:52:53.720768",
"last_clean": "2019-03-25 15:52:53.720768",
"last_became_active": "0.00",
"last_became_peered": "0.00",
"last_unstale": "2019-03-25 15:52:53.720768",
"last_undegraded": "2019-03-25 15:52:53.720768",
"last_fullsized": "2019-03-25 15:52:53.720768",
"mapping_epoch": 21472,
"log_start": "23265'20883809",
"ondisk_log_start": "23265'20883809",
"created": 8200,
"last_epoch_clean": 21487,
"parent": "0.0",
"parent_split_bits": 0,
"last_scrub": "23265'20864209",
"last_scrub_stamp": "2019-03-22 22:39:13.930673",
"last_deep_scrub": "23265'20864209",
"last_deep_scrub_stamp": "2019-03-22 22:39:13.930673",
"last_clean_scrub_stamp": "2019-03-15 01:33:21.447438",
"log_size": 3050,
"ondisk_log_size": 3050,
"stats_invalid": "0",
"stat_sum": {
"num_bytes": 8220278746,
"num_objects": 345034,
"num_object_clones": 0,
"num_object_copies": 1035102,
"num_objects_missing_on_primary": 0,
"num_objects_degraded": 0,
"num_objects_misplaced": 0,
"num_objects_unfound": 0,
"num_objects_dirty": 345034,
"num_whiteouts": 0,
"num_read": 7904350,
"num_read_kb": 58116568,
"num_write": 8753504,
"num_write_kb": 85104263,
"num_scrub_errors": 47,
"num_shallow_scrub_errors": 47,
"num_deep_scrub_errors": 0,
"num_objects_recovered": 167138,
"num_bytes_recovered": 5193543924,
"num_keys_recovered": 0,
"num_objects_omap": 0,
"num_objects_hit_set_archive": 0,
"num_bytes_hit_set_archive": 0
},
"up": [
41,
38,
8
],
"acting": [
41,
38,
8
],
"blocked_by": [],
"up_primary": 41,
"acting_primary": 41
},
"empty": 0,
"dne": 0,
"incomplete": 0,
"last_epoch_started": 21481,
"hit_set_history": {
"current_last_update": "0'0",
"current_last_stamp": "0.00",
"current_info": {
"begin": "0.00",
"end": "0.00",
"version": "0'0",
"using_gmt": "0"
},
"history": []
}
},
"peer_info": [
{
"peer": "8",
"pgid": "10.2a",
"last_update": "23265'20886859",
"last_complete": "23265'20886859",
"log_tail": "21395'11840466",
"last_user_version": 11843648,
"last_backfill": "MAX",
"purged_snaps": "[]",
"history": {
"epoch_created": 8200,
"last_epoch_started": 21481,
"last_epoch_clean": 21487,
"last_epoch_split": 0,

Re: [ceph-users] scrub errors

2019-03-25 Thread Brad Hubbard
It would help to know what version you are running but, to begin with,
could you post the output of the following?

$ sudo ceph pg 10.2a query
$ sudo rados list-inconsistent-obj 10.2a --format=json-pretty

Also, have a read of
http://docs.ceph.com/docs/mimic/rados/troubleshooting/troubleshooting-pg/
(adjust the URl for your release).

On Tue, Mar 26, 2019 at 8:19 AM solarflow99  wrote:
>
> I noticed my cluster has scrub errors but the deep-scrub command doesn't show 
> any errors.  Is there any way to know what it takes to fix it?
>
>
>
> # ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
> pg 10.2a is active+clean+inconsistent, acting [41,38,8]
> 47 scrub errors
>
> # zgrep 10.2a /var/log/ceph/ceph.log*
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 16:20:18.148299 osd.41 
> 192.168.4.19:6809/30077 54885 : cluster [INF] 10.2a deep-scrub starts
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024040 osd.41 
> 192.168.4.19:6809/30077 54886 : cluster [ERR] 10.2a shard 38 missing 
> 10/24083d2a/ec50777d-cc99-46a8-8610-4492213f412f/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024049 osd.41 
> 192.168.4.19:6809/30077 54887 : cluster [ERR] 10.2a shard 38 missing 
> 10/ff183d2a/fce859b9-61a9-46cb-82f1-4b4af31c10db/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024074 osd.41 
> 192.168.4.19:6809/30077 54888 : cluster [ERR] 10.2a shard 38 missing 
> 10/34283d2a/4b7c96cb-c494-4637-8669-e42049bd0e1c/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024076 osd.41 
> 192.168.4.19:6809/30077 54889 : cluster [ERR] 10.2a shard 38 missing 
> 10/df283d2a/bbe61149-99f8-4b83-a42b-b208d18094a8/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024077 osd.41 
> 192.168.4.19:6809/30077 54890 : cluster [ERR] 10.2a shard 38 missing 
> 10/35383d2a/60e8ed9b-bd04-5a43-8917-6f29eba28a66:0014/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024078 osd.41 
> 192.168.4.19:6809/30077 54891 : cluster [ERR] 10.2a shard 38 missing 
> 10/d5383d2a/2bdeb186-561b-4151-b87e-fe7c2e217d41/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024080 osd.41 
> 192.168.4.19:6809/30077 54892 : cluster [ERR] 10.2a shard 38 missing 
> 10/a7383d2a/b6b9d21d-2f4f-4550-8928-52552349db7d/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024081 osd.41 
> 192.168.4.19:6809/30077 54893 : cluster [ERR] 10.2a shard 38 missing 
> 10/9c383d2a/5b552687-c709-4e87-b773-1cce5b262754/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024082 osd.41 
> 192.168.4.19:6809/30077 54894 : cluster [ERR] 10.2a shard 38 missing 
> 10/5d383d2a/cb1a2ea8-0872-4de9-8b93-5ea8d9d8e613/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024083 osd.41 
> 192.168.4.19:6809/30077 54895 : cluster [ERR] 10.2a shard 38 missing 
> 10/8f483d2a/74c7a2b9-f00a-4c89-afbd-c1b8439234ac/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024085 osd.41 
> 192.168.4.19:6809/30077 54896 : cluster [ERR] 10.2a shard 38 missing 
> 10/b1583d2a/b3f00768-82a2-4637-91d1-164f3a51312a/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024086 osd.41 
> 192.168.4.19:6809/30077 54897 : cluster [ERR] 10.2a shard 38 missing 
> 10/35583d2a/e347aff4-7b71-476e-863a-310e767e4160/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024088 osd.41 
> 192.168.4.19:6809/30077 54898 : cluster [ERR] 10.2a shard 38 missing 
> 10/69583d2a/0805d07a-49d1-44cb-87c7-3bd73a0ce692/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024122 osd.41 
> 192.168.4.19:6809/30077 54899 : cluster [ERR] 10.2a shard 38 missing 
> 10/1a583d2a/d65bcf6a-9457-46c3-8fbc-432ebbaad89a/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024123 osd.41 
> 192.168.4.19:6809/30077 54900 : cluster [ERR] 10.2a shard 38 missing 
> 10/6d583d2a/5592f7d6-a131-4eb2-a3dd-b2d96691dd7e/head
> /var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024124 osd.41 
> 192.168.4.19:6809/30077 54901 : cluster [ERR] 10.2a shard 38 missing 
> 10/f0683d2a/81897399-4cb0-59b3-b9ae-bf043a272137:0003/head
>
>
>
> # ceph pg deep-scrub 10.2a
> instructing pg 10.2a on osd.41 to deep-scrub
>
>
> # ceph -w | grep 10.2a
>
>



-- 
Cheers,
Brad


[ceph-users] scrub errors

2019-03-25 Thread solarflow99
I noticed my cluster has scrub errors but the deep-scrub command doesn't
show any errors.  Is there any way to know what it takes to fix it?



# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 47 scrub errors
pg 10.2a is active+clean+inconsistent, acting [41,38,8]
47 scrub errors

# zgrep 10.2a /var/log/ceph/ceph.log*
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 16:20:18.148299 osd.41
192.168.4.19:6809/30077 54885 : cluster [INF] 10.2a deep-scrub starts
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024040 osd.41
192.168.4.19:6809/30077 54886 : cluster [ERR] 10.2a shard 38 missing
10/24083d2a/ec50777d-cc99-46a8-8610-4492213f412f/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024049 osd.41
192.168.4.19:6809/30077 54887 : cluster [ERR] 10.2a shard 38 missing
10/ff183d2a/fce859b9-61a9-46cb-82f1-4b4af31c10db/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024074 osd.41
192.168.4.19:6809/30077 54888 : cluster [ERR] 10.2a shard 38 missing
10/34283d2a/4b7c96cb-c494-4637-8669-e42049bd0e1c/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024076 osd.41
192.168.4.19:6809/30077 54889 : cluster [ERR] 10.2a shard 38 missing
10/df283d2a/bbe61149-99f8-4b83-a42b-b208d18094a8/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024077 osd.41
192.168.4.19:6809/30077 54890 : cluster [ERR] 10.2a shard 38 missing
10/35383d2a/60e8ed9b-bd04-5a43-8917-6f29eba28a66:0014/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024078 osd.41
192.168.4.19:6809/30077 54891 : cluster [ERR] 10.2a shard 38 missing
10/d5383d2a/2bdeb186-561b-4151-b87e-fe7c2e217d41/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024080 osd.41
192.168.4.19:6809/30077 54892 : cluster [ERR] 10.2a shard 38 missing
10/a7383d2a/b6b9d21d-2f4f-4550-8928-52552349db7d/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024081 osd.41
192.168.4.19:6809/30077 54893 : cluster [ERR] 10.2a shard 38 missing
10/9c383d2a/5b552687-c709-4e87-b773-1cce5b262754/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024082 osd.41
192.168.4.19:6809/30077 54894 : cluster [ERR] 10.2a shard 38 missing
10/5d383d2a/cb1a2ea8-0872-4de9-8b93-5ea8d9d8e613/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024083 osd.41
192.168.4.19:6809/30077 54895 : cluster [ERR] 10.2a shard 38 missing
10/8f483d2a/74c7a2b9-f00a-4c89-afbd-c1b8439234ac/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024085 osd.41
192.168.4.19:6809/30077 54896 : cluster [ERR] 10.2a shard 38 missing
10/b1583d2a/b3f00768-82a2-4637-91d1-164f3a51312a/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024086 osd.41
192.168.4.19:6809/30077 54897 : cluster [ERR] 10.2a shard 38 missing
10/35583d2a/e347aff4-7b71-476e-863a-310e767e4160/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024088 osd.41
192.168.4.19:6809/30077 54898 : cluster [ERR] 10.2a shard 38 missing
10/69583d2a/0805d07a-49d1-44cb-87c7-3bd73a0ce692/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024122 osd.41
192.168.4.19:6809/30077 54899 : cluster [ERR] 10.2a shard 38 missing
10/1a583d2a/d65bcf6a-9457-46c3-8fbc-432ebbaad89a/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024123 osd.41
192.168.4.19:6809/30077 54900 : cluster [ERR] 10.2a shard 38 missing
10/6d583d2a/5592f7d6-a131-4eb2-a3dd-b2d96691dd7e/head
/var/log/ceph/ceph.log-20190323.gz:2019-03-22 18:29:02.024124 osd.41
192.168.4.19:6809/30077 54901 : cluster [ERR] 10.2a shard 38 missing
10/f0683d2a/81897399-4cb0-59b3-b9ae-bf043a272137:0003/head



# ceph pg deep-scrub 10.2a
instructing pg 10.2a on osd.41 to deep-scrub


# ceph -w | grep 10.2a


Re: [ceph-users] scrub errors

2018-10-23 Thread Sergey Malinin
There is an osd_scrub_auto_repair setting which defaults to 'false'.
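For example (a sketch; double-check the option name and the companion osd_scrub_auto_repair_num_errors limit for your release before relying on it):

# ceph tell osd.* injectargs '--osd_scrub_auto_repair true'
(or persistently: 'osd scrub auto repair = true' in the [osd] section of ceph.conf)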


> On 23.10.2018, at 12:12, Dominque Roux  wrote:
> 
> Hi all,
> 
> We lately faced several scrub errors.
> All of them were more or less easily fixed with the ceph pg repair X.Y
> command.
> 
> We're using ceph version 12.2.7 and have SSD and HDD pools.
> 
> Is there a way to prevent our datastore from these kind of errors, or is
> there a way to automate the fix (It would be rather easy to create a
> bash script)
> 
> Thank you very much for your help!
> 
> Best regards,
> 
> Dominique
> 
> -- 
> Your Swiss, Open Source and IPv6 Virtual Machine. Now on
> www.datacenterlight.ch



[ceph-users] scrub errors

2018-10-23 Thread Dominque Roux
Hi all,

We lately faced several scrub errors.
All of them were more or less easily fixed with the ceph pg repair X.Y
command.

We're using ceph version 12.2.7 and have SSD and HDD pools.

Is there a way to prevent our datastore from these kinds of errors, or is
there a way to automate the fix? (It would be rather easy to create a
bash script.)
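For reference, the kind of script meant here, as a sketch only; it blindly repairs every PG reported inconsistent, which may not be appropriate if the root cause is unknown:

#!/bin/bash
# repair every PG currently flagged active+clean+inconsistent
for pg in $(ceph health detail | awk '/active\+clean\+inconsistent/ {print $2}'); do
    echo "repairing $pg"
    ceph pg repair "$pg"
done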

Thank you very much for your help!

Best regards,

Dominique

-- 
Your Swiss, Open Source and IPv6 Virtual Machine. Now on
www.datacenterlight.ch


Re: [ceph-users] Scrub Errors

2016-05-06 Thread Blade

Oliver Dzombic  writes:

> 
> Hi Blade,
> 
> you can try to set the min_size to 1, to get it back online, and if/when
> the error vanish ( maybe after another repair command ) you can set the
> min_size again to 2.
> 
> you can try to simply out/down/?remove? the osd where it is on.
> 


Hi Oliver,

Thanks much for your suggestions!

So setting all pools' min replication size down to 1 did get my cluster back
online.  I was then able to "repair" pg 1.32.

However, there were still "139 scrub errors" which did not "repair".  Again,
I would issue a "pg repair" command but the OSD did not seem to get the
command.

Then I restarted one of the OSDs that had a pg in an inconsistent state,
and again asked Ceph to repair the pg, and this time it worked!

So I wrote a quick for-loop to issue a repair command for each pg:
for pg in $(ceph health detail | grep ^pg | awk '{print $2}'); do ceph pg repair $pg; done

While running that and watching the OSD logs I saw that after the first few
repairs, the repair commands were not actually getting to the OSD owning the
pgs.  And again, restarting an OSD before sending a repair command fixed
that.  (Is it possible there is a queue of repair requests and, if a repair
fails, it blocks the queue?)  After many OSD restarts I finally repaired all
the pgs.
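(Roughly what the restart-then-repair sequence amounts to, as a sketch; it assumes jq and systemd-managed OSDs, so adjust for your init system:)

for pg in $(ceph health detail | grep ^pg | awk '{print $2}'); do
    primary=$(ceph pg "$pg" query | jq .info.stats.acting_primary)
    systemctl restart ceph-osd@"$primary"
    sleep 30                    # give the OSD time to come back up and peer
    ceph pg repair "$pg"
done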

So I am very happy that my cluster is fixed now, but also very confused
about why the OSDs needed to be restarted repeatedly for the repair commands
to run.

Now I'm in the process of increasing the replication level back up.

Thanks again,
Blade.







Re: [ceph-users] Scrub Errors

2016-05-04 Thread Oliver Dzombic
Hi Blade,

You can try to set min_size to 1 to get it back online, and if/when
the errors vanish (maybe after another repair command) you can set
min_size back to 2.

You can also try to simply out/down/(remove?) the OSD it is on.
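i.e. something like the following sketch, where <pool> is whichever pool the stuck pg belongs to:

# ceph osd pool set <pool> min_size 1
# ... let it go active and run the repair ...
# ceph osd pool set <pool> min_size 2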


-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402, Hanau district court
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 04.05.2016 at 22:46, Blade Doyle wrote:
> 
> When I issue the "ceph pg repair 1.32" command I *do* see it reported in
> the "ceph -w" output but I *do not* see any new messages about page 1.32
> in the log of osd.6 - even if I turn debug messages way up. 
> 
> # ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
> 
> (ceph -w shows)
> 2016-05-04 11:19:50.528355 mon.0 [INF] from='client.?
> 192.168.2.224:0/1341169978 '
> entity='client.admin' cmd=[{"prefix": "pg repair", "pgid": "1.32"}]:
> dispatch
> 
> ---
> 
> Yes, I also noticed that there is only one copy of that pg.  I have no
> idea how it happened, but my pools (all of them) got set to replication
> size=1.  I re-set them back to the intended values as soon as I noticed
> it.  Currently the pools are configured like this:
> 
> # ceph osd pool ls detail
> pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash
> rjenkins pg_num 64 pgp_num 64 last_change 349499 flags hashpspool
> stripe_width 0
> removed_snaps [1~d]
> pool 1 'cephfs_data' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349490 lfor
> 25902 flags hashpspool crash_replay_interval 45 tiers 4 read_tier 4
> write_tier 4 stripe_width 0
> pool 2 'cephfs_metadata' replicated size 2 min_size 2 crush_ruleset 0
> object_hash rjenkins pg_num 300 pgp_num 300 last_change 349503 flags
> hashpspool stripe_width 0
> pool 4 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 0
> object_hash rjenkins pg_num 256 pgp_num 256 last_change 349490 flags
> hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
> 126701535232 target_objects 100 hit_set
> bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s
> x2 min_read_recency_for_promote 1 stripe_width 0
> 
> # ceph osd tree
> ID  WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -12 0.3 root ssd_cache
>  -4 0.2 host node11
>   8 0.2 osd.8up  1.0  1.0
> -11 0.2 host node13
>   0 0.2 osd.0up  1.0  1.0
>  -1 2.7 root default
>  -7 0.2 host node6
>   7 0.2 osd.7up  0.72400  1.0
>  -8 0.23000 host node5
>   5 0.23000 osd.5up  0.67996  1.0
>  -6 0.45999 host node12
>   9 0.45999 osd.9up  0.72157  1.0
> -10 0.67000 host node14
>  10 0.67000 osd.10   up  0.70659  1.0
> -13 0.67000 host node22
>   6 0.67000 osd.6up  0.69070  1.0
> -15 0.67000 host node21
>  11 0.67000 osd.11   up  0.69788  1.0
> 
> --
> 
> For the most part data in my ceph cluster is not critical.  Also, I have
> a recent backup.  At this point I would be happy to resolve the pg
> problems "any way possible" in order to get it working again.  Can I
> just delete the problematic pg (or the versions of it that are broken)?
> 
> I tried some commands to "accept the missing objects as lost" but it
> tells me:
> 
> # ceph pg 1.32 mark_unfound_lost delete
> pg has no unfound objects
> 
> The osd log for that is:
> 2016-05-04 11:31:03.742453 9b088350  0 osd.6 350327 do_command r=0
> 2016-05-04 11:31:03.763017 9b088350  0 osd.6 350327 do_command r=0 pg
> has no unfound objects
> 2016-05-04 11:31:03.763066 9b088350  0 log_channel(cluster) log [INF] :
> pg has no unfound objects
> 
> 
> I also tried to "force create" the page:
> # ceph pg force_create_pg 1.32
> pg 1.32 now creating, ok
> 
> In that case, I do see a dispatch:
> 2016-05-04 11:32:42.073625 mon.4 [INF] from='client.?
> 192.168.2.224:0/208882728 '
> entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]: dispatch
> 2016-05-04 11:32:42.075024 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]: dispatch
> 2016-05-04 11:32:42.183389 mon.0 [INF] from='client.17514719 :/0'
> entity='client.admin' cmd='[{"prefix": "pg force_create_pg", "pgid":
> "1.32"}]': finished
> 
> That puts the page in a new state for a while:
> # ceph health detail | grep 1.32
> pg 1.32 is stuck inactive since forever, current state creating, last
> acting []
> pg 1.32 is stuck unclean since forever, current state creating, last
> acting []
> 
> But 

Re: [ceph-users] Scrub Errors

2016-05-04 Thread Blade Doyle
When I issue the "ceph pg repair 1.32" command I *do* see it reported in
the "ceph -w" output but I *do not* see any new messages about pg 1.32 in
the log of osd.6 - even if I turn debug messages way up.

# ceph pg repair 1.32
instructing pg 1.32 on osd.6 to repair

(ceph -w shows)
2016-05-04 11:19:50.528355 mon.0 [INF] from='client.?
192.168.2.224:0/1341169978' entity='client.admin' cmd=[{"prefix": "pg
repair", "pgid": "1.32"}]: dispatch

---

Yes, I also noticed that there is only one copy of that pg.  I have no idea
how it happened, but my pools (all of them) got set to replication size=1.
I re-set them back to the intended values as soon as I noticed it.
Currently the pools are configured like this:

# ceph osd pool ls detail
pool 0 'rbd' replicated size 2 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 349499 flags hashpspool
stripe_width 0
removed_snaps [1~d]
pool 1 'cephfs_data' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 300 pgp_num 300 last_change 349490 lfor 25902
flags hashpspool crash_replay_interval 45 tiers 4 read_tier 4 write_tier 4
stripe_width 0
pool 2 'cephfs_metadata' replicated size 2 min_size 2 crush_ruleset 0
object_hash rjenkins pg_num 300 pgp_num 300 last_change 349503 flags
hashpspool stripe_width 0
pool 4 'ssd_cache' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 256 pgp_num 256 last_change 349490 flags
hashpspool,incomplete_clones tier_of 1 cache_mode writeback target_bytes
126701535232 target_objects 100 hit_set
bloom{false_positive_probability: 0.05, target_size: 0, seed: 0} 3600s x2
min_read_recency_for_promote 1 stripe_width 0

# ceph osd tree
ID  WEIGHT  TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-12 0.3 root ssd_cache
 -4 0.2 host node11
  8 0.2 osd.8up  1.0  1.0
-11 0.2 host node13
  0 0.2 osd.0up  1.0  1.0
 -1 2.7 root default
 -7 0.2 host node6
  7 0.2 osd.7up  0.72400  1.0
 -8 0.23000 host node5
  5 0.23000 osd.5up  0.67996  1.0
 -6 0.45999 host node12
  9 0.45999 osd.9up  0.72157  1.0
-10 0.67000 host node14
 10 0.67000 osd.10   up  0.70659  1.0
-13 0.67000 host node22
  6 0.67000 osd.6up  0.69070  1.0
-15 0.67000 host node21
 11 0.67000 osd.11   up  0.69788  1.0

--

For the most part data in my ceph cluster is not critical.  Also, I have a
recent backup.  At this point I would be happy to resolve the pg problems
"any way possible" in order to get it working again.  Can I just delete the
problematic pg (or the versions of it that are broken)?

I tried some commands to "accept the missing objects as lost" but it tells
me:

# ceph pg 1.32 mark_unfound_lost delete
pg has no unfound objects

The osd log for that is:
2016-05-04 11:31:03.742453 9b088350  0 osd.6 350327 do_command r=0
2016-05-04 11:31:03.763017 9b088350  0 osd.6 350327 do_command r=0 pg has
no unfound objects
2016-05-04 11:31:03.763066 9b088350  0 log_channel(cluster) log [INF] : pg
has no unfound objects
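
For completeness, a short sketch of read-only commands that should show
whether the pg really has any unfound or missing objects before trying to
mark them lost (pgid 1.32 assumed, as above):

ceph health detail | grep unfound
ceph pg 1.32 list_missing
ceph pg 1.32 query | grep num_objects_unfound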


I also tried to "force create" the page:
# ceph pg force_create_pg 1.32
pg 1.32 now creating, ok

In that case, I do see a dispatch:
2016-05-04 11:32:42.073625 mon.4 [INF] from='client.?
192.168.2.224:0/208882728' entity='client.admin' cmd=[{"prefix": "pg
force_create_pg", "pgid": "1.32"}]: dispatch
2016-05-04 11:32:42.075024 mon.0 [INF] from='client.17514719 :/0'
entity='client.admin' cmd=[{"prefix": "pg force_create_pg", "pgid":
"1.32"}]: dispatch
2016-05-04 11:32:42.183389 mon.0 [INF] from='client.17514719 :/0'
entity='client.admin' cmd='[{"prefix": "pg force_create_pg", "pgid":
"1.32"}]': finished

That puts the pg in a new state for a while:
# ceph health detail | grep 1.32
pg 1.32 is stuck inactive since forever, current state creating, last
acting []
pg 1.32 is stuck unclean since forever, current state creating, last acting
[]

But after a few minutes it returns to the previous state:

# ceph health detail | grep 1.32
pg 1.32 is stuck inactive for 160741.831891, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck unclean for 1093042.263678, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck undersized for 57229.481051, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck degraded for 57229.481382, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is undersized+degraded+peered, acting [6]
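
In case it helps, a rough sketch of read-only commands for checking why
CRUSH might not be able to place a second copy of this pg (pool name
cephfs_data taken from the pool listing above; interpreting the rule
output is left to the reader):

ceph osd pool get cephfs_data crush_ruleset
ceph osd crush rule dump
# compare the rule's chooseleaf step with the hosts available under root default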

Blade.


On Tue, May 3, 2016 at 10:45 AM, Oliver Dzombic wrote:

> Hi Blade,
>
> If you don't see anything in the logs, then you should raise the debug
> level/frequency.
>
> You must at least see that the repair command has been issued (started).
>
> Also, I am wondering about the [6] in your output.
>
> 

Re: [ceph-users] Scrub Errors

2016-05-03 Thread Oliver Dzombic
Hi Blade,

If you don't see anything in the logs, then you should raise the debug
level/frequency.

You must at least see that the repair command has been issued (started).
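
For example, a rough sketch (assuming osd.6 is the acting primary and the
default log location) of raising the log level, re-issuing the repair and
watching the primary's log:

# raise verbosity on the acting primary
ceph tell osd.6 injectargs '--debug-osd 20 --debug-ms 1'
# re-issue the repair and follow the log on the osd.6 host
ceph pg repair 1.32
tail -f /var/log/ceph/ceph-osd.6.log | grep -E '1\.32|scrub|repair'
# lower the verbosity again once done
ceph tell osd.6 injectargs '--debug-osd 1 --debug-ms 0'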

Also, I am wondering about the [6] in your output.

That means that there is only 1 copy of it (on osd.6).

What is your setting for the minimum required number of copies?

osd_pool_default_min_size = ??

And what is the setting for the number of copies to create?

osd_pool_default_size = ???

Please give us the output of

ceph osd pool ls detail
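
If those turn out to be wrong, a sketch of how they could be checked and
adjusted per pool (cephfs_data is only an example pool name and the values
are only examples; the last command has to run on the node hosting osd.6):

ceph osd pool get cephfs_data size
ceph osd pool get cephfs_data min_size
ceph osd pool set cephfs_data size 2
ceph osd pool set cephfs_data min_size 1
ceph daemon osd.6 config show | grep osd_pool_default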

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at Amtsgericht Hanau (Hanau local court)
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 03.05.2016 at 19:11, Blade Doyle wrote:
> Hi Oliver,
> 
> Thanks for your reply.
> 
> The problem could have been caused by crashing/flapping OSDs. The
> cluster is stable now, but lots of pg problems remain.
> 
> $ ceph health
> HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
> pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
> undersized; recovery 1489/523934 objects degraded (0.284%); recovery
> 2620/523934 objects misplaced (0.500%); 158 scrub errors
> 
> Example: for pg 1.32 :
> 
> $ ceph health detail | grep "pg 1.32"
> pg 1.32 is stuck inactive for 13260.118985, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck unclean for 945560.550800, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck undersized for 12855.304944, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is stuck degraded for 12855.305305, current state
> undersized+degraded+peered, last acting [6]
> pg 1.32 is undersized+degraded+peered, acting [6]
> 
> I tried various things like:
> 
> $ ceph pg repair 1.32
> instructing pg 1.32 on osd.6 to repair
> 
> $ ceph pg deep-scrub 1.32
> instructing pg 1.32 on osd.6 to deep-scrub
> 
> It's odd that I never see any log messages on osd.6 about scrubbing or
> repairing that pg (after waiting many hours).  I attached "ceph pg
> query" and a grep of the osd logs for that pg.  If there is a better way
> to provide large logs please let me know.
> 
> For reference the last mention of that pg in the logs is:
> 
> 2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418  kicking pg 1.32
> 2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
> 338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
> 349347/349347 349418/349418/349418) [] r=-1 lpr=349418
> pi=349346-349417/1 crt=338815'7743 lcod 0'0 inactive NOTIFY] lock
> 
> 
> Suggestions appreciated,
> Blade.
> 
> 
> 
> 
> On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle wrote:
> 
> Hi Ceph-Users,
> 
> Help with how to resolve these would be appreciated.
> 
> 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log
> [INF] : 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640
>  >> 192.168.2.32:0/3983425916
>  pipe(0x27406000 sd=111 :6800 s=0
> pgs=0 cs=0 l=0 c=0x272da0a0).accept peer addr is really
> 192.168.2.32:0/3983425916  (socket
> is 192.168.2.32:38514/0 )
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0
> clones, 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137
> whiteouts, 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log
> [INF] : 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones,
> 145/145 dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log
> [ERR] : 4.97 scrub 1 errors
> 
> Thanks Much,
> Blade.
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrub Errors

2016-05-03 Thread Blade Doyle
Hi Oliver,

Thanks for your reply.

The problem could have been caused by crashing/flapping OSDs. The cluster
is stable now, but lots of pg problems remain.

$ ceph health
HEALTH_ERR 4 pgs degraded; 158 pgs inconsistent; 4 pgs stuck degraded; 1
pgs stuck inactive; 10 pgs stuck unclean; 4 pgs stuck undersized; 4 pgs
undersized; recovery 1489/523934 objects degraded (0.284%); recovery
2620/523934 objects misplaced (0.500%); 158 scrub errors

Example: for pg 1.32 :

$ ceph health detail | grep "pg 1.32"
pg 1.32 is stuck inactive for 13260.118985, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck unclean for 945560.550800, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck undersized for 12855.304944, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is stuck degraded for 12855.305305, current state
undersized+degraded+peered, last acting [6]
pg 1.32 is undersized+degraded+peered, acting [6]

I tried various things like:

$ ceph pg repair 1.32
instructing pg 1.32 on osd.6 to repair

$ ceph pg deep-scrub 1.32
instructing pg 1.32 on osd.6 to deep-scrub

It's odd that I never see any log messages on osd.6 about scrubbing or repairing
that pg (after waiting many hours).  I attached "ceph pg query" and a grep
of the osd logs for that pg.  If there is a better way to provide large logs
please let me know.
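
As a sanity check, a small sketch of commands that confirm where the repair
request should be going and whether that daemon is responding (the ceph
daemon calls have to run on the node hosting osd.6):

ceph pg map 1.32
ceph daemon osd.6 status
ceph daemon osd.6 config get debug_osd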

For reference the last mention of that pg in the logs is:

2016-04-30 09:24:44.703785 975b9350 20 osd.6 349418  kicking pg 1.32
2016-04-30 09:24:44.703880 975b9350 30 osd.6 pg_epoch: 349418 pg[1.32( v
338815'7745 (20981'4727,338815'7745] local-les=349347 n=435 ec=17 les/c
349347/349347 349418/349418/349418) [] r=-1 lpr=349418 pi=349346-349417/1
crt=338815'7743 lcod 0'0 inactive NOTIFY] lock


Suggestions appreciated,
Blade.




On Sat, Apr 30, 2016 at 9:31 AM, Blade Doyle  wrote:

> Hi Ceph-Users,
>
> Help with how to resolve these would be appreciated.
>
> 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0 cs=0 l=0
> c=0x272da0a0).accept peer addr is really 192.168.2.32:0/3983425916
> (socket is 192.168.2.32:38514/0)
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145
> dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145 dirty,
> 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts, 365855441/365855441
> bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub 1 errors
>
> Thanks Much,
> Blade.
>
{
    "state": "undersized+degraded+peered",
    "snap_trimq": "[]",
    "epoch": 350071,
    "up": [
        6
    ],
    "acting": [
        6
    ],
    "actingbackfill": [
        "6"
    ],
    "info": {
        "pgid": "1.32",
        "last_update": "338815'7745",
        "last_complete": "338815'7745",
        "log_tail": "20981'4727",
        "last_user_version": 99149,
        "last_backfill": "MAX",
        "purged_snaps": "[]",
        "history": {
            "epoch_created": 17,
            "last_epoch_started": 349421,
            "last_epoch_clean": 349491,
            "last_epoch_split": 0,
            "same_up_since": 349420,
            "same_interval_since": 349490,
            "same_primary_since": 349420,
            "last_scrub": "338815'7745",
            "last_scrub_stamp": "2016-04-21 22:05:56.984147",
            "last_deep_scrub": "338815'7745",
            "last_deep_scrub_stamp": "2016-04-21 22:05:56.984147",
            "last_clean_scrub_stamp": "2016-04-21 22:05:56.984147"
        },
        "stats": {
            "version": "338815'7745",
            "reported_seq": "61243",
            "reported_epoch": "350068",
            "state": "undersized+degraded+peered",
            "last_fresh": "2016-05-02 19:30:21.999749",
            "last_change": "2016-05-02 17:10:46.95",
            "last_active": "2016-05-02 17:04:01.016156",
            "last_peered": "2016-05-02 19:30:21.999749",
            "last_clean": "2016-04-21 22:05:40.584862",
            "last_became_active": "0.00",
            "last_became_peered": "0.00",
            "last_unstale": "2016-05-02 19:30:21.999749",
            "last_undegraded": "2016-05-02 17:10:45.831094",
            "last_fullsized": "2016-05-02 17:10:45.831094",
            "mapping_epoch": 349418,
            "log_start": 

Re: [ceph-users] Scrub Errors

2016-04-30 Thread Oliver Dzombic
Hi,

Please check with

ceph health

which pgs cause trouble.

Please try:

ceph pg repair 4.97

And see whether it can be resolved.

If not, please paste the corresponding log.

That repair can take some time...
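
A rough sketch for keeping an eye on it (default log locations assumed;
the grep has to run on the OSD nodes themselves):

# watch the cluster log while the repair runs
ceph -w | grep 4.97
# pull the scrub/repair lines for that pg out of the local OSD logs
grep -E '4\.97.*(scrub|repair)' /var/log/ceph/ceph-osd.*.log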

-- 
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
IP-Interactive

mailto:i...@ip-interactive.de

Address:

IP Interactive UG ( haftungsbeschraenkt )
Zum Sonnenberg 1-3
63571 Gelnhausen

HRB 93402 at Amtsgericht Hanau (Hanau local court)
Managing director: Oliver Dzombic

Tax no.: 35 236 3622 1
VAT ID: DE274086107


On 30.04.2016 at 18:31, Blade Doyle wrote:
> Hi Ceph-Users,
> 
> Help with how to resolve these would be appreciated.
> 
> 2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 deep-scrub starts
> 2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
> 192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0 cs=0
> l=0 c=0x272da0a0).accept peer addr is really 192.168.2.32:0/3983425916
> (socket is 192.168.2.32:38514/0)
> 2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145
> dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 deep-scrub 1 errors
> 2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log [INF] :
> 4.97 scrub starts
> 2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145
> dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
> 365855441/365855441 bytes,340/340 hit_set_archive bytes.
> 2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log [ERR] :
> 4.97 scrub 1 errors
> 
> Thanks Much,
> Blade.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scrub Errors

2016-04-30 Thread Blade Doyle
Hi Ceph-Users,

Help with how to resolve these would be appreciated.

2016-04-30 09:25:58.399634 9b809350  0 log_channel(cluster) log [INF] :
4.97 deep-scrub starts
2016-04-30 09:26:00.041962 93009350  0 -- 192.168.2.52:6800/6640 >>
192.168.2.32:0/3983425916 pipe(0x27406000 sd=111 :6800 s=0 pgs=0 cs=0 l=0
c=0x272da0a0).accept peer addr is really 192.168.2.32:0/3983425916 (socket
is 192.168.2.32:38514/0)
2016-04-30 09:26:15.415883 9b809350 -1 log_channel(cluster) log [ERR] :
4.97 deep-scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145
dirty, 0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts,
365855441/365855441 bytes,340/340 hit_set_archive bytes.
2016-04-30 09:26:15.415953 9b809350 -1 log_channel(cluster) log [ERR] :
4.97 deep-scrub 1 errors
2016-04-30 09:26:15.416425 9b809350  0 log_channel(cluster) log [INF] :
4.97 scrub starts
2016-04-30 09:26:15.682311 9b809350 -1 log_channel(cluster) log [ERR] :
4.97 scrub stat mismatch, got 284/282 objects, 0/0 clones, 145/145 dirty,
0/0 omap, 4/2 hit_set_archive, 137/137 whiteouts, 365855441/365855441
bytes,340/340 hit_set_archive bytes.
2016-04-30 09:26:15.682392 9b809350 -1 log_channel(cluster) log [ERR] :
4.97 scrub 1 errors

Thanks Much,
Blade.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] scrub errors continue with 0.80.4

2014-07-18 Thread Randy Smith
Greetings,

I upgraded to 0.80.4 last night to resolve the inconsistent pg scrub errors
I was seeing. Unfortunately, they are continuing.

$ ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
pg 3.7f is active+clean+inconsistent, acting [0,4]

And here's the relevant log entries.

2014-07-18 15:04:29.005110 osd.0 192.168.253.92:6800/11023 61 : [ERR] 3.7f
shard 0: soid db656f7f/rb.0.233d.5832ae11.1bb3/head//3 digest
474812490 != known digest 3411276363
2014-07-18 15:04:30.736750 osd.0 192.168.253.92:6800/11023 62 : [ERR] 3.7f
deep-scrub 0 missing, 1 inconsistent objects
2014-07-18 15:04:30.736756 osd.0 192.168.253.92:6800/11023 63 : [ERR] 3.7f
deep-scrub 1 errors

A tarball containing the pgs is at
http://people.adams.edu/~rbsmith/osd-2014-07-18.tar.gz

The servers are running Ubuntu precise.

$ lsb_release -a
LSB Version:
 
core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: Ubuntu
Description:Ubuntu 12.04.4 LTS
Release:12.04
Codename:   precise

$ uname -a
Linux scooby 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014
x86_64 x86_64 x86_64 GNU/Linux

Just to confirm the osd versions:

$ ceph tell osd.0 version
{ version: ceph version 0.80.4
(7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}

$ ceph tell osd.4 version
{ version: ceph version 0.80.4
(7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}

-- 
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub errors continue with 0.80.4

2014-07-18 Thread Gregory Farnum
The config option change in the upgrade will prevent *new* scrub
errors from occurring, but it won't resolve existing ones. You'll need
to run a scrub repair to fix those up.
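
For example, a sketch that queues a repair for every pg currently flagged
inconsistent (assumes a bash shell, and that blindly repairing really is
what you want for each of them):

for pg in $(ceph health detail | awk '$1 == "pg" && /inconsistent/ {print $2}' | sort -u); do
    ceph pg repair "$pg"
done
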
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Fri, Jul 18, 2014 at 2:59 PM, Randy Smith rbsm...@adams.edu wrote:
 Greetings,

 I upgraded to 0.80.4 last night to resolve the inconsistent pg scrub errors
 I was seeing. Unfortunately, they are continuing.

 $ ceph health detail
 HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
 pg 3.7f is active+clean+inconsistent, acting [0,4]

 And here's the relevant log entries.

 2014-07-18 15:04:29.005110 osd.0 192.168.253.92:6800/11023 61 : [ERR] 3.7f
 shard 0: soid db656f7f/rb.0.233d.5832ae11.1bb3/head//3 digest
 474812490 != known digest 3411276363
 2014-07-18 15:04:30.736750 osd.0 192.168.253.92:6800/11023 62 : [ERR] 3.7f
 deep-scrub 0 missing, 1 inconsistent objects
 2014-07-18 15:04:30.736756 osd.0 192.168.253.92:6800/11023 63 : [ERR] 3.7f
 deep-scrub 1 errors

 A tarball containing the pgs is at
 http://people.adams.edu/~rbsmith/osd-2014-07-18.tar.gz

 The servers are running Ubuntu precise.

 $ lsb_release -a
 LSB Version:
 core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
 Distributor ID: Ubuntu
 Description:Ubuntu 12.04.4 LTS
 Release:12.04
 Codename:   precise

 $ uname -a
 Linux scooby 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014
 x86_64 x86_64 x86_64 GNU/Linux

 Just to confirm the osd versions:

 $ ceph tell osd.0 version
 { version: ceph version 0.80.4
 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}

 $ ceph tell osd.4 version
 { version: ceph version 0.80.4
 (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}

 --
 Randall Smith
 Computing Services
 Adams State University
 http://www.adams.edu/
 719-587-7741

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub errors continue with 0.80.4

2014-07-18 Thread Randy Smith
Greg,

This error occurred AFTER the upgrade. I upgraded to 0.80.4 last night and
this error cropped up this afternoon. I ran `ceph pg repair 3.7f` (after I
copied the pgs) which returned the cluster to health. However, I'm
concerned that this showed up again so soon after I upgraded to 0.80.4.

Is there something I need to do to ensure that there are no lingering
errors waiting for a scrub to find, or should I just wait it out for a couple
of days and hope that my data is safe?


On Fri, Jul 18, 2014 at 4:01 PM, Gregory Farnum g...@inktank.com wrote:

 The config option change in the upgrade will prevent *new* scrub
 errors from occurring, but it won't resolve existing ones. You'll need
 to run a scrub repair to fix those up.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Fri, Jul 18, 2014 at 2:59 PM, Randy Smith rbsm...@adams.edu wrote:
  Greetings,
 
  I upgraded to 0.80.4 last night to resolve the inconsistent pg scrub
 errors
  I was seeing. Unfortunately, they are continuing.
 
  $ ceph health detail
  HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
  pg 3.7f is active+clean+inconsistent, acting [0,4]
 
  And here's the relevant log entries.
 
  2014-07-18 15:04:29.005110 osd.0 192.168.253.92:6800/11023 61 : [ERR]
 3.7f
  shard 0: soid db656f7f/rb.0.233d.5832ae11.1bb3/head//3 digest
  474812490 != known digest 3411276363
  2014-07-18 15:04:30.736750 osd.0 192.168.253.92:6800/11023 62 : [ERR]
 3.7f
  deep-scrub 0 missing, 1 inconsistent objects
  2014-07-18 15:04:30.736756 osd.0 192.168.253.92:6800/11023 63 : [ERR]
 3.7f
  deep-scrub 1 errors
 
  A tarball containing the pgs is at
  http://people.adams.edu/~rbsmith/osd-2014-07-18.tar.gz
 
  The servers are running Ubuntu precise.
 
  $ lsb_release -a
  LSB Version:
 
 core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
  Distributor ID: Ubuntu
  Description:Ubuntu 12.04.4 LTS
  Release:12.04
  Codename:   precise
 
  $ uname -a
  Linux scooby 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014
  x86_64 x86_64 x86_64 GNU/Linux
 
  Just to confirm the osd versions:
 
  $ ceph tell osd.0 version
  { version: ceph version 0.80.4
  (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}
 
  $ ceph tell osd.4 version
  { version: ceph version 0.80.4
  (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}
 
  --
  Randall Smith
  Computing Services
  Adams State University
  http://www.adams.edu/
  719-587-7741
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 




-- 
Randall Smith
Computing Services
Adams State University
http://www.adams.edu/
719-587-7741
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrub errors continue with 0.80.4

2014-07-18 Thread Gregory Farnum
It's just because the PG hadn't been scrubbed since the error occurred;
then you upgraded, it scrubbed, and the error was found. You can deep-scrub
all your PGs to check them if you like, but as I've said elsewhere this
issue -- while scary! -- shouldn't actually damage any of your user data,
so just letting it come up in the normal scrub schedule and running a
repair to fix any differences found ought to be fine.
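
If you do want to check everything proactively, a sketch (bash assumed;
deep scrubs are I/O heavy, so some pacing is advisable):

for osd in $(ceph osd ls); do
    ceph osd deep-scrub "$osd"
    sleep 60   # crude pacing so that not every OSD deep-scrubs at once
done
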
-Greg

On Friday, July 18, 2014, Randy Smith rbsm...@adams.edu wrote:

 Greg,

 This error occurred AFTER the upgrade. I upgraded to 0.80.4 last night and
 this error cropped up this afternoon. I ran `ceph pg repair 3.7f` (after I
 copied the pgs) which returned the cluster to health. However, I'm
 concerned that this showed up again so soon after I upgraded to 0.80.4.

 Is there something I need to do to ensure that there are no lingering
 errors waiting for a scrub to find, or should I just wait it out for a couple
 of days and hope that my data is safe?


 On Fri, Jul 18, 2014 at 4:01 PM, Gregory Farnum g...@inktank.com wrote:

 The config option change in the upgrade will prevent *new* scrub
 errors from occurring, but it won't resolve existing ones. You'll need
 to run a scrub repair to fix those up.
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


  On Fri, Jul 18, 2014 at 2:59 PM, Randy Smith rbsm...@adams.edu wrote:
  Greetings,
 
  I upgraded to 0.80.4 last night to resolve the inconsistent pg scrub
 errors
  I was seeing. Unfortunately, they are continuing.
 
  $ ceph health detail
  HEALTH_ERR 1 pgs inconsistent; 1 scrub errors
  pg 3.7f is active+clean+inconsistent, acting [0,4]
 
  And here's the relevant log entries.
 
  2014-07-18 15:04:29.005110 osd.0 192.168.253.92:6800/11023 61 : [ERR]
 3.7f
  shard 0: soid db656f7f/rb.0.233d.5832ae11.1bb3/head//3 digest
  474812490 != known digest 3411276363
  2014-07-18 15:04:30.736750 osd.0 192.168.253.92:6800/11023 62 : [ERR]
 3.7f
  deep-scrub 0 missing, 1 inconsistent objects
  2014-07-18 15:04:30.736756 osd.0 192.168.253.92:6800/11023 63 : [ERR]
 3.7f
  deep-scrub 1 errors
 
  A tarball containing the pgs is at
  http://people.adams.edu/~rbsmith/osd-2014-07-18.tar.gz
 
  The servers are running Ubuntu precise.
 
  $ lsb_release -a
  LSB Version:
 
 core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
  Distributor ID: Ubuntu
  Description:Ubuntu 12.04.4 LTS
  Release:12.04
  Codename:   precise
 
  $ uname -a
  Linux scooby 3.2.0-64-generic #97-Ubuntu SMP Wed Jun 4 22:04:21 UTC 2014
  x86_64 x86_64 x86_64 GNU/Linux
 
  Just to confirm the osd versions:
 
  $ ceph tell osd.0 version
  { version: ceph version 0.80.4
  (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}
 
  $ ceph tell osd.4 version
  { version: ceph version 0.80.4
  (7c241cfaa6c8c068bc9da8578ca00b9f4fc7567f)}
 
  --
  Randall Smith
  Computing Services
  Adams State University
  http://www.adams.edu/
  719-587-7741
 
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 




 --
 Randall Smith
 Computing Services
 Adams State University
 http://www.adams.edu/
 719-587-7741



-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com