Guys,

I'm Igor's colleague, working a bit on Ceph together with Igor.

This is a production cluster, and we are becoming more desperate as time
goes by.

I'm not sure if this is the appropriate place to seek commercial support,
but anyhow, here it is...

If anyone feels like it and has experience with this particular kind of PG
troubleshooting, we are also ready to seek commercial support to solve our
issue - company or individual, it doesn't matter.


Thanks,
Andrija

On 20 August 2015 at 19:07, Voloshanenko Igor <[email protected]>
wrote:

> Inktank:
>
> https://download.inktank.com/docs/ICE%201.2%20-%20Cache%20and%20Erasure%20Coding%20FAQ.pdf
>
> Mail-list:
> https://www.mail-archive.com/[email protected]/msg18338.html
>
> 2015-08-20 20:06 GMT+03:00 Samuel Just <[email protected]>:
>
>> Which docs?
>> -Sam
>>
>> On Thu, Aug 20, 2015 at 9:57 AM, Voloshanenko Igor
>> <[email protected]> wrote:
>> > Not yet. I will create one.
>> > But according to the mailing lists and the Inktank docs, it's expected
>> > behaviour when cache is enabled.
>> >
>> > 2015-08-20 19:56 GMT+03:00 Samuel Just <[email protected]>:
>> >>
>> >> Is there a bug for this in the tracker?
>> >> -Sam
>> >>
>> >> On Thu, Aug 20, 2015 at 9:54 AM, Voloshanenko Igor
>> >> <[email protected]> wrote:
>> >> > The issue is that in forward mode fstrim doesn't work properly, and
>> >> > when we take a snapshot the data is not properly updated in the cache
>> >> > layer, so the client (ceph) sees a damaged snap, as the headers are
>> >> > requested from the cache layer.
>> >> >
>> >> > 2015-08-20 19:53 GMT+03:00 Samuel Just <[email protected]>:
>> >> >>
>> >> >> What was the issue?
>> >> >> -Sam
>> >> >>
>> >> >> On Thu, Aug 20, 2015 at 9:41 AM, Voloshanenko Igor
>> >> >> <[email protected]> wrote:
>> >> >> > Samuel, we turned off the cache layer a few hours ago...
>> >> >> > I will post ceph.log in a few minutes.
>> >> >> >
>> >> >> > For the snap - we found the issue; it was connected with the cache tier..
>> >> >> >
>> >> >> > 2015-08-20 19:23 GMT+03:00 Samuel Just <[email protected]>:
>> >> >> >>
>> >> >> >> Ok, you appear to be using a replicated cache tier in front of a
>> >> >> >> replicated base tier.  Please scrub both inconsistent pgs and post
>> >> >> >> the ceph.log from before when you started the scrub until after.
>> >> >> >> Also, what command are you using to take snapshots?
>> >> >> >> -Sam
>> >> >> >>
>> >> >> >> On Thu, Aug 20, 2015 at 3:59 AM, Voloshanenko Igor
>> >> >> >> <[email protected]> wrote:
>> >> >> >> > Hi Samuel, we tried to fix it in a tricky way.
>> >> >> >> >
>> >> >> >> > We checked all affected rbd_data chunks from the OSD logs, then
>> >> >> >> > queried rbd info to see which rbd consists of the bad rbd_data.
>> >> >> >> > After that we mounted this rbd as rbd0, created an empty rbd,
>> >> >> >> > and dd'd all data from the bad volume to the new one.
>> >> >> >> >
>> >> >> >> > But after that the scrub errors keep growing... Was 15 errors...
>> >> >> >> > Now 35... We also tried to out the OSD which was lead, but after
>> >> >> >> > rebalancing these 2 pgs still have 35 scrub errors...
>> >> >> >> >
>> >> >> >> > ceph osd getmap -o <outfile> - attached
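[For reference, the copy-out workaround described above (map the bad image, create a fresh one, dd across) can be sketched roughly as follows. Pool and image names are placeholders, not taken from this cluster, and the dd step is demonstrated on two temp files standing in for the mapped /dev/rbdX devices, since it runs the same either way.]

```shell
# Rough sketch of the workaround described above.  On the cluster the
# block devices would come from "rbd map"; here two temp files stand in
# for them so the copy step itself is demonstrable.  Names are placeholders.
#
#   rbd info   <pool>/<bad-image>        # block_name_prefix should match the bad rbd_data prefix
#   rbd create <pool>/<bad-image>-new --size <MB>
#   rbd map    <pool>/<bad-image>        # -> e.g. /dev/rbd0
#   rbd map    <pool>/<bad-image>-new    # -> e.g. /dev/rbd1
src=$(mktemp)   # stands in for /dev/rbd0 (the damaged image)
dst=$(mktemp)   # stands in for /dev/rbd1 (the fresh image)
dd if=/dev/urandom of="$src" bs=1M count=4 status=none
# The actual copy; conv=sparse avoids writing all-zero blocks to the new image.
dd if="$src" of="$dst" bs=1M conv=sparse status=none
cmp -s "$src" "$dst" && result=identical || result=differs
echo "dd copy: $result"
rm -f "$src" "$dst"
```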
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > 2015-08-18 18:48 GMT+03:00 Samuel Just <[email protected]>:
>> >> >> >> >>
>> >> >> >> >> Is the number of inconsistent objects growing?  Can you attach
>> >> >> >> >> the whole ceph.log from the 6 hours before and after the
>> >> >> >> >> snippet you linked above?  Are you using cache/tiering?  Can
>> >> >> >> >> you attach the osdmap (ceph osd getmap -o <outfile>)?
>> >> >> >> >> -Sam
>> >> >> >> >>
>> >> >> >> >> On Tue, Aug 18, 2015 at 4:15 AM, Voloshanenko Igor
>> >> >> >> >> <[email protected]> wrote:
>> >> >> >> >> > ceph - 0.94.2
>> >> >> >> >> > It happened during rebalancing.
>> >> >> >> >> >
>> >> >> >> >> > I thought too that some OSD misses a copy, but it looks like
>> >> >> >> >> > all of them miss it...
>> >> >> >> >> > So any advice on which direction I need to go?
>> >> >> >> >> >
>> >> >> >> >> > 2015-08-18 14:14 GMT+03:00 Gregory Farnum <[email protected]>:
>> >> >> >> >> >>
>> >> >> >> >> >> From a quick peek it looks like some of the OSDs are missing
>> >> >> >> >> >> clones of objects. I'm not sure how that could happen and
>> >> >> >> >> >> I'd expect the pg repair to handle that, but if it's not,
>> >> >> >> >> >> there's probably something wrong; what version of Ceph are
>> >> >> >> >> >> you running? Sam, is this something you've seen, a new bug,
>> >> >> >> >> >> or some kind of config issue?
>> >> >> >> >> >> -Greg
>> >> >> >> >> >>
>> >> >> >> >> >> On Tue, Aug 18, 2015 at 6:27 AM, Voloshanenko Igor
>> >> >> >> >> >> <[email protected]> wrote:
>> >> >> >> >> >> > Hi all, at our production cluster, due to high rebalancing
>> >> >> >> >> >> > ((( we have 2 pgs in an inconsistent state...
>> >> >> >> >> >> >
>> >> >> >> >> >> > root@temp:~# ceph health detail | grep inc
>> >> >> >> >> >> > HEALTH_ERR 2 pgs inconsistent; 18 scrub errors
>> >> >> >> >> >> > pg 2.490 is active+clean+inconsistent, acting [56,15,29]
>> >> >> >> >> >> > pg 2.c4 is active+clean+inconsistent, acting [56,10,42]
>> >> >> >> >> >> >
>> >> >> >> >> >> > From the OSD logs, after the recovery attempt:
>> >> >> >> >> >> >
>> >> >> >> >> >> > root@test:~# ceph pg dump | grep -i incons | cut -f 1 | while read i; do ceph pg repair ${i}; done
>> >> >> >> >> >> > dumped all in format plain
>> >> >> >> >> >> > instructing pg 2.490 on osd.56 to repair
>> >> >> >> >> >> > instructing pg 2.c4 on osd.56 to repair
>> >> >> >> >> >> >
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:51:2015-08-18 07:26:37.035910 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.00000000000004da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.0000000000007a65/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:52:2015-08-18 07:26:37.035960 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.000000000000522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.00000000000004da/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:53:2015-08-18 07:26:37.036133 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 a9b39490/rbd_data.12483d3ba0794b.00000000000037b3/head//2 expected clone fee49490/rbd_data.12483d3ba0794b.000000000000522f/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:54:2015-08-18 07:26:37.036243 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 bac19490/rbd_data.1238e82ae8944a.000000000000032e/head//2 expected clone a9b39490/rbd_data.12483d3ba0794b.00000000000037b3/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:55:2015-08-18 07:26:37.036289 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 98519490/rbd_data.123e9c2ae8944a.0000000000000807/head//2 expected clone bac19490/rbd_data.1238e82ae8944a.000000000000032e/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:56:2015-08-18 07:26:37.036314 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 c3c09490/rbd_data.1238e82ae8944a.0000000000000c2b/head//2 expected clone 98519490/rbd_data.123e9c2ae8944a.0000000000000807/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:57:2015-08-18 07:26:37.036363 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 28809490/rbd_data.edea7460fe42b.00000000000001d9/head//2 expected clone c3c09490/rbd_data.1238e82ae8944a.0000000000000c2b/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:58:2015-08-18 07:26:37.036432 7f94663b3700 -1 log_channel(cluster) log [ERR] : deep-scrub 2.490 e1509490/rbd_data.1423897545e146.00000000000009a6/head//2 expected clone 28809490/rbd_data.edea7460fe42b.00000000000001d9/141//2
>> >> >> >> >> >> > /var/log/ceph/ceph-osd.56.log:59:2015-08-18 07:26:38.548765 7f94663b3700 -1 log_channel(cluster) log [ERR] : 2.490 deep-scrub 17 errors
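[All the "expected clone" errors above follow the same pattern, so the affected rbd_data prefixes can be pulled out of the OSD log with a short shell snippet; the two sample lines below are copied from the log excerpt in this thread. Mapping a prefix back to an image would then go through "rbd info" and its block_name_prefix field, which is not shown here.]

```shell
# Sketch: extract the distinct rbd_data prefixes named in the "expected
# clone" errors.  On the cluster you would grep
# /var/log/ceph/ceph-osd.56.log directly instead of this sample file.
log=$(mktemp)
cat > "$log" <<'EOF'
log_channel(cluster) log [ERR] : deep-scrub 2.490 f5759490/rbd_data.1631755377d7e.00000000000004da/head//2 expected clone 90c59490/rbd_data.eb486436f2beb.0000000000007a65/141//2
log_channel(cluster) log [ERR] : deep-scrub 2.490 fee49490/rbd_data.12483d3ba0794b.000000000000522f/head//2 expected clone f5759490/rbd_data.1631755377d7e.00000000000004da/141//2
EOF
# The object named after "expected clone" is the one the OSD could not
# find; keep only its rbd_data.<id> prefix and de-duplicate.
prefixes=$(grep -o 'expected clone [0-9a-f]*/rbd_data\.[0-9a-f]*' "$log" \
  | sed 's/.*\(rbd_data\.[0-9a-f]*\).*/\1/' | sort -u)
echo "$prefixes"
rm -f "$log"
```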
>> >> >> >> >> >> >
>> >> >> >> >> >> > So, how can I solve the "expected clone" situation by hand?
>> >> >> >> >> >> > Thanks in advance!
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> >
>> >> >> >> >> >> > _______________________________________________
>> >> >> >> >> >> > ceph-users mailing list
>> >> >> >> >> >> > [email protected]
>> >> >> >> >> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >> >
>> >> >> >> >
>> >> >> >> >
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
>>
>
>
>


-- 

Andrija Panić