On Wed, Aug 31, 2016 at 2:11 PM, Goncalo Borges
<[email protected]> wrote:
> Hi Brad...
>
> Thanks for the feedback. I think we are making some progress.
>
> I have opened the following tracker issue: 
> http://tracker.ceph.com/issues/17177 .
>
> There I give pointers for all the logs, namely the result of the pg query and 
> all osd logs after increasing the log levels (debug_ms=1, debug_filestore=20 
> and debug_osd=30) during a manual deep-scrub operation of the inconsistent pg 
> (which by the way went fine).
>
> Regarding the question of why this is happening, I do not know. We are 
> running the same version everywhere (including when the server hosting osd.78 
> was brought back into production). We never saw this in Infernalis, and since 
> we upgraded to Jewel, it has already happened more than once. Another reason 
> why we could be seeing the issue just now is that, only in Jewel, we are 
> massively increasing the number of osd servers. In Infernalis the setup was 
> quite stable during the whole time.
>
> Regarding understanding which OSD has the bad data, I think that we have 
> enough evidence to say that it is the primary (78), i.e.:
> - the affected object in the peers has the oldest (and same) timestamp,
> - the pg migrated recently to osd.78, previous deep scrubs (prior to osd.78 
> becoming the primary) went ok, and the information you pointed out in the pg 
> query result seems to point to inconsistencies between the peers and the 
> primary at the time osd.78 became the primary.
>
> Also, after diving into the logs of the manual deep scrub, I found the 
> following ERANGE message in the peers' osd logs but nothing in the primary 
> osd log. This message is emitted after a getattrs operation on the object. 
> The relevant extract of the logs for all osds follows after the email.
>
> 2016-08-31 00:55:01.444953 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, len is 208
> 2016-08-31 00:55:01.444964 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, got 104

This raises a possible theory.

Can I see xfs_info from each of the three OSD filesystems please?
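
One way to gather that output is sketched below. It assumes each OSD data directory (the /var/lib/ceph/osd/ceph-<id> paths seen in the logs) is itself the XFS mount point, and that the script is run on each OSD host in turn, since the three OSDs need not share a server. The helper names are hypothetical and only for illustration.

```python
import os
import subprocess

# Acting set [78,59,49] as reported in the pg log lines.
OSD_IDS = (78, 59, 49)


def osd_mount_point(osd_id):
    """Data directory of an OSD, assumed here to be its XFS mount point."""
    return f"/var/lib/ceph/osd/ceph-{osd_id}"


def xfs_info_command(osd_id):
    """Build the xfs_info invocation for one OSD filesystem."""
    return ["xfs_info", osd_mount_point(osd_id)]


def collect_xfs_info(osd_ids=OSD_IDS):
    """Run xfs_info for every listed OSD that is mounted on this host."""
    results = {}
    for osd_id in osd_ids:
        if not os.path.ismount(osd_mount_point(osd_id)):
            continue  # this OSD lives on another host (or is not mounted)
        proc = subprocess.run(
            xfs_info_command(osd_id),
            capture_output=True,
            text=True,
            check=False,
        )
        # Keep stderr on failure so the reason is visible in the report.
        results[osd_id] = proc.stdout if proc.returncode == 0 else proc.stderr
    return results


if __name__ == "__main__":
    for osd_id, output in collect_xfs_info().items():
        print(f"=== osd.{osd_id} ===")
        print(output)
```

OSDs that are not mounted on the local host are skipped silently, so the same script can be run unchanged on every server.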

>
> So it seems the problem may lie in the extended attributes of the object, 
> which were not replicated properly.
>
> Now that I (think I) know that the primary is wrong, I do not want to use a 
> blind 'ceph repair'. However, this raises another question: Can I simply 
> manually delete the problematic object in osd.78 and trigger a ceph repair 
> afterwards (as described here: 
> http://ceph.com/planet/ceph-manually-repair-object/ )? Since we are talking 
> about the cephfs metadata pool, which produces 0-size objects and makes heavy 
> use of omap information, I am just wondering if that procedure should be the 
> same in this case.
>
> Cheers
> Goncalo
>
>
> =======
>
> PRIMARY OSD 78:
>
> 2016-08-31 00:55:01.404186 7f8b2f8f6700 10 
> filestore(/var/lib/ceph/osd/ceph-78) stat 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0 (size 0)
> 2016-08-31 00:55:01.404194 7f8b2f8f6700 15 
> filestore(/var/lib/ceph/osd/ceph-78) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head#
> 2016-08-31 00:55:01.404274 7f8b2f8f6700 20 
> filestore(/var/lib/ceph/osd/ceph-78) fgetattrs 394 getting '_'
> 2016-08-31 00:55:01.404292 7f8b2f8f6700 20 
> filestore(/var/lib/ceph/osd/ceph-78) fgetattrs 394 getting '_parent'
> 2016-08-31 00:55:01.404302 7f8b2f8f6700 20 
> filestore(/var/lib/ceph/osd/ceph-78) fgetattrs 394 getting 'snapset'
> 2016-08-31 00:55:01.404309 7f8b2f8f6700 20 
> filestore(/var/lib/ceph/osd/ceph-78) fgetattrs 394 getting '_layout'
> 2016-08-31 00:55:01.404316 7f8b2f8f6700 10 
> filestore(/var/lib/ceph/osd/ceph-78) getattrs no xattr exists in object_map r 
> = 0
> 2016-08-31 00:55:01.404319 7f8b2f8f6700 10 
> filestore(/var/lib/ceph/osd/ceph-78) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0
> 2016-08-31 00:55:01.404358 7f8b2f8f6700 10 osd.78 pg_epoch: 23099 pg[5.3d0( v 
> 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 
> 22440/22440/0 19928/22439/22439) [78,59,49] r=0 lpr=22439 crt=23099'104736 
> lcod 23099'104737 mlcod 23099'104737 
> active+clean+scrubbing+deep+inconsistent] be_deep_scrub 
> 5:0bd6d154:::602.00000000:head seed 4294967295
>
> --- * ---
>
> PEER OSD 49
> 2016-08-31 00:55:01.444902 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49) stat 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0 (size 0)
> 2016-08-31 00:55:01.444909 7f2dcbaaa700 15 
> filestore(/var/lib/ceph/osd/ceph-49) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head#
> 2016-08-31 00:55:01.444953 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, len is 208
> 2016-08-31 00:55:01.444964 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, got 104
> 2016-08-31 00:55:01.444967 7f2dcbaaa700 20 
> filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_'
> 2016-08-31 00:55:01.444974 7f2dcbaaa700 20 
> filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_parent'
> 2016-08-31 00:55:01.444980 7f2dcbaaa700 20 
> filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting 'snapset'
> 2016-08-31 00:55:01.444986 7f2dcbaaa700 20 
> filestore(/var/lib/ceph/osd/ceph-49) fgetattrs 315 getting '_layout'
> 2016-08-31 00:55:01.444992 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49) getattrs no xattr exists in object_map r 
> = 0
> 2016-08-31 00:55:01.444994 7f2dcbaaa700 10 
> filestore(/var/lib/ceph/osd/ceph-49) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0
> 2016-08-31 00:55:01.444998 7f2dcbaaa700 10 osd.49 pg_epoch: 23099 pg[5.3d0( v 
> 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 
> 22440/22440/0 19928/22439/22439) [78,59,49] r=2 lpr=22439 pi=4173-22438/25 
> luod=0'0 crt=23099'104736 lcod 23099'104737 active] be_deep_scrub 
> 5:0bd6d154:::602.00000000:head seed 4294967295
>
> --- * ---
>
> PEER OSD 59
>
> 2016-08-31 00:55:01.417801 7f335510b700 10 
> filestore(/var/lib/ceph/osd/ceph-59) stat 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0 (size 0)
> 2016-08-31 00:55:01.417806 7f335510b700 15 
> filestore(/var/lib/ceph/osd/ceph-59) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head#
> 2016-08-31 00:55:01.417836 7f335510b700 10 
> filestore(/var/lib/ceph/osd/ceph-59)  -ERANGE, len is 208
> 2016-08-31 00:55:01.417843 7f335510b700 10 
> filestore(/var/lib/ceph/osd/ceph-59)  -ERANGE, got 104
> 2016-08-31 00:55:01.417845 7f335510b700 20 
> filestore(/var/lib/ceph/osd/ceph-59) fgetattrs 473 getting '_'
> 2016-08-31 00:55:01.417850 7f335510b700 20 
> filestore(/var/lib/ceph/osd/ceph-59) fgetattrs 473 getting '_parent'
> 2016-08-31 00:55:01.417856 7f335510b700 20 
> filestore(/var/lib/ceph/osd/ceph-59) fgetattrs 473 getting 'snapset'
> 2016-08-31 00:55:01.417861 7f335510b700 20 
> filestore(/var/lib/ceph/osd/ceph-59) fgetattrs 473 getting '_layout'
> 2016-08-31 00:55:01.417866 7f335510b700 10 
> filestore(/var/lib/ceph/osd/ceph-59) getattrs no xattr exists in object_map r 
> = 0
> 2016-08-31 00:55:01.417867 7f335510b700 10 
> filestore(/var/lib/ceph/osd/ceph-59) getattrs 
> 5.3d0_head/#5:0bd6d154:::602.00000000:head# = 0
> 2016-08-31 00:55:01.417870 7f335510b700 10 osd.59 pg_epoch: 23099 pg[5.3d0( v 
> 23099'104738 (23099'101639,23099'104738] local-les=22440 n=257 ec=339 les/c/f 
> 22440/22440/0 19928/22439/22439) [78,59,49] r=1 lpr=22439 pi=19928-22438/1 
> luod=0'0 crt=23099'104736 lcod 23099'104737 active] be_deep_scrub 
> 5:0bd6d154:::602.00000000:head seed 4294967295
>
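
The comparison made across the three extracts above (ERANGE on the peers, nothing on the primary) can be repeated on full logs with a small script. This is only a sketch: it assumes the log line layout shown in the extracts, strips the mail quoting, and rejoins entries that the mail client wrapped. The function names are hypothetical.

```python
import re
from collections import defaultdict

# A new log entry starts with a timestamp such as "2016-08-31 00:55:01.444953 ".
TIMESTAMP_RE = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ ")

# Matches the two ERANGE messages seen in the extracts.
ERANGE_RE = re.compile(
    r"filestore\(/var/lib/ceph/osd/ceph-(?P<osd>\d+)\)\s+-ERANGE, "
    r"(?P<kind>len is|got) (?P<value>\d+)"
)


def unwrap(text):
    """Drop '>' quote prefixes and rejoin entries wrapped over several lines."""
    entries = []
    for raw in text.splitlines():
        line = re.sub(r"^(?:>\s?)+", "", raw).strip()
        if not line:
            continue
        if TIMESTAMP_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += " " + line
    return entries


def erange_by_osd(entries):
    """Map OSD id -> list of (message kind, value) for every ERANGE entry."""
    found = defaultdict(list)
    for entry in entries:
        match = ERANGE_RE.search(entry)
        if match:
            found[int(match.group("osd"))].append(
                (match.group("kind"), int(match.group("value")))
            )
    return dict(found)


sample = """\
> 2016-08-31 00:55:01.404194 7f8b2f8f6700 15
> filestore(/var/lib/ceph/osd/ceph-78) getattrs
> 5.3d0_head/#5:0bd6d154:::602.00000000:head#
> 2016-08-31 00:55:01.444953 7f2dcbaaa700 10
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, len is 208
> 2016-08-31 00:55:01.444964 7f2dcbaaa700 10
> filestore(/var/lib/ceph/osd/ceph-49)  -ERANGE, got 104
"""

print(erange_by_osd(unwrap(sample)))
```

An OSD that never logged the message is simply absent from the result, which is how osd.78 would show up here.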



-- 
Cheers,
Brad
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
