Shot in the dark: try manually deep-scrubbing the PG.  You could also try
marking various OSDs out, in an attempt to get the acting set to include
osd.25 again, then run the deep-scrub again.  That probably won't help
though, because the pg query says it probed osd.25 already... actually, it
doesn't: osd.25 is in "probing_osds", not "probed_osds".  The deep-scrub
might move things along.
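For reference, a rough sketch of those commands (the PG id 3.222 is taken from later in this thread; the OSD id here is just an example, not one of yours specifically):

    # Ask the primary to deep-scrub the PG (it may take a while to be scheduled)
    ceph pg deep-scrub 3.222

    # Mark an OSD out without stopping the daemon, to force a new acting set
    ceph osd out 13

    # Watch peering/backfill progress, then re-check the PG
    ceph -w
    ceph pg 3.222 query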


Re-reading your original post: if you marked the slow OSDs out but left
them running, you should not have lost data.

If the scrubs don't help, it's probably time to hop on IRC.
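If it comes to that, having the relevant state on hand will speed things up; something like:

    # Overall health plus detail on stuck/incomplete PGs
    ceph health detail
    ceph pg dump_stuck inactive

    # Full peering info for the problem PG, including probing_osds
    ceph pg 3.222 query > pg-3.222-query.txt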




On Wed, Oct 22, 2014 at 5:08 PM, Chris Kitzmiller <[email protected]> wrote:

> On Oct 22, 2014, at 7:51 PM, Craig Lewis wrote:
> > On Wed, Oct 22, 2014 at 3:09 PM, Chris Kitzmiller <
> [email protected]> wrote:
> >> On Oct 22, 2014, at 1:50 PM, Craig Lewis wrote:
> >>> Incomplete means "Ceph detects that a placement group is missing a
> necessary period of history from its log. If you see this state, report a
> bug, and try to start any failed OSDs that may contain the needed
> information".
> >>>
> >>> In the PG query, it lists some OSDs that it's trying to probe:
> >>>           "probing_osds": [
> >>>                 "10",
> >>>                 "13",
> >>>                 "15",
> >>>                 "25"],
> >>>           "down_osds_we_would_probe": [],
> >>>
> >>> Is one of those the OSD you replaced?  If so, you might try ceph pg
> {pg-id} mark_unfound_lost revert|delete.  That command will lose data; it
> tells Ceph to give up looking for data that it can't find, so you might
> want to wait a bit.
> >>
> >> Yes. osd.10 was the OSD I replaced. :( I suspect that I didn't actually
> have any writes during this time and that a revert might leave me in an OK
> place.
> >>
> >> Looking at the query more closely I see that all of the peers (except
> osd.10) have the same value for
> last_update/last_complete/last_scrub/last_deep_scrub except that the peer
> entry on osd.10 has 0 values for everything. It's as if all my OSDs still
> believe in the ghost of this PG on osd.10. I'd like to revert; I just want
> to make sure that I'm going to revert to the sane value and not the 0 value.
> >
> > I've never (successfully) used mark_unfound_lost, so I can't say exactly
> what'll happen.  revert should be what you need, but I don't know if it's
> going to revert to the point in time before whatever hole in the history
> happened, or if it will just give up on the portions of history that it
> doesn't have.
>
> Huh. So I tried `ceph pg 3.222 mark_unfound_lost revert` and it told me
> "pg has no unfound objects" and indeed: "num_objects_unfound": 0,
>
> On one of the peers, osd.25 (which isn't in the acting set now and was
> up+in the whole time) it reports:
>
>         "stat_sum": { "num_bytes": 7080120320,
>                 "num_objects": 1697,
>                 "num_object_clones": 0,
>                 "num_object_copies": 3394,
>                 "num_objects_missing_on_primary": 0,
>                 "num_objects_degraded": 0,
>                 "num_objects_unfound": 0,
>                 "num_objects_dirty": 1697,
>                 "num_whiteouts": 0,
>                 "num_read": 72828,
>                 "num_read_kb": 8794722,
>                 "num_write": 32405,
>                 "num_write_kb": 11424120,
>                 "num_scrub_errors": 0,
>                 "num_shallow_scrub_errors": 0,
>                 "num_deep_scrub_errors": 0,
>                 "num_objects_recovered": 1687,
>                 "num_bytes_recovered": 7038177280,
>                 "num_keys_recovered": 0,
>                 "num_objects_omap": 0,
>                 "num_objects_hit_set_archive": 0},
>
> So, is it the 10 objects which are dirty but not recovered which are
> giving me trouble? What can be done to correct these PGs?
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
