Re: [ceph-users] Troubleshooting Incomplete PGs

Chris Kitzmiller Wed, 22 Oct 2014 17:08:54 -0700

On Oct 22, 2014, at 7:51 PM, Craig Lewis wrote:
> On Wed, Oct 22, 2014 at 3:09 PM, Chris Kitzmiller <[email protected]> 
> wrote:
>> On Oct 22, 2014, at 1:50 PM, Craig Lewis wrote:
>>> Incomplete means "Ceph detects that a placement group is missing a 
>>> necessary period of history from its log. If you see this state, report a 
>>> bug, and try to start any failed OSDs that may contain the needed 
>>> information".
>>> 
>>> In the PG query, it lists some OSDs that it's trying to probe:
>>>           "probing_osds": [
>>>                 "10",
>>>                 "13",
>>>                 "15",
>>>                 "25"],
>>>           "down_osds_we_would_probe": [],
>>> 
>>> Is one of those the OSD you replaced?  If so, you might try ceph pg {pg-id} 
>>> mark_unfound_lost revert|delete.  That command will lose data; it tells 
>>> Ceph to give up looking for data that it can't find, so you might want to 
>>> wait a bit.
>> 
>> Yes. osd.10 was the OSD I replaced. :( I suspect that I didn't actually have 
>> any writes during this time and that a revert might leave me in an OK place.
>> 
>> Looking at the query more closely I see that all of the peers (except 
>> osd.10) have the same value for 
>> last_update/last_complete/last_scrub/last_deep_scrub except that the peer 
>> entry on osd.10 has 0 values for everything. It's as if all my OSDs are 
>> believing in the ghost of this PG on osd.10. I'd like to revert; I just want 
>> to make sure that I'm going to revert to the sane value and not the 0 value.
> 
> I've never (successfully) used mark_unfound_lost, so I can't say exactly 
> what'll happen.  revert should be what you need, but I don't know if it's 
> going to revert to the point in time before whatever hole in the history 
> happened, or if it will just give up on the portions of history that it 
> doesn't have.
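
For anyone wanting to do the peer comparison described above without eyeballing the JSON, here is a rough sketch. It assumes the "peer_info" section of `ceph pg {pg-id} query` has been loaded into a list of dicts with "peer" and "last_update" keys; that layout is an assumption, and the version strings in the sample are made up for illustration, not taken from my cluster.

```python
from collections import defaultdict


def group_peers_by_last_update(peer_info):
    """Map each reported last_update value to the peers that report it."""
    groups = defaultdict(list)
    for peer in peer_info:
        groups[peer["last_update"]].append(peer["peer"])
    return dict(groups)


def odd_peers(peer_info):
    """Return the peers whose last_update differs from the most common value."""
    groups = group_peers_by_last_update(peer_info)
    majority = max(groups, key=lambda value: len(groups[value]))
    return sorted(
        peer
        for value, peers in groups.items()
        if value != majority
        for peer in peers
    )


# Hypothetical sample: three peers agree, one reports an empty history.
sample_peer_info = [
    {"peer": "13", "last_update": "100'200"},
    {"peer": "15", "last_update": "100'200"},
    {"peer": "25", "last_update": "100'200"},
    {"peer": "10", "last_update": "0'0"},
]

print(odd_peers(sample_peer_info))
```

The same grouping could be repeated for last_complete, last_scrub and last_deep_scrub to confirm that only one peer disagrees.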


Huh. So I tried `ceph pg 3.222 mark_unfound_lost revert` and it told me "pg has 
no unfound objects". Indeed, the query shows: "num_objects_unfound": 0,
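
A small sketch of checking the unfound count before reaching for mark_unfound_lost. It assumes `ceph pg {pg-id} query` prints JSON and that the primary's counters live under info -> stats -> stat_sum; that key path is an assumption on my part, so adjust it to whatever your version actually emits. The helper names are hypothetical.

```python
import json
import subprocess


def query_pg(pgid):
    """Run `ceph pg <pgid> query` and parse its JSON output."""
    out = subprocess.check_output(["ceph", "pg", pgid, "query"])
    return json.loads(out)


def unfound_count(pg_query):
    """Read num_objects_unfound from a parsed query (key path is assumed)."""
    return pg_query["info"]["stats"]["stat_sum"]["num_objects_unfound"]


def has_unfound_objects(pg_query):
    """True only if the PG reports at least one unfound object."""
    return unfound_count(pg_query) > 0


# Usage on a live cluster (not run here):
#   if has_unfound_objects(query_pg("3.222")):
#       print("PG reports unfound objects")
```

With a count of 0, as in my case, there is nothing for mark_unfound_lost to act on, which matches the error message.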

One of the peers, osd.25 (which isn't in the acting set now but was up+in the 
whole time), reports:

        "stat_sum": { "num_bytes": 7080120320,
                "num_objects": 1697,
                "num_object_clones": 0,
                "num_object_copies": 3394,
                "num_objects_missing_on_primary": 0,
                "num_objects_degraded": 0,
                "num_objects_unfound": 0,
                "num_objects_dirty": 1697,
                "num_whiteouts": 0,
                "num_read": 72828,
                "num_read_kb": 8794722,
                "num_write": 32405,
                "num_write_kb": 11424120,
                "num_scrub_errors": 0,
                "num_shallow_scrub_errors": 0,
                "num_deep_scrub_errors": 0,
                "num_objects_recovered": 1687,
                "num_bytes_recovered": 7038177280,
                "num_keys_recovered": 0,
                "num_objects_omap": 0,
                "num_objects_hit_set_archive": 0},

So, is it the 10 objects that are dirty but not recovered (1697 dirty vs. 1687 
recovered) that are giving me trouble? What can be done to correct these PGs?
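
For reference, this is the arithmetic behind the "10 objects" figure, using the osd.25 stat_sum values quoted above. It is only a sketch: the function name is hypothetical, and whether the gap between the dirty and recovered counters is actually what keeps the PG incomplete is not established here.

```python
# Subset of the stat_sum reported by osd.25, copied from the output above.
stat_sum = {
    "num_bytes": 7080120320,
    "num_objects": 1697,
    "num_objects_dirty": 1697,
    "num_objects_recovered": 1687,
    "num_bytes_recovered": 7038177280,
}


def recovery_gap(stats):
    """Return (objects, bytes) counted as dirty/present but not recovered."""
    objects = stats["num_objects_dirty"] - stats["num_objects_recovered"]
    nbytes = stats["num_bytes"] - stats["num_bytes_recovered"]
    return objects, nbytes


objects, nbytes = recovery_gap(stat_sum)
print(objects, nbytes)          # 10 41943040
print(nbytes // objects)        # 4194304 bytes per object, i.e. 4 MiB
```

The byte gap works out to exactly 10 x 4 MiB, so the two counters at least agree with each other about 10 whole objects.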