Greg,

Thanks for providing this background on the incomplete state.

With that context, and a little more digging online and in our
environment, I was able to resolve the issue.  My cluster is back to
HEALTH_OK.

The key to fixing the incomplete state was the information provided by
pg query.  I did not have to change the min_size setting.  In addition
to your comments, these two references were helpful.

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-August/042102.html
http://tracker.ceph.com/issues/5226
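
For anyone who does end up needing the min_size route Greg describes
below, the pool setting can be checked and adjusted with something along
these lines (here <poolname> is a placeholder for whichever pool the pg
belongs to; set min_size back to its original value once backfill
completes):

    ceph osd pool get <poolname> min_size
    ceph osd pool set <poolname> min_size 1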


The tail of `ceph pg 3.ea query` showed there were a number of OSDs
involved in servicing the backfill.

 "probing_osds": [
                10,
                11,
                30,
                37,
                39,
                54],
          "down_osds_we_would_probe": [],
          "peering_blocked_by": []},
        { "name": "Started",
          "enter_time": "2015-10-21 14:39:13.824613"}]}

After checking all the OSDs, I confirmed that only osd.11 had the pg
data; all the rest had an empty directory for pg 3.ea.  Because osd.10
was listed first and had only an empty copy of the pg, my assumption was
that it was blocking the backfill.  I stopped osd.10 briefly and the
state of pg 3.ea immediately entered
"active+degraded+remapped+backfilling"; once the backfill was under way
I started osd.10 again (the commands are sketched below, after the query
output).  Notably, osd.11 became the primary (as desired) and began
backfilling osd.30:

 { "state": "active+degraded+remapped+backfilling",
  "up": [
        30,
        11],
  "acting": [
        11,
        30],

osd.10 was no longer holding up the start of the backfill operation:

"recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2015-10-22 12:46:50.907955",
          "might_have_unfound": [
                { "osd": 10,
                  "status": "not queried"}],
          "recovery_progress": { "backfill_target": 30,
              "waiting_on_backfill": 0,

Based on the steps that triggered the original incomplete state, my
guess is that when I took osd.30 down and out to reformat it, a number
of alternates (including osd.10) were mapped as backfill targets for the
pg.  Those backfills didn't have a chance to start before osd.30's
reformat completed and it rejoined the cluster.  At that point pg 3.ea
was remapped again, leaving osd.10 at the top of the list.  Since it had
no data, osd.10 blocked the backfill from osd.11.

Not sure if that was the exact cause, but it makes some sense.

Thanks again for pointing me in a useful direction.

~jpr

On 10/21/2015 03:01 PM, Gregory Farnum wrote:
> I don't remember the exact timeline, but min_size is designed to
> prevent data loss from under-replicated objects (ie, if you only have
> 1 copy out of 3 and you lose that copy, you're in trouble, so maybe
> you don't want it to go active). Unfortunately it could also prevent
> the OSDs from replicating/backfilling the data to new OSDs in the case
> where you only had one copy left — that's fixed now, but wasn't
> initially. And in that case it reported the PG as incomplete (in later
> versions, PGs in this state get reported as undersized).
>
> So if you drop the min_size to 1, it will allow new writes to the PG
> (which might not be great), but it will also let the OSD go into the
> backfilling state. (At least, assuming the number of replicas is the
> only problem.). Based on your description of the problem I think this
> is the state you're in, and decreasing min_size is the solution.
> *shrug*
> You could also try and do something like extracting the PG from osd.11
> and copying it to osd.30, but that's quite tricky without the modern
> objectstore tool stuff, and I don't know if any of that works on
> dumpling (which it sounds like you're on — incidentally, you probably
> want to upgrade from that).
> -Greg
>
> On Wed, Oct 21, 2015 at 12:55 PM, John-Paul Robinson <j...@uab.edu> wrote:
>> Greg,
>>
>> Thanks for the insight.  I suspect things are somewhat sane given that I
>> did erase the primary (osd.30) and the secondary (osd.11) still contains
>> pg data.
>>
>> If I may, could you clarify the process of backfill a little?
>>
>> I understand the min_size allows I/O on the object to resume while there
>> are only that many replicas (ie. 1 once changed) and this would let
>> things move forward.
>>
>> I would expect, however, that some backfill would already be on-going
>> for pg 3.ea on osd.30.  As far as I can tell, there isn't anything
>> happening.  The pg 3.ea directory is just as empty today as it was
>> yesterday.
>>
>> Will changing the min_size actually trigger backfill to begin for an
>> object if it has stalled or never got started?
>>
>> An alternative idea I had was to take osd.30 back out of the cluster so
>> that pg 3.ea [30,11] would get mapped to some other osd to maintain
>> replication.  This seems a bit heavy handed though, given that only this
>> one pg is affected.
>>
>> Thanks for any follow up.
>>
>> ~jpr
>>
>>
>> On 10/21/2015 01:21 PM, Gregory Farnum wrote:
>>> On Tue, Oct 20, 2015 at 7:22 AM, John-Paul Robinson <j...@uab.edu> wrote:
>>>> Hi folks
>>>>
>>>> I've been rebuilding drives in my cluster to add space.  This has gone
>>>> well so far.
>>>>
>>>> After the last batch of rebuilds, I'm left with one placement group in
>>>> an incomplete state.
>>>>
>>>> [sudo] password for jpr:
>>>> HEALTH_WARN 1 pgs incomplete; 1 pgs stuck inactive; 1 pgs stuck unclean
>>>> pg 3.ea is stuck inactive since forever, current state incomplete, last
>>>> acting [30,11]
>>>> pg 3.ea is stuck unclean since forever, current state incomplete, last
>>>> acting [30,11]
>>>> pg 3.ea is incomplete, acting [30,11]
>>>>
>>>> I've restarted both OSD a few times but it hasn't cleared the error.
>>>>
>>>> On the primary I see errors in the log related to slow requests:
>>>>
>>>> 2015-10-20 08:40:36.678569 7f361585c700  0 log [WRN] : 8 slow requests,
>>>> 3 included below; oldest blocked for > 31.922487 secs
>>>> 2015-10-20 08:40:36.678580 7f361585c700  0 log [WRN] : slow request
>>>> 31.531606 seconds old, received at 2015-10-20 08:40:05.146902:
>>>> osd_op(client.158903.1:343217143 rb.0.25cf8.238e1f29.00000000a044 [read
>>>> 1064960~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>>> 2015-10-20 08:40:36.678592 7f361585c700  0 log [WRN] : slow request
>>>> 31.531591 seconds old, received at 2015-10-20 08:40:05.146917:
>>>> osd_op(client.158903.1:343217144 rb.0.25cf8.238e1f29.00000000a044 [read
>>>> 2113536~262144] 3.ae9968ea RETRY) v4 currently reached pg
>>>> 2015-10-20 08:40:36.678599 7f361585c700  0 log [WRN] : slow request
>>>> 31.531551 seconds old, received at 2015-10-20 08:40:05.146957:
>>>> osd_op(client.158903.1:343232634 ekessler-default.rbd [watch 35~0]
>>>> 3.e4bd50ea) v4 currently reached pg
>>>>
>>>> Notes online suggest this is an issue with the journal and that it may
>>>> be possible to export and rebuild the pg.  I don't have firefly.
>>>>
>>>> https://ceph.com/community/incomplete-pgs-oh-my/
>>>>
>>>> Interestingly, pg 3.ea appears to be complete on osd.11 (the secondary)
>>>> but missing entirely on osd.30 (the primary).
>>>>
>>>> on osd.30 (primary):
>>>>
>>>> crowbar@da0-36-9f-0e-2b-88:~$ du -sk
>>>> /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>>> 0       /var/lib/ceph/osd/ceph-30/current/3.ea_head/
>>>>
>>>> on osd.11 (secondary):
>>>>
>>>> crowbar@da0-36-9f-0e-2b-40:~$ du -sh
>>>> /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>>> 63G     /var/lib/ceph/osd/ceph-11/current/3.ea_head/
>>>>
>>>> This makes some sense since, my disk drive rebuilding activity
>>>> reformatted the primary osd.30.  It also gives me some hope that my data
>>>> is not lost.
>>>>
>>>> I understand incomplete means problem with journal, but is there a way
>>>> to dig deeper into this or possible to get the secondary's data to take
>>>> over?
>>> If you're running an older version of Ceph (Firefly or earlier,
>>> maybe?), "incomplete" can also mean "not enough replicas". It looks
>>> like that's what you're hitting here, if osd.11 is not reporting any
>>> issues. If so, simply setting the min_size on this pool to 1 until the
>>> backfilling is done should let you get going.
>>> -Greg
