I'd like to add that I have "blocked requests" that are around 10,000 seconds 
old. 

Can I just force the PGs online, data or no data? 
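In case it helps anyone answering: here is roughly what I'm looking at and what I'm considering. This is a sketch, not something I've run yet — my understanding is that in 0.67 (Dumpling) `force_create_pg` recreates the PG empty, permanently abandoning whatever objects it held, so it would be a last resort:

```shell
# See which requests are blocked and which PGs are stuck
ceph health detail
ceph pg dump_stuck inactive

# Last resort: recreate a stuck PG as an empty PG.
# WARNING: this abandons any data the PG held.
ceph pg force_create_pg 4.1ca
```

(4.1ca is one of my four stuck PGs, from the query below.)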

----- Original Message -----

From: "Bryan Morris" <[email protected]> 
To: [email protected] 
Sent: Tuesday, January 7, 2014 5:44:06 PM 
Subject: Re: [ceph-users] 4 PGs stuck inactive 

Thanks in advance for your help and time. 

ceph 0.67.4 

It's been a long battle, and I don't really know what I'm doing (this is my 
first Ceph cluster, ever), so here goes. 

I've been testing with btrfs and have been having lots of problems with it, but 
Ceph was resilient and kept working. (I still think it's great.) 

Everything was running smoothly until one day... Here is the play-by-play: 


    1. Had osd.0 (btrfs) and osd.1 (btrfs). 
    2. Added a new osd.2 (xfs). 
    3. Rebooted osd.0 during the rebalance. 
    4. osd.1 self-imploded :( (btrfs totally corrupted; it's gone, all gone). 
    5. osd.0 came back online (it took 20 minutes to come back up; btrfs runs 
very poorly for me). 
    6. 4 PGs stuck; did lots of googling. 
    7. Tried "ceph pg {pg-id} mark_unfound_lost revert"; it says there are no 
missing objects. 
    8. Tried "ceph osd lost {OSD ID}". 
    9. Rebuilt osd.1 on XFS. 
    10. Rebalanced. 
    11. Current setup at this point: osd.0 (btrfs), osd.1 (xfs), osd.2 (xfs); 
PGs are still stuck. 
    12. Tried rebooting osd.1 and osd.2 (read somewhere that can help). 
    13. Shut down osd.0, planning to move it to XFS (I'm done with btrfs). 
    14. pg query shows "down_osds_we_would_probe": [0], so I tried marking 
osd.0 as lost. 
    15. PGs are still stuck. 
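For reference, the exact commands I ran in steps 7, 8, and 14, using one of my stuck PGs (4.1ca) as the example (the `--yes-i-really-mean-it` flag is required on `osd lost` because it can lose data):

```shell
# Step 7: try to revert unfound objects on a stuck PG
ceph pg 4.1ca mark_unfound_lost revert

# Steps 8 and 14: tell the cluster the dead OSD is gone for good
ceph osd lost 0 --yes-i-really-mean-it

# Step 14: inspect the PG's peering state,
# including "down_osds_we_would_probe"
ceph pg 4.1ca query
```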


From: "Gregory Farnum" <[email protected]> 
To: "Bryan Morris" <[email protected]> 
Cc: [email protected] 
Sent: Tuesday, January 7, 2014 3:50:55 PM 
Subject: Re: [ceph-users] 4 PGs stuck inactive 

Oh, sorry, you did do that. Hrm. 

What osdmap epoch did your lost node (0, I assume) disappear in? What 
version of Ceph are you running? That pg stat isn't making a lot of 
sense to me. 
Software Engineer #42 @ http://inktank.com | http://ceph.com 
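[Editor's note: one way to dig up the epoch Greg is asking about might be something like the following untested sketch; the epoch number 7971 is just the `mapping_epoch` from the pg query below, used as an example.]

```shell
# Current osdmap epoch and osd.0's state; the osd.0 line includes
# its last_clean_interval [first,last)
ceph osd dump | head -1
ceph osd dump | grep '^osd.0'

# A specific historical osdmap can also be fetched and inspected
ceph osd getmap 7971 -o /tmp/osdmap.7971
osdmaptool --print /tmp/osdmap.7971 | grep osd.0
```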


On Tue, Jan 7, 2014 at 2:45 PM, Gregory Farnum <[email protected]> wrote: 
> Assuming the one who lost its filesystem is totally gone, mark it 
> lost. That will tell the OSDs to give up on whatever data it might 
> have had and you should be good to go (modulo whatever data you might 
> have lost from only having it on the dead OSD during the reboot). 
> -Greg 
> Software Engineer #42 @ http://inktank.com | http://ceph.com 
> 
> 
> On Tue, Jan 7, 2014 at 1:32 PM, Bryan Morris <[email protected]> wrote: 
>> 
>> I have 4 PGs stuck inactive..... 
>> 
>> I have tried everything I can find online, and could use some help. 
>> 
>> I had 3 OSDs... was in process of rebooting one and another totally crashed 
>> and corrupted its filesystem (BAD). 
>> 
>> So now I have 4 incomplete PGs. 
>> 
>> Things I've tried: 
>> 
>> Rebooting all OSDs. 
>> Running 'ceph pg {pg-id} mark_unfound_lost revert' 
>> Running 'ceph osd lost {OSD ID}' 
>> 
>> 
>> Here is a PG query from one of the 4 PGs 
>> 
>> 
>> 
>> [[{ "state": "down+incomplete", 
>> "epoch": 7994, 
>> "up": [ 
>> 1, 
>> 2], 
>> "acting": [ 
>> 1, 
>> 2], 
>> "info": { "pgid": "4.1ca", 
>> "last_update": "0'0", 
>> "last_complete": "0'0", 
>> "log_tail": "0'0", 
>> "last_backfill": "MAX", 
>> "purged_snaps": "[]", 
>> "history": { "epoch_created": 75, 
>> "last_epoch_started": 6719, 
>> "last_epoch_clean": 6710, 
>> "last_epoch_split": 0, 
>> "same_up_since": 7973, 
>> "same_interval_since": 7973, 
>> "same_primary_since": 7969, 
>> "last_scrub": "6239'25314", 
>> "last_scrub_stamp": "2014-01-05 18:48:01.122717", 
>> "last_deep_scrub": "6235'25313", 
>> "last_deep_scrub_stamp": "2014-01-03 18:47:48.390112", 
>> "last_clean_scrub_stamp": "2014-01-05 18:48:01.122717"}, 
>> "stats": { "version": "0'0", 
>> "reported_seq": "1259", 
>> "reported_epoch": "7994", 
>> "state": "down+incomplete", 
>> "last_fresh": "2014-01-07 14:04:36.269587", 
>> "last_change": "2014-01-07 14:04:36.269587", 
>> "last_active": "0.000000", 
>> "last_clean": "0.000000", 
>> "last_became_active": "0.000000", 
>> "last_unstale": "2014-01-07 14:04:36.269587", 
>> "mapping_epoch": 7971, 
>> "log_start": "0'0", 
>> "ondisk_log_start": "0'0", 
>> "created": 75, 
>> "last_epoch_clean": 6710, 
>> "parent": "0.0", 
>> "parent_split_bits": 0, 
>> "last_scrub": "6239'25314", 
>> "last_scrub_stamp": "2014-01-05 18:48:01.122717", 
>> "last_deep_scrub": "6235'25313", 
>> "last_deep_scrub_stamp": "2014-01-03 18:47:48.390112", 
>> "last_clean_scrub_stamp": "2014-01-05 18:48:01.122717", 
>> "log_size": 0, 
>> "ondisk_log_size": 0, 
>> "stats_invalid": "0", 
>> "stat_sum": { "num_bytes": 0, 
>> "num_objects": 0, 
>> "num_object_clones": 0, 
>> "num_object_copies": 0, 
>> "num_objects_missing_on_primary": 0, 
>> "num_objects_degraded": 0, 
>> "num_objects_unfound": 0, 
>> "num_read": 0, 
>> "num_read_kb": 0, 
>> "num_write": 0, 
>> "num_write_kb": 0, 
>> "num_scrub_errors": 0, 
>> "num_shallow_scrub_errors": 0, 
>> "num_deep_scrub_errors": 0, 
>> "num_objects_recovered": 0, 
>> "num_bytes_recovered": 0, 
>> "num_keys_recovered": 0}, 
>> "stat_cat_sum": {}, 
>> "up": [ 
>> 1, 
>> 2], 
>> "acting": [ 
>> 1, 
>> 2]}, 
>> "empty": 1, 
>> "dne": 0, 
>> "incomplete": 0, 
>> "last_epoch_started": 0}, 
>> "recovery_state": [ 
>> { "name": "Started\/Primary\/Peering", 
>> "enter_time": "2014-01-07 14:04:36.269509", 
>> "past_intervals": [ 
>> { "first": 5221, 
>> "last": 5226, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 0, 
>> 1], 
>> "acting": [ 
>> 0, 
>> 1]}, 
>> { "first": 5227, 
>> "last": 5439, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 5440, 
>> "last": 5441, 
>> "maybe_went_rw": 0, 
>> "up": [ 
>> 1], 
>> "acting": [ 
>> 1]}, 
>> { "first": 5442, 
>> "last": 6388, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 6389, 
>> "last": 6395, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 0], 
>> "acting": [ 
>> 0]}, 
>> { "first": 6396, 
>> "last": 6708, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 6709, 
>> "last": 6717, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1], 
>> "acting": [ 
>> 1]}, 
>> { "first": 6718, 
>> "last": 6721, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 2], 
>> "acting": [ 
>> 1, 
>> 2]}, 
>> { "first": 6722, 
>> "last": 6724, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 2], 
>> "acting": [ 
>> 2]}, 
>> { "first": 6725, 
>> "last": 6737, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 0, 
>> 2], 
>> "acting": [ 
>> 0, 
>> 2]}, 
>> { "first": 6738, 
>> "last": 7946, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 7947, 
>> "last": 7948, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1], 
>> "acting": [ 
>> 1]}, 
>> { "first": 7949, 
>> "last": 7950, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 7951, 
>> "last": 7952, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 0], 
>> "acting": [ 
>> 0]}, 
>> { "first": 7953, 
>> "last": 7954, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 7955, 
>> "last": 7956, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1], 
>> "acting": [ 
>> 1]}, 
>> { "first": 7957, 
>> "last": 7958, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 2], 
>> "acting": [ 
>> 1, 
>> 2]}, 
>> { "first": 7959, 
>> "last": 7966, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 7967, 
>> "last": 7968, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 0], 
>> "acting": [ 
>> 0]}, 
>> { "first": 7969, 
>> "last": 7970, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1, 
>> 0], 
>> "acting": [ 
>> 1, 
>> 0]}, 
>> { "first": 7971, 
>> "last": 7972, 
>> "maybe_went_rw": 1, 
>> "up": [ 
>> 1], 
>> "acting": [ 
>> 1]}], 
>> "probing_osds": [ 
>> 1, 
>> 2], 
>> "down_osds_we_would_probe": [ 
>> 0], 
>> "peering_blocked_by": []}, 
>> { "name": "Started", 
>> "enter_time": "2014-01-07 14:04:36.269462"}]} 
>> 
>> 
>> 
>> CEPH OSD TREE 
>> 
>> ceph osd tree 
>> # id weight type name up/down reweight 
>> -1 4.5 pool default 
>> -4 1.8 host ceph2 
>> 1 1.8 osd.1 up 1 
>> -6 2.7 host ceph3 
>> 2 2.7 osd.2 up 1 
>> 
>> 
>> 
>> _______________________________________________ 
>> ceph-users mailing list 
>> [email protected] 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 

