Hi Sam,

Thank you for your careful inspection.

I reviewed the logs from that time and discovered that the cluster marked an OSD as failed just after I shut the first unit down. So, as you said, the pg can't finish peering because the second unit was then shut off suddenly. I appreciate your advice, but my goal is to keep the cluster working while 2 storage nodes are down.

The unexpected OSD failure was logged just as I shut the first unit down:

2017-01-10 12:30:07.905562 mon.1 172.20.1.3:6789/0 28484 : cluster [INF] osd.153 172.20.3.2:6810/26796 failed (2 reporters from different host after 20.072026 >= grace 20.000000)

But that OSD was not actually dead; more likely it was just slow to respond to heartbeats. I think increasing osd_heartbeat_grace may somewhat mitigate the issue.

Sincerely,
Craig Chi

On 2017-01-11 00:08, Samuel Just <[email protected]> wrote:
> {
>     "name": "Started\/Primary\/Peering",
>     "enter_time": "2017-01-10 13:43:34.933074",
>     "past_intervals": [
>         {
>             "first": 75858,
>             "last": 75860,
>             "maybe_went_rw": 1,
>             "up": [ 345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516 ],
>             "acting": [ 345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516 ],
>             "primary": 345,
>             "up_primary": 345
>         },
>
> Between 75858 and 75860, [345, 622, 685, 183, 792, 2147483647, 2147483647, 401, 516]
> was the acting set. The current acting set [345, 622, 685, 183, 2147483647,
> 2147483647, 153, 401, 516] needs *all 7* of the osds from epochs 75858 through
> 75860 to ensure that it has any writes completed during that time. You can make
> transient situations like that less of a problem by setting min_size to 8
> (though it'll prevent writes with 2 failures until backfill completes). A
> possible enhancement for an EC pool would be to gather the infos from those
> osds anyway and use that to rule out writes (if they actually happened, you'd
> still be stuck).
> -Sam
>
> On Tue, Jan 10, 2017 at 5:36 AM, Craig Chi <[email protected]> wrote:
>> Hi List,
>>
>> I am testing the stability of my Ceph cluster under power failure.
>>
>> I brutally powered off 2 Ceph units, each with 90 OSDs, while client
>> I/O was continuing.
>>
>> Since then, some of the pgs of my cluster are stuck in peering:
>>
>> pgmap v3261136: 17408 pgs, 4 pools, 176 TB data, 5082 kobjects
>> 236 TB used, 5652 TB / 5889 TB avail
>> 8563455/38919024 objects degraded (22.003%)
>> 13526 active+undersized+degraded
>> 3769 active+clean
>> 104 down+remapped+peering
>> 9 down+peering
>>
>> I queried a peering pg (all on an EC pool with 7+2) and got blocked
>> information (full query: http://pastebin.com/pRkaMG2h ):
>>
>> "probing_osds": [
>>     "153(6)",
>>     "183(3)",
>>     "345(0)",
>>     "401(7)",
>>     "516(8)",
>>     "622(1)",
>>     "685(2)"
>> ],
>> "blocked": "peering is blocked due to down osds",
>> "down_osds_we_would_probe": [
>>     792
>> ],
>> "peering_blocked_by": [
>>     {
>>         "osd": 792,
>>         "current_lost_at": 0,
>>         "comment": "starting or marking this osd lost may let us proceed"
>>     }
>> ]
>>
>> osd.792 is on one of the units I powered off, and I think the I/O
>> associated with this pg is paused too.
>>
>> I have checked the troubleshooting page on the Ceph website (
>> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
>> ); it says that starting the OSD or marking it lost can let the procedure
>> continue.
>>
>> I am sure that my cluster was healthy before the power outage occurred. I am
>> wondering: if a power outage really happens in a production environment, will
>> it also freeze my client I/O if I don't do anything? Since I only lost 2
>> redundancies (I have erasure code with 7+2), I think it should still serve
>> normal functionality.
>>
>> Or am I doing something wrong? Please give me some suggestions, thanks.
>>
>> Sincerely,
>> Craig Chi
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
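[Editor's note: the two mitigations discussed in this thread could be applied roughly as below. This is an untested sketch against a live cluster; "mypool" is a placeholder pool name, and the grace value of 45 is illustrative, not a tuned recommendation.]

```shell
# Require k+1 = 8 shards for writes on the 7+2 EC pool, as Sam suggested.
# Writes stop after 2 shard failures, but transient peering gaps like the
# one above become recoverable. "mypool" is a placeholder pool name.
ceph osd pool set mypool min_size 8

# Give OSDs more time to answer heartbeats before peers report them down
# (default grace is 20 s); inject at runtime, and persist the same value
# in ceph.conf under [osd] as "osd heartbeat grace = 45" to survive restarts.
ceph tell osd.* injectargs '--osd-heartbeat-grace 45'
```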
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
