Hi Sam,
Yesterday one PG went down in our cluster and I am confused by its state. I am
not sure whether this is a bug (or an issue that has already been fixed, as I
see a couple of related fixes in giant). It would be nice if you could take a
look. Here is what happened:
We are using an EC pool with 8 data chunks and 3 coding chunks. Say the PG has
an up/acting set of [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]. One OSD in the set
went down and came back up, which triggered recovery of the PG. However, during
recovery the primary OSD crashed due to a corrupted file chunk, then another
OSD became primary, started recovery and crashed as well, and so on until
4 OSDs in the set were down and the PG was marked down.
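As a rough sketch of the arithmetic behind this (a simplification based only on
shard counting, not Ceph's actual peering logic; the function name is made up
for illustration): with k=8 and m=3, the PG can lose up to 3 shards and still
have enough to reconstruct data, so the fourth crash is the one that takes it
down.

```python
K = 8  # data chunks in the EC profile
M = 3  # coding chunks in the EC profile


def enough_shards(down_in_set, k=K, m=M):
    """Return True if at least k of the k+m shards are still available.

    Hypothetical helper: it only counts shards and ignores everything
    else that peering takes into account.
    """
    return (k + m) - down_in_set >= k


for down in range(5):
    status = "enough shards" if enough_shards(down) else "not enough shards"
    print(f"{down} OSDs down -> {status}")
```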
After that, we left the OSD holding the corrupted data down and started all the
other crashed OSDs. We expected the PG to become active, but it is still
down, with the following query information:
{ "state": "down+remapped+inconsistent+peering",
"epoch": 4469,
"up": [
377,
107,
328,
263,
395,
467,
352,
475,
333,
37,
380],
"acting": [
2147483647,
107,
328,
263,
395,
2147483647,
352,
475,
333,
37,
380],
...
377]}],
"probing_osds": [
"37(9)",
"107(1)",
"263(3)",
"328(2)",
"333(8)",
"352(6)",
"377(0)",
"380(10)",
"395(4)",
"467(5)",
"475(7)"],
"blocked": "peering is blocked due to down osds",
"down_osds_we_would_probe": [
8],
"peering_blocked_by": [
{ "osd": 8,
"current_lost_at": 0,
"comment": "starting or marking this osd lost may let us proceed"}]},
{ "name": "Started",
"enter_time": "2014-11-12 10:12:23.067369"}],
}
Here osd.8 is the one holding the corrupted data.
We worked around the issue by setting norecover and starting osd.8 to get
that PG active, then removing the object (via rados) and unsetting norecover,
after which things became clean again. The most confusing part is that with
only osd.8 left down, the PG still could not become active.
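To make the expectation concrete, here is a small sketch that counts the shards
present in the acting set from the query output above. It assumes that
2147483647 (0x7fffffff) is the placeholder for "no OSD" in that list, and it
only counts shards: it does not model the past-interval checks that the
"peering_blocked_by" section refers to. The constant and function names are
chosen for illustration.

```python
NO_OSD = 2147483647  # 0x7fffffff, placeholder for an empty acting-set slot
K = 8                # data chunks in the EC profile

# Acting set copied from the pg query output above.
acting = [2147483647, 107, 328, 263, 395, 2147483647, 352, 475, 333, 37, 380]


def split_acting(acting_set):
    """Return (present, missing), where present maps shard index to OSD id
    and missing lists the shard indexes with no OSD assigned."""
    present = {i: osd for i, osd in enumerate(acting_set) if osd != NO_OSD}
    missing = [i for i, osd in enumerate(acting_set) if osd == NO_OSD]
    return present, missing


present, missing = split_acting(acting)
print("shards present:", len(present))
print("shards missing:", missing)
print("at least k shards present:", len(present) >= K)
```

By this count, 9 of the 11 shards are present (shards 0 and 5 are empty), which
is more than the 8 needed, and that is why we expected the PG to go active.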
We are using firefly v0.80.4.
Thanks,
Guang