On 10/3/21 12:08, 胡 玮文 wrote:

On Oct 4, 2021, at 00:53, Michael Thomas <[email protected]> wrote:

I recently started getting inconsistent PGs in my Octopus (15.2.14) Ceph
cluster.  I was able to determine that they are all coming from the same OSD:
osd.143.  This host recently suffered an unplanned power loss, so I'm not
surprised that there may be some corruption.  The PG is part of an EC 8+2 pool.

The OSD logs from the PG's primary OSD show this and similar errors from the 
PG's most recent deep scrub:

2021-10-03T03:25:25.969-0500 7f6e6801f700 -1 log_channel(cluster) log [ERR] : 
23.1fa shard 143(1) soid 23:5f8c3d4e:::10000179969.00000168:head : candidate 
had a read error
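(For reference, the usual way to see exactly which objects and shards a deep
scrub flagged is `rados list-inconsistent-obj`. A sketch, assuming the PG id
from the log above and admin access to the cluster:)

```shell
# List the objects/shards that the last deep scrub recorded as inconsistent
# in PG 23.1fa. Output is JSON; look for "read_error" under each shard.
rados list-inconsistent-obj 23.1fa --format=json-pretty
```

Note that this only returns data if the inconsistency was recorded by the most
recent scrub; otherwise it reports that no scrub information is available.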

In attempting to fix it, I first ran 'ceph pg repair 23.1fa' on the PG. This 
accomplished nothing.  Next I ran a shallow fsck on the OSD:
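(A shallow fsck on a BlueStore OSD is typically run with `ceph-bluestore-tool`
while the OSD daemon is stopped. A sketch, assuming osd.143 uses the default
data path; the exact path and service name depend on the deployment:)

```shell
# Stop the OSD daemon first -- fsck needs exclusive access to the store.
systemctl stop ceph-osd@143

# Shallow fsck: checks BlueStore metadata consistency.
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-143

# A deep fsck additionally reads and verifies object data (much slower);
# see the ceph-bluestore-tool man page for the --deep option spelling
# on your release.
```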

I expect the ‘ceph pg repair’ command to handle this kind of error. After
issuing it, the PG should enter a state like
“active+clean+scrubbing+deep+inconsistent+repair”; you then wait for the repair
to finish (this can take hours), and the PG should recover from the
inconsistent state. What do you mean by “This accomplished nothing”?

The PG never entered the 'repair' state, nor did anything appear in the primary OSD logs about a request for repair. After more than 24 hours, the PG remained listed as 'inconsistent'.
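(When a requested repair never seems to start, it is usually queued behind
scrub reservations on the participating OSDs rather than dropped. A few
commands that help confirm what state the PG is actually in; the PG id is the
one from this thread:)

```shell
# Any PGs currently in the repair state?
ceph pg ls repair

# Detailed state of the affected PG, including scrub/repair scheduling info.
ceph pg 23.1fa query

# Cluster-wide summary of inconsistent PGs.
ceph health detail
```

If the repair stays queued indefinitely, the usual suspects are the
osd_max_scrubs limit and the scrub time window (osd_scrub_begin_hour /
osd_scrub_end_hour) on the OSDs hosting the PG.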

--Mike
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
