On 02/17/2018 12:48 PM, David Zafman wrote:

The commits below came after v12.2.2 and may impact this issue. A pg in the active+clean+inconsistent state means that scrub has detected issues with one or more replicas of one or more objects. An unfound object is a potentially temporary state in which the current set of available OSDs doesn't allow an object to be recovered, backfilled, or repaired. When the primary OSD restarts, the set of unfound objects (an in-memory structure) is reset so that the new set of peered OSDs can determine again which objects are unfound.

I'm not clear in this scenario whether recovery failed to start, recovery hung earlier because of a bug, or recovery stopped (as designed) because of the unfound object. The new recovery_unfound and backfill_unfound states indicate that recovery has stopped due to unfound objects.
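
As a rough illustration of how the state strings mentioned above can be read, here is a minimal Python sketch that splits a pg state such as active+clean+inconsistent into its individual flags and checks whether recovery appears to have stopped on unfound objects. The helper names are hypothetical and the only state names used are the ones discussed above; this is not Ceph code and does not query a cluster.

```python
# Hypothetical helpers (not part of Ceph): interpret a pg state string.
# The two state names below are the ones described in the message above.
UNFOUND_STATES = {"recovery_unfound", "backfill_unfound"}


def pg_flags(state):
    """Split a state such as 'active+clean+inconsistent' into a set of flags."""
    return {flag for flag in state.split("+") if flag}


def is_inconsistent(state):
    """True if scrub has flagged the pg as inconsistent."""
    return "inconsistent" in pg_flags(state)


def stopped_on_unfound(state):
    """True if the state carries recovery_unfound or backfill_unfound."""
    return bool(pg_flags(state) & UNFOUND_STATES)
```

A state like active+clean+inconsistent would be reported as inconsistent but not as stopped on unfound objects, which is one way to tell the two conditions apart when scanning a list of pg states.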

Thanks for your comments David. I could certainly enable any additional logging that might help to clarify what's going on here - perhaps on the primary OSD for a given pg?

I am still having a hard time understanding why these objects repeatedly get flagged as unfound, when they are downloadable and contain correct data whenever they are not in this state. It is a 4+2 EC pool, so I would think it possible to reconstruct any missing EC chunks.
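
To make the 4+2 reasoning concrete, here is a small Python sketch of the counting argument, assuming a code in which any k of the k+m chunks are enough to rebuild the object (as with Reed-Solomon). The function name and shard numbering are hypothetical. It deliberately ignores everything Ceph does during peering, such as deciding whether the surviving shards all belong to the same object version, which may be exactly where the unfound state comes from.

```python
# Counting sketch only: assumes any k of the k+m chunks can rebuild the object.
# It does not model Ceph peering or per-shard version checks.
K_DATA = 4    # data chunks in the 4+2 profile
M_CODING = 2  # coding chunks in the 4+2 profile


def can_reconstruct(available_shards, k=K_DATA, m=M_CODING):
    """Return True if enough distinct shards (numbered 0..k+m-1) are present."""
    present = set(available_shards)
    if not present <= set(range(k + m)):
        raise ValueError("shard ids must be in the range 0..k+m-1")
    return len(present) >= k
```

With this profile, losing any two shards still leaves four, so can_reconstruct([0, 2, 4, 5]) holds, while three surviving shards are not enough.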

It's an extensive problem; while I have been focusing on examining a couple of specific pgs, the pool in general is showing 2410 pgs inconsistent (out of 4096).

Graham Allan
Minnesota Supercomputing Institute - g...@umn.edu
ceph-users mailing list
