On Wed, May 27, 2015 at 6:35 PM, Julian Foad <julianf...@gmail.com> wrote:
> Stefan Fuhrmann wrote:
>> Alright. I gave it a bit more thought now.
>>
>> Whenever we encounter this mismatch, something pretty
>> bad likely happened to the repo - such as a failed restore
>> attempt. In turn, we can expect those situations to be
>> very rare - which means we can afford some disruption
>> for the user.
>>
>> I suggest that we do 3 things:
>>
>> * log the warning - for future reference, for being picked
>>   up by monitoring tools etc.
>
> We already do that.

Oh, absolutely. I just didn't mention it.

>> * clear the rep-cache.db
>
> Clearing the cache and continuing operation may make subsequent
> commits much larger than they should be, and there is no easy way to
> undo that if it happens.

Rep-sharing typically reduces the repo size by 25% (e.g. Apache)
to 60% (WordPress, inexperienced users using plain ADD for tags).
Assuming that most rep-sharing is relatively local, i.e. over the
span of a "few" revisions, e.g. due to catch-up merges between
branches, most of the inefficiency will only be temporary.
In short: no major impact.

> Attempting to clear the rep cache might itself fail in some way,
> depending on what kind of corruption has happened to it. It would also
> destroy the evidence of what went wrong.

That is a good point. Two good points, actually.

>> * fail the current commit
>>
>> That way, we can be quite sure that only valid data gets
>> committed.
>
> Failing the current commit will ensure that no potentially bad (but
> undiagnosed) response from the rep cache has already been used in an
> earlier part of the transaction. I suppose that's what you're thinking
> of. That makes sense to me.

Yes, that and the rep cache also being used to validate the
incoming data - even if it is very unlikely that we mess up
the server-side SHA1 calculation of the fulltext stream.

>> Alternatively, we could block any commit
>> (inventing some new repo state) until the admin resolves
>> the situation manually. Not sure which one I would prefer.
>
> I suggest this is the best option, unless we specifically design and
> the administrator specifically chooses an option to have higher
> availability at the expense of disk space, fault diagnosis, and so on.

We could add a "continue-upon-failure" option to the [rep-cache]
section in fsfs.conf. Default would be "false". If set to true,
commits would not be held off by rep-cache failures but the
rep-cache would be disabled. If set to false, the repo goes
into a r/o state.
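
To illustrate, here is a minimal sketch of that decision in Python (not
Subversion code). The "continue-upon-failure" option exists only as the
proposal above; the Action enum and on_rep_cache_failure() are
hypothetical names, and fsfs.conf is parsed with Python's configparser
purely for convenience.

```python
import configparser
from enum import Enum

# Hypothetical fsfs.conf fragment. "continue-upon-failure" is only the
# option proposed in this mail, not an existing setting.
FSFS_CONF = """
[rep-cache]
continue-upon-failure = false
"""


class Action(Enum):
    """What happens to the repository after the current commit has failed."""
    DISABLE_REP_CACHE = "disable the rep-cache, keep accepting commits"
    READ_ONLY = "put the repository into a read-only state"


def on_rep_cache_failure(conf_text):
    """Pick the follow-up action for a detected rep-cache inconsistency."""
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    # A missing section or option falls back to the proposed default, "false".
    keep_going = parser.getboolean(
        "rep-cache", "continue-upon-failure", fallback=False)
    return Action.DISABLE_REP_CACHE if keep_going else Action.READ_ONLY
```

With the fragment as shown, on_rep_cache_failure(FSFS_CONF) picks the
read-only state; flipping the option to "true" picks the variant that
trades rep-sharing for availability.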

>> On top of that, we should handle the other rep-cache.db
>> consistency checks (e.g. head vs. rev of latest entry)
>> the same way.
>
> That makes sense.
>
> I suggest all of this should be treated as a possible future
> enhancement, not anything urgent.

I agree. In particular because it will require a format
bump for putting the "r/o" or "corruption" indicator
somewhere.

-- Stefan^2.