On Tue, Jul 31, 2012 at 11:26 AM, Philip Martin <philip.mar...@wandisco.com>wrote:
> Stefan Fuhrmann <stefan.fuhrm...@wandisco.com> writes: > > > On Mon, Jul 30, 2012 at 12:40 PM, Philip Martin > > <philip.mar...@wandisco.com>wrote: > > > >> When the commit process finds a representation in the rep-cache the only > >> sanity check that happens is that the revision must be less than or > >> equal to HEAD. We don't check that the offset is valid: > >> > >> echo foo > foo > >> svnadmin create repo > >> svn import -mm foo file://`pwd`/repo/A/f > >> sqlite3 repo/db/rep-cache.db "update rep_cache set offset = 4" > >> svn import -mm foo file://`pwd`/repo/A/g > >> > >> or that the checksum at that offset matches: > >> > >> echo foo > foo > >> echo bar > bar > >> svnadmin create repo > >> svn import -mm foo file://`pwd`/repo/A/f > >> sqlite3 repo/db/rep-cache.db "update rep_cache set > >> hash='e242ed3bffccdf271b7fbaf34ed72d089537b42f'" > >> svn import -mm bar file://`pwd`/repo/A/g > >> > >> In both cases corruption in the rep-cache leads to corruption in the > >> revision files but that corruption is not detected by commit process > >> even though subsequent checkouts will fail. > >> > > > > Has that kind of corruption been observed in the wild? > > I'm not aware of any reports. I did start looking at this because of a > repository with an incorrect offset referring to an older revision file > but that now appears to be a different problem. > > >> Should we do more sanity checking? We are using rep-cache to discard > >> data supplied by the client on the basis that it is already present in > >> the repository. Should we check that the offset really is a > representation > >> with the expected checksum? > >> > > > > The full verification would look like this: > > * recursively enumerate all noderevs in the rep's revision > > * check that at least one uses the rep > > * read the rep and verify the checksum > > > > This seems quite costly to do during commit - in particular during > > imports and similar mass commit operations. > > On commit we would only want to verify that a revision/offset obtained > from the cache really was a representation. That is hard to truly verify. But from a practical POV, you are right - we can check whether (offset, length) points to something that * fully lies within the respective file * starts with "PLAIN\n", "DELTA\n" or "DELTA $someValidRepReference" * ends with "ENDREP\n" Please do *not* check that offsets lie within a specific range but only within the respective file / pack file. I'm working on an experimental tool that will reorder the elements within pack files and there will be no upper limit for representation offsets of a given revision other than the pack file length itself. > I suppose we might want to > construct the full-text corresponding to the representation to verify > the checksum. Yes. Worst thing that might happen is a crash because we didn't check some buffer size. combined_window_cache will reduce the I/O overhead for frequently checked reps such as file properties. > We do not need to verify the whole revision or the whole > rep-cache. > I only thought that it is hard to verify that a given offset range actually *is* a representation. To verify that, there must be some noderev pointing to it. Thus, we had to enumerate all noderevs until we found a suitable one. -- Stefan^2. -- *Join us this October for Subversion Live 2012<http://www.wandisco.com/svn-live-2012>– 2 full days of training, networking, live demos and more! 25% off before Aug. 10th with discount code “earlybird.” *Certified & Supported Apache Subversion Downloads: http://www.wandisco.com/subversion/download