On Wed, Feb 17, 2010 at 08:09, Bert Huijben <b...@qqmail.nl> wrote:
>...
>> [ ] It is cheap to tee to a checksumming stream when reading a
>> stored pristine. The I/O is much slower anyway.
> When we are reading the complete stream: Yes.
> I would skip verification if we only do a partial read (if possible).
That's my belief, but testing can/should confirm. I imagine an SSD "disk" and a good OS buffer cache might actually be pretty damned fast, which would then lead to a checksum being a "noticeable" percentage of reading the stream. Of course, we're usually doing *something* with that stream as we read it, and that invariably is to write it somewhere else (temp file, network, etc).

>> [ ] It is expensive to *always* want to tee to a checksumming stream,
>> because that defies speed gain when wanting to read just a subsection
>> of a pristine, and defies using OS copy without de-/translation to put
>> a pristine's copy into the WC.
> See previous answer

I doubt we read portions all that often. My naive approach would have pristine_read() return a checksumming stream. *IF* the stream is read to completion, then it verifies the result. If the stream is closed early (due to a partial read), then we just don't validate.

That said... if "always checksumming" proves to be a bit expensive overall, then we may want to be strategic about *when* to do a checksum. That would imply a flag to pristine_read() asking it to do a checksum as we read. (And for known partial reads, we just set the flag to FALSE.)

>> [ ] We don't care about optimizing for reading subsections of a pristine.
>> We "always" read the whole pristine anyway.
> Not sure.
> We certainly don't read the entire file if we are only trying to check if a
> file has changed or not. (1st byte in a 2GB file has changed?)

Right. I think this kind of goes to "we may want to be strategic about when to do a checksum verification".

>> [ ] If the checksum is already known and the data exists in the local
>> file system, there is no need to verify the checksum.

I'm assuming you mean copying from a source in the local file system, and writing that into the pristine store. Yes, if we have a checksum for that file, then we can just use it since we (invariably) *just* computed it.

> That depends. If we are going to read it anyway...
> In most cases I agree. The filesystem also has verifications on bit errors.
> But you also have sed -i <.....>

Not all filesystems do. That is why we record checksums for repository contents. That "one bit error every five years" needs to be detectable.

>> [ ] If the checksum is already known and the data exists in the local
>> file system, we still have to verify the checksum again via a tempfile
>> unless that local file already *is* a tempfile (and is thus reasonably
>> protected from accidental modification before we finish our copy).
> Agree. Any process can change files in the WC at any time. Even if we
> checksummed and then moved we have a chance of failure.

Yup. Note that we pretty much *always* copy a source file into a temp area. During that copy, we can checksum and (de)translate the thing. These kinds of file copies occur all over the WC.

>...
>> [ ] We are totally fine with just checking size (and maybe mtime) of
>> pristines locally, no need to *always* do checksumming when reading.
> +1.
> Just check where it doesn't impact performance. (E.g. incoming updates that
> read the whole file anyway)

I think a flag to pristine_read() may be the way to go.
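To make that concrete, here is a rough sketch of the read side under that flag idea. This is only a sketch: the verify flag and the helper name are hypothetical, not a settled API; the svn_checksum_* and svn_stream_* calls are the existing public 1.6 APIs.

[[[
#include "svn_checksum.h"
#include "svn_error.h"
#include "svn_error_codes.h"
#include "svn_io.h"

/* Drain CONTENTS (a pristine read stream), handing each chunk to the
   caller's real destination.  When VERIFY is TRUE and we reach EOF,
   compare the computed SHA-1 against EXPECTED.  A partial read never
   reaches the comparison, which matches "verify only on a full read".  */
static svn_error_t *
read_pristine_maybe_verify(svn_stream_t *contents,
                           const svn_checksum_t *expected,
                           svn_boolean_t verify,
                           apr_pool_t *pool)
{
  svn_checksum_ctx_t *ctx = svn_checksum_ctx_create(svn_checksum_sha1, pool);
  char buf[16384];

  while (TRUE)
    {
      apr_size_t len = sizeof(buf);

      SVN_ERR(svn_stream_read(contents, buf, &len));
      if (verify)
        SVN_ERR(svn_checksum_update(ctx, buf, len));

      /* ... hand BUF off to the temp file / network / whatever ... */

      if (len < sizeof(buf))
        break;  /* EOF: the whole pristine was read.  */
    }

  if (verify)
    {
      svn_checksum_t *actual;

      SVN_ERR(svn_checksum_final(&actual, ctx, pool));
      if (!svn_checksum_match(expected, actual))
        return svn_error_createf(SVN_ERR_CHECKSUM_MISMATCH, NULL,
                                 "Pristine text '%s' fails verification",
                                 svn_checksum_to_cstring_display(expected,
                                                                 pool));
    }

  return svn_stream_close(contents);
}
]]]

(svn_stream_checksummed2() already does the "tee to a checksum as you read" part, so pristine_read() could simply hand back such a wrapped stream and do the comparison when it is closed; the only open question is whether that is unconditional or behind the flag.)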
>> Pristine check.
>> Do we need a "fast" check? Note that it doesn't need to say how it's
>> implemented; just say whether you think we need a presence check that tries
>> to be as fast as possible by assuming the pristine store is consistent.
>> Greg said there is a use case for that lurking, so I suggest keeping it in
>> the design doc but marking it for post-1.7.

Yes? Sure. I think the question that comes up is, "am I missing this pristine?" Would have to dig more to remember where/why that comes up.

>> Write.
>> Julian said: there possibly should be only _write(), a simple API that takes
>> care of the rest internally.
>> Greg said: _write() has to go away, it has no way of telling whether the
>> write stream was interrupted by error or ended successfully.
>> Greg also said: It is fine to write directly to a pristine file location.
>> Which are/should be true?

Write does have to go away. No getting around that, and I won't support further changes to the stream API to "fix" that issue. We've already made the streams more complicated than I think they should be (the recent: reset, mark, seek, and line functions). They used to just be read/write/close. Grr.

>> [ ] We will validate pristines when reading anyway, so we can neglect
>> being paranoid about writing to pristine files without losing safety.
> We should not write faulty files. If somebody opens the stream he assumes
> the file is ok and on some OSs he keeps a lock on the file while he is
> reading so nobody can repair it. (And on other OSs the file should never
> change while it is open)

The schema is designed to *allow* for faulty files. A faulty file will NOT be used because SIZE==null in the database. If the file is successfully written, *then* we store a value into SIZE, indicating the file is present and correct. Of course, if you're overwriting a pristine file, then you'd better store SIZE==null before you begin writing.

>> [ ] It is ok to open a write stream to the final location of the
>> pristine file for a given checksum, no need to store in a
>> pristine-tempfile first.
> I think we should say no here, to guarantee a stable behavior. If the file
> is in the pristine store it should be OK. We can't trust the initial write
> to always provide the full file and a valid hash of this file.
> We should calculate ourselves if it is possible that we receive an
> incomplete file.

The schema allows for this, so it can/should be just fine.

> And calculating ourselves defines that we don't know the final location.

Usually, we assemble a file in the temp area that is detranslated. At that time, we can compute the checksum. And then we install the file as a pristine. (There's a rough sketch of that flow below.)

We might have a file incoming from the repository with a known checksum. We can spool that directly into the pristine store.

>...
>> [ ] However, we do not provide an API function for that. The only way
>> to create a pristine (for now) will be via _tempdir() and _install().
> +1

I see no reason to limit it this way.

>> [ ] But we will likely see a need at some point, upon which such API
>> will be added. We will change _write() so that it can tell the
>> difference between successful and unsuccessful stream closure.

-1

>...
>> [ ] Callers can also pass any other file locations to _install(), to
>> facilitate OS-copying/moving into the pristine store without de-/
>> translation (when the checksum is known beforehand).
> At which API level do we trust the callers?

This would be a wc_db API, and (thus) trustable within WC. The APIs for pristine management provided as the WC public API... none beyond "read", I think.

>
>> [ ] ^ ... copying ...
>> [ ] ^ ... moving ...
> There are not that many operations where this is actually used. Maybe after
> a commit? The pristine store is mostly read-only and no APIs should deliver
> the file locations; just streams.

Absolutely never a path. That killed us in wc-1.
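Since the _tempdir()/_install() flow keeps coming up, here is roughly what I picture the caller side looking like. Again, just a sketch: pristine_tempdir() and pristine_install() stand in for whatever wc_db ends up exposing, and everything else is the existing svn_stream_*/svn_checksum_* API.

[[[
#include "svn_checksum.h"
#include "svn_error.h"
#include "svn_io.h"
#include "wc_db.h"   /* svn_wc__db_t; this sketch lives inside libsvn_wc */

/* Store CONTENTS (already detranslated) as a new pristine for the working
   copy identified by WRI_ABSPATH.  pristine_tempdir() and
   pristine_install() are placeholders for the eventual wc_db functions.  */
static svn_error_t *
add_pristine(svn_wc__db_t *db,
             const char *wri_abspath,
             svn_stream_t *contents,
             apr_pool_t *scratch_pool)
{
  const char *tempdir_abspath;
  const char *temp_abspath;
  svn_stream_t *temp_stream;
  svn_checksum_t *sha1_checksum;

  /* 1. Assemble the file in the pristine temp area, which lives on the
        same filesystem as the store so the final install is a cheap
        rename.  */
  SVN_ERR(pristine_tempdir(&tempdir_abspath, db, wri_abspath, scratch_pool));
  SVN_ERR(svn_stream_open_unique(&temp_stream, &temp_abspath,
                                 tempdir_abspath, svn_io_file_del_none,
                                 scratch_pool, scratch_pool));

  /* 2. Checksum while writing; no second pass over the data.  The
        checksum becomes available once the stream is closed (which
        svn_stream_copy3 does for us).  */
  temp_stream = svn_stream_checksummed2(temp_stream, NULL, &sha1_checksum,
                                        svn_checksum_sha1, FALSE,
                                        scratch_pool);
  SVN_ERR(svn_stream_copy3(contents, temp_stream, NULL, NULL, scratch_pool));

  /* 3. Move into place under its checksum.  Only after this succeeds does
        the PRISTINE row get a non-null SIZE, which is what marks the text
        as present and usable.  */
  return pristine_install(db, temp_abspath, sha1_checksum, scratch_pool);
}
]]]

(For data spooled straight from the repository with a known checksum, step 2 is where we could skip recomputing and just trust the supplied value, per the earlier point.)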
>> Separate pristine store?
>...
> So: design needed

Yah. Post-1.7, I say. For now, the pristine store and wc.db are tied on a 1:1 basis, at the root of each working copy.

>...
>> SHA1/MD5 compat.
>> [ ] When talking to an old server, we use only MD5 checksums to reference
>> the pristine in the pristine store, nevermind hash collisions.

If we don't have a SHA1 handy, then I don't mind this. But I think we should store stuff using *only* MD5 or *only* SHA1. Mixing them could end up with duplicates. Not sure if that is a problem, but it feels like it could be.

>...
>> [ ] When talking to an old server, we use SHA1 to index the pristine
>> store and also store the MD5 checksum for server comm. We create the
>> SHA1 checksum locally after having verified the MD5 checksum.
> +1. This is the original WC-NG plan

I wouldn't go *that* far. There was definitely some hand-waving going on. The MD5 in the editor API is a severely complicating item. I had hoped to fix that particular problem with Ev2.

>> [ ] When talking to a 1.7 server, we will receive and use only SHA1
>> checksums and don't store any MD5 checksums.
>
> There is currently no delta editor upgrade for 1.7, so we still have to send
> MD5 to our 1.7 server as it is now. And to wrap old editors with new editors
> (and reversed) we probably have to send both hashes whenever possible.
> (The repository filesystem already stores SHA1 hashes since 1.6 or ?)

I think we need to keep MD5 checksums around, in addition to the SHA1, for compat with the Editor v1 APIs. Sigh.

>...

Cheers,
-g