Re: Idea for reducing disk IO on tagging operations
Hello,

sorry for the late reply to this, but I was on vacation. Anyway, I believe I might be able to contribute something to this discussion, which even resulted in some code.

* On Sun, Mar 20, 2005 at 11:54:32PM + Dr. David Alan Gilbert wrote:

> OK, my conscience will let me carefully ignore NFS issues given the pain it causes me elsewhere (and I make my mechanism switchable).
>
> What happens if I only used the overwrite mechanism if none of the characters being modified crossed a 512 (e.g.) byte boundary offset in the file? Since the spaces were actually written in a previous operation we can assume that the space is allocated and no allocation operation is going to happen at this point (mumble filesystem journalling mumble!).

IMHO, you are not correct here. If I write a character Y into a file X times, I cannot assume that storage for X characters has been allocated. The file system can apply optimizations such as compressing the file (for example run-length encoding, RLE: a first value says that the same character repeats X times, and the character itself follows), or anything else. Furthermore, think of so-called sparse files, which can be rather big, much bigger than the actual medium itself.

Because of this, even a block boundary in the file does not make much sense, IMHO, for the general case, that is, arbitrary file systems.

Regards,
Spiro.

--
Spiro R. Trikaliotis
http://cbm4win.sf.net/
http://www.trikaliotis.net/
http://www.viceteam.org/

___
Info-cvs mailing list
Info-cvs@gnu.org
http://lists.gnu.org/mailman/listinfo/info-cvs
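Spiro's sparse-file objection can be illustrated with a minimal C sketch. It assumes a POSIX system; the helper name make_sparse_file is made up for illustration and comes from neither CVS nor the patch under discussion. The function creates a file whose logical size is far larger than the one byte actually written to it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: create a file whose logical size is `size` bytes
 * while writing only a single byte at the very end. On file systems that
 * support sparse files the skipped range need not occupy any disk blocks.
 * Returns 0 and fills *st on success, -1 on error. */
int make_sparse_file(const char *path, off_t size, struct stat *st)
{
    assert(size > 0);
    int fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    /* Seek past the end and write one NUL byte: everything before it is a hole. */
    if (lseek(fd, size - 1, SEEK_SET) == (off_t)-1 || write(fd, "", 1) != 1) {
        close(fd);
        return -1;
    }
    int rc = fstat(fd, st);
    close(fd);
    return rc;
}
```

After a call with a size of 1 MiB, st.st_size reports the full logical size, while st.st_blocks (counted in 512-byte units on Linux; POSIX leaves the unit unspecified) will usually be much smaller. The exact numbers depend on the file system, which is the point being made: the size of a file says little about what has been allocated on disk.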
Re: Idea for reducing disk IO on tagging operations
Hi,

Well, I've had a crack at implementing the optimisation, and attached is a patch which seems to work - but there is at least one nasty hack in it; more about that in a sec.

To enable it you need to add:

TagOverwriteEnable=yes

to the config file in the CVSROOT; without that it should not change behaviour in any way (except adding that as a commented-out option, with a warning, to a newly created repository).

It won't give you any performance benefit on the first tag, but should give something on subsequent tags. I see some improvement (~15%), but it is variable, on a large repository that doesn't fit in memory on my home machine.

It is my first dig into the CVS code base, so I would appreciate (gentle) comments.

Now some details:

1) The real nasty hack. This is something that I hadn't thought of (and I don't think anyone else noticed?) in my original description: the permissions on the RCS files are read-only, so when I need to open them to overwrite I can't. This patch has a gratuitous (and obviously WRONG) hack of chmod'ing the file before the open. I'm open to any suggestions *if* there is a right way of doing this. (This was a pain because it was at the very last stage of the patch that I noticed this!)

2) I don't currently create the dummy ,foo, locking file.

3) I haven't written any docs yet.

4) I needed to get a couple of values out of rcsbuf_getkey and have shoved them in globals for the moment; I was looking for a neater way that wouldn't mean changing all the callers.

5) I'm worried about the right types to use for file offsets in a portable way. (Has anyone tried cvs with RCS files over 2GB?)

The patch is against 1.12.9, which is the version my Debian happened to have.

As I say, suggestions and experiences welcome.

Dave
 -----Open up your eyes, open up your mind, open up your code -------
/ Dr. David Alan Gilbert    | Running GNU/Linux on Alpha,68K| Happy  \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC HPPA   | In Hex /
 \ _________________________|_____ http://www.treblig.org   |_______/

diff -ur orig/cvs-1.12.9/ChangeLog cvs-1.12.9/ChangeLog
--- orig/cvs-1.12.9/ChangeLog 2004-06-09 15:52:32.0 +0100
+++ cvs-1.12.9/ChangeLog 2005-03-24 23:43:48.0 +
@@ -1,3 +1,6 @@
+2005-03-24 Dave Gilbert [EMAIL PROTECTED]
+ * Added fast tagging mechanism; rcs.h/c, parseinfo.c,mkmodules.c
+
 2004-06-09 Derek Price [EMAIL PROTECTED]

 * NEWS: Note Stefan Sebastian's security fixes.
diff -ur orig/cvs-1.12.9/src/admin.c cvs-1.12.9/src/admin.c
--- orig/cvs-1.12.9/src/admin.c 2004-03-22 15:37:34.0 +
+++ cvs-1.12.9/src/admin.c 2005-03-27 20:39:38.0 +0100
@@ -792,7 +792,7 @@
     || (rev = RCS_tag2rev (rcs, p))) /* tag2rev may exit */
   {
     RCS_check_tag (tag); /* exit if not a valid tag */
-    RCS_settag (rcs, tag, rev);
+    RCS_settag (rcs, tag, rev, NULL);
     free (rev);
   }
   else
diff -ur orig/cvs-1.12.9/src/commit.c cvs-1.12.9/src/commit.c
--- orig/cvs-1.12.9/src/commit.c 2004-06-09 15:52:37.0 +0100
+++ cvs-1.12.9/src/commit.c 2005-03-27 20:39:45.0 +0100
@@ -2144,7 +2144,7 @@

   head = RCS_getversion (rcs, NULL, NULL, 0, (int *) NULL);
   magicrev = RCS_magicrev (rcs, head);
-  retcode = RCS_settag (rcs, tag, magicrev);
+  retcode = RCS_settag (rcs, tag, magicrev, NULL);
   RCS_rewrite (rcs, NULL, NULL);
   free (head);
diff -ur orig/cvs-1.12.9/src/import.c cvs-1.12.9/src/import.c
--- orig/cvs-1.12.9/src/import.c 2004-04-27 22:08:40.0 +0100
+++ cvs-1.12.9/src/import.c 2005-03-27 20:39:59.0 +0100
@@ -770,7 +770,7 @@
   if (noexec)
     return (0);

-  if ((retcode = RCS_settag(rcs, vtag, vbranch)) != 0)
+  if ((retcode = RCS_settag(rcs, vtag, vbranch, NULL)) != 0)
   {
     ierrno = errno;
     fperrmsg (logfp, 0, retcode == -1 ? ierrno : 0,
@@ -792,7 +792,7 @@
   vers = Version_TS (finfo, NULL, vtag, NULL, 1, 0);
   for (i = 0; i < targc; i++)
   {
-    if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs)) == 0)
+    if ((retcode = RCS_settag (rcs, targv[i], vers->vn_rcs, NULL)) == 0)
       RCS_rewrite (rcs, NULL, NULL);
     else
     {
diff -ur orig/cvs-1.12.9/src/mkmodules.c cvs-1.12.9/src/mkmodules.c
--- orig/cvs-1.12.9/src/mkmodules.c 2004-05-29 05:48:52.0 +0100
+++ cvs-1.12.9/src/mkmodules.c 2005-03-24 23:43:38.0 +
@@ -349,6 +349,23 @@
   "# Be warned that these strings could be disabled in any new version of CVS.\n",
   "UseNewInfoFmtStrings=yes\n",
 #endif /* SUPPORT_OLD_INFO_FMT_STRINGS */
+  "# Options relating to the Tag overwrite optimisation\n",
+  "# ** WARNING ** Only enable this after reading the appropriate documentation\n",
+  "# since it can cause
Re: Idea for reducing disk IO on tagging operations
I followed this discussion only loosely and kept silent because I suspect someone will shoot me to pieces for the complaint I'm about to make, but now that we're at the stage of actual implementation, I guess I'd like to say this anyway...

I have reservations about any system that makes whitespace significant in a text file. I can make an exception for indent levels, as used by Python, because these are visible and errors are obvious without resorting to odd tactics like hex editors, vi's :list command, etc.

I say I expect to be shot down because, of course, the proper theory is that everything in a CVS file is opaque and should not be depended upon by CVS users. True in theory, but in practice I've sometimes found it much quicker to fix, say, a log mistake by hand in a CVS file (yes, I'm aware of the section specifically addressing this in Cederqvist). The current danger of editing the file directly is real, but I think it is much more easily avoided now than if we come to require a lot of consecutive lines of just whitespace which, if mangled, could cause overwrites of other data and suchlike.

I'll leave the message that spurred this commentary below for reference, but I top-posted because it's really a new subject (well, that, and I suppose I have a bias against bottom-posting: I'm blind, and bottom-posting makes me read through the whole blooming family tree of messages every time a new one comes along <grin> - but I digress, and I'm not about to try to change the list's standard on this).

To the author of the idea being discussed, my apologies for weighing in so tardily. I guess I'm guilty of having complacently assumed nothing would happen. I see your new behavior is made optional and a user choice, which I appreciate.

On Mon, Mar 28, 2005 at 07:06:36PM +0100, Dr. David Alan Gilbert wrote:

> Hi,
>
> Well, I've had a crack at implementing the optimisation; and attached is a patch which seems to work - but there is at least one nasty hack in it; more about that in a sec.
Re: Idea for reducing disk IO on tagging operations
* Doug Lee ([EMAIL PROTECTED]) wrote:

> I followed this discussion only loosely and kept silent because I suspect someone will shoot me to pieces for the complaint I'm about to make, but now that we're to the stage of actual implementation, I guess I'd like to say this anyway...

Hey, that's OK.

> I have reservations about any system that makes whitespace significant in a text file. I can make an exception for indent levels, as used by Python, because these are visible and errors are obvious without resorting to odd tactics like hex editors, vi's :list command, etc.

Let me make it clear that this patch *in no way* makes whitespace significant; in fact, it only works because whitespace isn't significant. What it does is put in a glob of whitespace when it is convenient; nothing changes in the parsing or anything, so just like before, that whitespace is completely ignored. The trick is that when it comes to adding a tag, it checks whether there is spare whitespace and, if so, overwrites it. If you removed the whitespace or otherwise fettled with the file, that is fine; it simply won't perform the optimisation.

Indeed, this means that an existing cvs client can quite happily read a repository which has had my patch inflicted on it. (The existing cvs code that rewrites the file will remove any excess whitespace you added up there anyway.)

Dave
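The check Dave describes (use the spare whitespace if there is enough of it, otherwise fall back to the normal rewrite) can be sketched on an in-memory buffer. This is only an illustration under stated assumptions: the function names, the choice of spaces and newlines as padding bytes, and the position of the padding are hypothetical and are not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Length of the run of padding bytes (spaces and newlines, as assumed
 * here) that starts at buf[pos]. */
size_t spare_whitespace(const char *buf, size_t len, size_t pos)
{
    assert(pos <= len);
    size_t n = 0;
    while (pos + n < len && (buf[pos + n] == ' ' || buf[pos + n] == '\n'))
        n++;
    return n;
}

/* Try to place `entry` into the padding that starts at buf[pos].
 * Returns 1 and overwrites in place if the entry fits with at least one
 * padding byte left over as a separator. Returns 0 and leaves buf
 * untouched otherwise, in which case the caller would fall back to the
 * ordinary full rewrite. */
int overwrite_padding(char *buf, size_t len, size_t pos, const char *entry)
{
    size_t need = strlen(entry);
    if (spare_whitespace(buf, len, pos) < need + 1)
        return 0;
    memcpy(buf + pos, entry, need);
    return 1;
}
```

Because the buffer never changes length and the parser skips whitespace, a file treated this way stays readable by tools that know nothing about the optimisation, and a file whose padding has been stripped simply takes the slow path.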
RE: Idea for reducing disk IO on tagging operations
[top posting as a courtesy for Doug]

I haven't examined the patch, so I don't know how closely the implementation matches the proposal, but if I understand the proposed changes, whitespace is still insignificant; there's just more of it added as a buffer, as an optimization to improve speed when applying tags.

If the implementation is carried out correctly, then the RCS file will still be compatible with other RCS-compatible software, some of which could legitimately strip out the extra whitespace (unless the general practise is to leave whitespace alone).

My only concern around this patch is to make sure robustness has not been adversely affected. I don't know enough about third-party add-ons to know for sure, or to comment on their use.

I also like the fact that the change is optional, so that it can be disabled if any particular platform is incompatible with the changes.

Doug Lee wrote:

> I have reservations about any system that makes whitespace significant in a text file. I can make an exception for indent levels, as used by Python, because these are visible and errors are obvious without resorting to odd tactics like hex editors, vi's :list command, etc.
>
> I say I expect to be shot down because, of course, the proper theory is that all in a CVS file is opaque and should not be depended upon by CVS users.

--
Jim Hyslop
Senior Software Designer
Leitch Technology International Inc. ( http://www.leitch.com )
Columnist, C/C++ Users Journal ( http://www.cuj.com/experts )
Re: Idea for reducing disk IO on tagging operations
On Mon, Mar 28, 2005 at 02:12:56PM -0500, Jim.Hyslop wrote:

> [top posting as a courtesy for Doug]

Thanks :) I just hope I don't cause a mess by that comment, which I suppose was fuelled as much by lack of lunch as by anything. :-)

> I haven't examined the patch, so I don't know how closely the implementation matches the proposal, but if I understand the proposed changes, whitespace is still insignificant, there's just more of it added as a buffer, as an optimization to improve speed when applying tags.
>
> If the implementation is carried out correctly, then the RCS file will still be compatible with other RCS-compatible software, some of which could legitimately strip out the extra whitespace (unless the general practise is to leave whitespace alone).

You are correct, according to a message the author just sent me. Consider my complaint dismissed, and thanks for the explanations.

> My only concern around this patch is to make sure robustness has not been adversely affected. I don't know enough about third-party add-ons to know for sure, or to comment on their use.
>
> I also like the fact that the change is optional, so that it can be disabled if any particular platform is incompatible with the changes.

--
Doug Lee [EMAIL PROTECTED] http://www.dlee.org
Bartimaeus Group [EMAIL PROTECTED] http://www.bartsite.com
It is difficult to produce a television documentary that is both incisive and probing when every twelve minutes one is interrupted by dancing rabbits singing about toilet paper. --Rod Serling
Re: Idea for reducing disk IO on tagging operations
* Jim Hyslop ([EMAIL PROTECTED]) wrote:

> Dr. David Alan Gilbert wrote:
>
> > 2) I could do with a better understanding of the directory locks; pointers? I've read the top of lock.c but it still doesn't tell me enough; for example there seem to be multiple lock files used - but then surely the creation of them isn't atomic? Or is there one lock file used for both reading and writing?
>
> The locking process is explained in the manual, at https://www.cvshome.org/docs/manual/cvs-1.11.19/cvs_2.html#SEC17

Thanks, Jim, for pointing me at that (I'd had a good search through the FAQ rather than the manual). (And to Paul - apologies if I misquoted in that last email.)

OK; this convinces me that I don't need to worry about cvs reading my file while it is being modified. Together with the restriction that I only perform my trick if the write is entirely within a block, I feel reasonably safe.

I'm going to have a crack at making this optimisation and will forward a copy here for discussion when I've done it.

Dave
Re: Idea for reducing disk IO on tagging operations
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

> Paul Sander [EMAIL PROTECTED] writes:
>
> Actually, if you look closely, I believe that CVS will not do read-only RCS operations if a CVS write-lock exists for the directory. Of course, ViewCVS and CVSweb do it all the time, as do many of the other add-ons.

I'm getting more worried about this one for 2 separate reasons:

1) There is talk of cvs -n for diff and the like, which seems to suggest it ignores locks.

2) I could do with a better understanding of the directory locks; pointers? I've read the top of lock.c but it still doesn't tell me enough; for example there seem to be multiple lock files used - but then surely the creation of them isn't atomic? Or is there one lock file used for both reading and writing?

> > There's also the interrupt issue: Killing an update before it completes leaves the RCS file corrupt. You'd have to build in some kind of crash recovery. But RCS already has that by way of the comma file, which can simply be deleted. Other crash recovery algorithms usually involve transaction logs that can be reversed and replayed, or the creation of backup copies. None of these are more efficient than the existing RCS update protocol.
>
> Agreed. This is a very big deal.

Actually I'm becoming less worried by this; I'm failing to see any way that a single write() system call to a range that does not cross a block boundary could partially fail. But I'm up for suggestions.

Dave
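The "single write that stays inside one block" condition Dave relies on can be written down as a small C sketch. It assumes POSIX pwrite(); the function names and the block-size parameter are hypothetical, and nothing here shows that such a write is atomic on any particular file system, which is exactly what is disputed elsewhere in the thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* True if the byte range [off, off + len) lies inside a single block of
 * block_size bytes, i.e. the write would not cross a block boundary. */
int within_one_block(off_t off, size_t len, size_t block_size)
{
    assert(block_size > 0 && off >= 0);
    if (len == 0)
        return 1;
    off_t bs = (off_t)block_size;
    return off / bs == (off + (off_t)len - 1) / bs;
}

/* Overwrite len bytes at off with one pwrite() call, but only when the
 * range stays inside one block. Returns 1 if written, 0 if the caller
 * should fall back to a full rewrite, -1 on an I/O error or short write. */
int overwrite_in_block(int fd, off_t off, const void *data, size_t len,
                       size_t block_size)
{
    if (!within_one_block(off, len, block_size))
        return 0;
    ssize_t n = pwrite(fd, data, len, off);
    return (n == (ssize_t)len) ? 1 : -1;
}
```

With a 512-byte block, a 12-byte write at offset 500 passes the check, while a 13-byte write at the same offset reaches offset 512 and is rejected.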
Re: Idea for reducing disk IO on tagging operations
Dr. David Alan Gilbert wrote:

> 2) I could do with a better understanding of the directory locks; pointers? I've read the top of lock.c but it still doesn't tell me enough; for example there seem to be multiple lock files used - but then surely the creation of them isn't atomic? Or is there one lock file used for both reading and writing?

The locking process is explained in the manual, at https://www.cvshome.org/docs/manual/cvs-1.11.19/cvs_2.html#SEC17

--
Jim
Re: Idea for reducing disk IO on tagging operations
Todd Denniston [EMAIL PROTECTED] writes:

> This reminds me of conversations held earlier in the list. I think several of them ended with something to the effect of 'putting the /tmp/ or LockDir which cvs uses on a RAM disk should make the whole thing _much_ faster'.

Yes. Our testing also found that using the SAN was faster than the Solaris 9 RAM disk solution. So, that is what we are using these days.

If anyone is really serious about tuning and improving the performance of their own CVS installation, there should be nothing stopping them from tweaking the sources for experimentation and running their own tests.

If you instrument CVS and find a particular hotspot and then find a way to make that area more efficient without hurting the rest of the system, let the bug-cvs@gnu.org list know your results and your patch, and you will likely find changes considered for inclusion in future versions of CVS.

-- Mark
Re: Idea for reducing disk IO on tagging operations
> Tagging in particular is slow and I don't think cpu or ram is the issue (it is a dual xeon with 3GB of RAM). ...

I'll be shot as a heretic, but the real solution is that tags don't belong in the ,v files in the first place. IMO the only useful purpose of tags is to snapshot the entire code base in some way so that you can roll back to it (or diff against it). Tagging individual files doesn't do anything to help you understand their relation over time to the rest of the code base.

The entire system tag could be accomplished in O(# files) time (rather than O(repository disk size)) by simply creating a manifest of each file in the repository and its version at the time of the snapshot. The snapshots/manifests become entities in their own right, so you should be able to do things like list available snapshots, see when they were created, add meta information to the snapshot ...

-Tony Aiuto
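Tony's manifest idea can be sketched in a few lines of C. The struct layout, the tab-separated line format and the function names are all assumptions made for illustration; nothing like this exists in CVS as discussed in this thread.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One line of a snapshot manifest (hypothetical layout). */
struct manifest_entry {
    const char *path;      /* repository-relative file name */
    const char *revision;  /* revision of that file at snapshot time */
};

/* Write one "path<TAB>revision" line per file. Returns the number of
 * lines written, or -1 on an output error. The cost grows with the
 * number of files, not with the size of the ,v files. */
int write_manifest(FILE *out, const struct manifest_entry *files, size_t n)
{
    assert(out != NULL);
    for (size_t i = 0; i < n; i++) {
        if (fprintf(out, "%s\t%s\n", files[i].path, files[i].revision) < 0)
            return -1;
    }
    return (int)n;
}

/* Look up the revision recorded for `path`, or NULL if it is not listed. */
const char *manifest_lookup(const struct manifest_entry *files, size_t n,
                            const char *path)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(files[i].path, path) == 0)
            return files[i].revision;
    }
    return NULL;
}
```

Creating a snapshot then means writing one small file instead of touching every RCS file, at the price of keeping tag information outside the ,v files, where existing RCS-based tools would not see it.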
Idea for reducing disk IO on tagging operations
Hi,

I maintain a system that is used to hold a rather large CVS repository (~1GB give or take) which could do with being faster. Tagging in particular is slow, and I don't think cpu or ram is the issue (it is a dual xeon with 3GB of RAM).

My suspicion is that at least one of the problems is that when a tag is added most of the RCS files are rewritten, giving a sudden large amount of data that must be written to disc.

So - here are my questions/ideas - I'd appreciate comments to tell me whether I'm on the right lines:

1) As I understand it, the tag data is the first of the 3 main data structures in the RCS file (tag, comments, diffs), and when I do pretty much any CVS operation I rewrite the whole file - is this correct?

2) White space appears to be irrelevant in RCS files, so adding arbitrary amounts in between sections should leave files still fully compatible with existing RCS/cvs tools.

3) So the idea is that when I add a tag I add a bunch of white space after the tag (let's say 1KB of spaces split into 64 byte lines or similar). When I come to add the next tag I check if there is plenty of white space; if there is, then instead of rewriting the file I just overwrite the white space with my new tag data. If there is no space, then as I rewrite the file I add another lump of white space.

4) Whether dummy white space is added, and how much, is controlled by the existing size of the RCS file; so an RCS file that is only a few KB won't have any space added. That way this mechanism doesn't slow down/bloat small repositories. The amount of white space might be chosen to align data structures with disk block boundaries.

5) My main concern is to do with concurrency/consistency requirements; is the file rewrite essential to ensure consistency, or is the locking that is carried out sufficient?

Does this make sense?

Dave
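Points 3 and 4 of the proposal (1KB of spaces in 64-byte lines, and no padding for small files) can be sketched as follows. The 16KB threshold and all names are assumptions chosen for illustration; the proposal only says that files of "a few KB" should get no padding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    PAD_LINE = 64,              /* bytes per padding line, newline included */
    PAD_TOTAL = 1024,           /* total padding added per rewrite */
    MIN_RCS_SIZE = 16 * 1024    /* assumed cut-off: smaller files get no padding */
};

/* How many padding bytes to add when rewriting an RCS file of this size. */
size_t padding_for(size_t rcs_size)
{
    return rcs_size < (size_t)MIN_RCS_SIZE ? 0 : (size_t)PAD_TOTAL;
}

/* Fill out[0..n) with spaces, ending every PAD_LINE-byte line with '\n'.
 * n must be a multiple of PAD_LINE. */
void fill_padding(char *out, size_t n)
{
    assert(n % PAD_LINE == 0);
    memset(out, ' ', n);
    for (size_t i = PAD_LINE - 1; i < n; i += PAD_LINE)
        out[i] = '\n';
}
```

Splitting the padding into short lines keeps the file friendly to line-oriented tools, and a size threshold means a repository of many tiny files does not grow by 1KB per file.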
Re: Idea for reducing disk IO on tagging operations
Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

> So - here are my questions/ideas - I'd appreciate comments to tell me whether I'm on the right lines:
>
> 1) As I understand it the tag data is the first of the 3 main data structures in the RCS file (tag, comments, diffs) and that when I do pretty much any CVS operation I rewrite the whole file - is this correct?

CVS write operations on a foo.c,v repository file will write ,foo.c, and then, when the write operation has completed without any errors, do a

rename (",foo.c,", "foo.c,v");

to make the new version the official version. While the ,foo.c, file exists, RCS commands will consider the file locked.

It is desirable to use RCS write semantics, as many other tools out there (cf. ViewCVS) use RCS on the repository and want to obey RCS locking.

> 2) White space appears to be irrelevant in RCS files; so adding arbitrary amounts in between sections should leave files still fully compatible with existing RCS/cvs tools.

Tools such as CVSup by default will canonicalize the whitespace between sections (although this may be configured). So, yes, whitespace is mostly irrelevant between sections.

> 3) So the idea is that when I add a tag I add a bunch of white space after the tag (lets say 1KB of spaces split into 64 byte lines or similar); when I come to add the next tag I check if there is plenty of white space, if there is then instead of rewriting the file I just overwrite the white space with my new tag data; if there is no space then as I rewrite the file I add another lump of white space.

This has the potential to more easily corrupt the RCS file if the operation is interrupted for any reason.

> 4) Whether dummy white space is added and how much is controlled by the existing size of the RCS file; so an RCS file that is only a few KB wont have any space added; that way this mechanism doesn't slow down/bloat small repositories. The amount of white space might be chosen to align data structures with disk block boundaries.
>
> 5) My main concern is to do with concurrency/consistency requirements; is the file rewrite essential to ensure consistency, or is the locking that is carried out sufficient?
>
> Does this make sense?

It would be more robust to enhance CVS to use an external database for tagging information, instead of putting the tagging information into the RCS files directly, than to rewrite parts of the RCS file and hope that the operation didn't corrupt the file along the way.

You may wish to consider looking at Meta-CVS, as I believe that Kaz keeps a lot of the branching information outside of the RCS files already. See http://users.footprints.net/~kaz/mcvs.html for more details on Meta-CVS.

Good luck,
-- Mark
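The write-then-rename protocol Mark describes can be sketched in C with POSIX calls. This is an illustrative reduction, not the CVS or RCS implementation: the function name is hypothetical, and the real code does considerably more (signal handling, permission copying, and so on).

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace final_path (e.g. "foo.c,v") with `data` by first writing
 * tmp_path (e.g. ",foo.c,") and then renaming it into place.
 * O_EXCL makes creation fail if tmp_path already exists, so the
 * temporary file doubles as a lock. Returns 0 on success, -1 on failure. */
int rewrite_via_rename(const char *tmp_path, const char *final_path,
                       const char *data, size_t len)
{
    assert(tmp_path && final_path && data);
    /* Mode 0444: the finished RCS file is read-only; this descriptor
     * remains writable because it was opened with O_WRONLY. */
    int fd = open(tmp_path, O_CREAT | O_EXCL | O_WRONLY, 0444);
    if (fd < 0)
        return -1;               /* another writer holds the "lock" */
    if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    if (close(fd) != 0 || rename(tmp_path, final_path) != 0) {
        unlink(tmp_path);
        return -1;
    }
    return 0;
}
```

If the process dies before the rename, the original file is untouched and only the comma file is left behind, which is why this scheme is robust against interruption and why overwriting the ,v file in place gives up that property.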
Re: Idea for reducing disk IO on tagging operations
[Resend: I sent it with the wrong 'from' address - apologies if you get both]

* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Hi Mark,
Thanks for your reply.

> Dr. David Alan Gilbert [EMAIL PROTECTED] writes:
>
> > So - here are my questions/ideas - I'd appreciate comments to tell me whether I'm on the right lines:
> >
> > 1) As I understand it the tag data is the first of the 3 main data structures in the RCS file (tag, comments, diffs) and that when I do pretty much any CVS operation I rewrite the whole file - is this correct?
>
> CVS write operations on a foo.c,v repository file will write ,foo.c, and then when the write operation is successful and without any errors, it does a rename (",foo.c,", "foo.c,v"); to make the new version the official version. While the ,foo.c, file exists, RCS commands will consider the file locked.
>
> It is desirable to use RCS write semantics as many other tools out there (cf. ViewCVS) use RCS on the repository and want to obey RCS locking.

OK, if I create a dummy ,foo.c, before modifying (or create a hardlink with that name to foo.c,v?), would that be sufficient? Or perhaps create the ,foo.c, as I normally would - but if I can use this overwrite trick on the original then I just delete the ,foo.c, file.

Is the problem that things are allowed to read the original foo.c,v while you are creating the new version?

> [...] be configured). So, yes, whitespace is mostly irrelevant between sections.

Great.

> > 3) So the idea is that when I add a tag I add a bunch of white space after the tag (lets say 1KB of spaces split into 64 byte lines or similar); when I come to add the next tag I check if there is plenty of white space, if there is then instead of rewriting the file I just overwrite the white space with my new tag data; if there is no space then as I rewrite the file I add another lump of white space.
>
> This has the potential to more easily corrupt the RCS file if the operation is interrupted for any reason.

The rewrite that adds the extra space would be performed using the existing mechanism (with just some extra padding created in RCS_rewrite), so that can't be a problem.

So the issue is what happens if the interrupt occurs as I'm overwriting the white space to add a tag. Hmm, yes; is it possible to guard against this by using a single call to write(2) for that? Is that the problem you are thinking of?

> It would be more robust to enhance CVS to use an external database for tagging information instead of putting the tagging information into the RCS files directly than to rewrite parts of the RCS file and hope that the operation didn't corrupt the file along the way.

Sure, separating the tagging data out is much neater; but what I was looking for here was a simple speed-up which didn't require anything extra and would be fully compatible with existing tools.

> You may wish to consider looking at Meta-CVS as I believe that Kaz keeps a lot of the branching information outside of the RCS files already. See http://users.footprints.net/~kaz/mcvs.html for more details on Meta-CVS.

If I was changing to another tool then I'd have a much larger set of tools to consider (e.g. subversion), but I'd rather stick with plain CVS if I can - I've got clients on lots of (weird) OSs that work via pserver and an infinite number of scripts built around CVS.

Thanks for the reply,

Dave
Re: Idea for reducing disk IO on tagging operations
Everything that Mark says is true. I'll add that some shops optimize their read operations under certain conditions, and such optimizations would break if the RCS files are updated in-place.

What happens is that, if the version of every file can be identified in advance (using version number, tag, or branch/timestamp pair) then they can invoke RCS directly to fetch file versions, read metadata, and so on. This sidesteps CVS' overhead and can increase performance by as much as 50%. Such operations will also succeed and not interfere with write operations to the repository, such as commits and the creation of new tags. Moving tags or using cvs admin may sometimes cause race conditions that produce incorrect results, but that all depends on the nature of the changes being made at the time and how the readable versions have been identified.

The reason that such an optimization works is that RCS writes the updated RCS file into the lock file, filesystem semantics always keep the complete RCS file intact while it's being read, and pre-existing data in the RCS file are not changed during write operations (except for those race conditions I've identified above, which can be avoided).

On Mar 20, 2005, at 8:28 AM, [EMAIL PROTECTED] wrote:

Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

So - here are my questions/ideas - I'd appreciate comments to tell me whether I'm on the right lines:

1) As I understand it the tag data is the first of the 3 main data structures in the RCS file (tag, comments, diffs) and that when I do pretty much any CVS operation I rewrite the whole file - is this correct?

CVS write operations on a foo.c,v repository file will write ,foo.c, and then when the write operation is successful and without any errors, it does a rename (,foo.c,, foo.c,v); to make the new version the official version. While the ,foo.c, file exists, RCS commands will consider the file locked.

It is desirable to use RCS write semantics as many other tools out there (cf. ViewCVS) use RCS on the repository and want to obey RCS locking.

2) White space appears to be irrelevant in RCS files; so adding arbitrary amounts in between sections should leave files still fully compatible with existing RCS/cvs tools.

Tools such as CVSup by default will canonicalize the whitespace between sections (although this may be configured). So, yes, whitespace is mostly irrelevant between sections.

3) So the idea is that when I add a tag I add a bunch of white space after the tag (let's say 1KB of spaces split into 64 byte lines or similar); when I come to add the next tag I check if there is plenty of white space, if there is then instead of rewriting the file I just overwrite the white space with my new tag data; if there is no space then as I rewrite the file I add another lump of white space.

This has the potential to more easily corrupt the RCS file if the operation is interrupted for any reason.

4) Whether dummy white space is added and how much is controlled by the existing size of the RCS file; so an RCS file that is only a few KB won't have any space added; that way this mechanism doesn't slow down/bloat small repositories. The amount of white space might be chosen to align data structures with disk block boundaries.

5) My main concern is to do with concurrency/consistency requirements; is the file rewrite essential to ensure consistency, or is the locking that is carried out sufficient?

Does this make sense?

It would be more robust to enhance CVS to use an external database for tagging information instead of putting the tagging information into the RCS files directly than to rewrite parts of the RCS file and hope that the operation didn't corrupt the file along the way.

You may wish to consider looking at Meta-CVS as I believe that Kaz keeps a lot of the branching information outside of the RCS files already. See http://users.footprints.net/~kaz/mcvs.html for more details on Meta-CVS.

Good luck,
-- Mark

--
Paul Sander | When a true genius appears in the world, you may
[EMAIL PROTECTED] | know him by this sign: that all the dunces are in
| confederacy against him. -- Jonathan Swift, writer.
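The write-then-rename protocol Mark describes in the quoted text above (write ,foo.c, and rename it over foo.c,v once the write has succeeded) can be sketched roughly as follows. This is a minimal Python illustration for a POSIX system, not code from CVS or RCS; the function name rewrite_with_comma_lock is hypothetical, and the exclusive-create flag and read-only file mode are assumptions about how such a lock would typically be taken.

```python
import os


def rewrite_with_comma_lock(rcs_path: str, new_contents: bytes) -> None:
    """Replace rcs_path by writing a ,name, file and renaming it into place."""
    directory, name = os.path.split(rcs_path)
    # foo.c,v -> ,foo.c,  (the naming used in the thread)
    base = name[:-2] if name.endswith(",v") else name
    lock_path = os.path.join(directory, "," + base + ",")

    # O_EXCL makes creation fail if another writer already holds the lock.
    # Mode 0o444 is an assumption: RCS files are conventionally read-only.
    fd = os.open(lock_path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o444)
    try:
        with os.fdopen(fd, "wb") as out:
            out.write(new_contents)
            out.flush()
            os.fsync(out.fileno())
    except BaseException:
        # Recovery is just "delete the comma file"; the ,v file is untouched.
        os.unlink(lock_path)
        raise
    # rename() is atomic on POSIX filesystems: a reader sees either the old
    # file or the new one, never a mixture.
    os.replace(lock_path, rcs_path)
```

The point of the sketch is that the original ,v file is never modified in place, which is the property the external read-only tools mentioned in this thread rely on.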
Re: Idea for reducing disk IO on tagging operations
Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Hi Mark, Thanks for your reply.

Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

So - here are my questions/ideas - I'd appreciate comments to tell me whether I'm on the right lines:

1) As I understand it the tag data is the first of the 3 main data structures in the RCS file (tag, comments, diffs) and that when I do pretty much any CVS operation I rewrite the whole file - is this correct?

CVS write operations on a foo.c,v repository file will write ,foo.c, and then when the write operation is successful and without any errors, it does a rename (,foo.c,, foo.c,v); to make the new version the official version. While the ,foo.c, file exists, RCS commands will consider the file locked.

It is desirable to use RCS write semantics as many other tools out there (cf. ViewCVS) use RCS on the repository and want to obey RCS locking.

OK, if I create a dummy ,foo.c, before modifying (or create a hardlink with that name to foo.c,v ?) would that be sufficient?

I would say that it is likely necessary, but may not be sufficient.

Or perhaps create the ,foo.c, as I normally would - but if I can use this overwrite trick on the original then I just delete the ,foo.c, file.

I am unclear how this lets you perform a speedup. Is the problem that things are allowed to read the original foo.c,v while you are creating the new version? I am given to understand that many of the ancillary tools that surround CVS make use of being able to read a consistent ,v file at all times.

3) So the idea is that when I add a tag I add a bunch of white space after the tag (let's say 1KB of spaces split into 64 byte lines or similar); when I come to add the next tag I check if there is plenty of white space, if there is then instead of rewriting the file I just overwrite the white space with my new tag data; if there is no space then as I rewrite the file I add another lump of white space.

This has the potential to more easily corrupt the RCS file if the operation is interrupted for any reason.

The rewrite that adds the extra space would be performed using the existing mechanism (with just some extra space added in RCS_rewrite); so that can't be a problem.

Adding extra data to the ,foo.c, file during the normal write operation should not be a problem.

So the issue is what happens if the interrupt occurs as I'm overwriting the white space to add a tag; hmm yes;

Correct. Depending on the filesystem kind and the level of I/O, your rewrite could impact up to three fileblocks and the directory data.

is it possible to guard against this by using a single call to write(2) for that?

Not for all possible filesystem types.

Is that the problem you are thinking of?

Yes. Even worse things can happen in this regard if the filesystem is a 'stateless' one such as an NFS mounted directory (we keep advising folks against using them, but I know for a fact that they are still used).

It would be more robust to enhance CVS to use an external database for tagging information instead of putting the tagging information into the RCS files directly than to rewrite parts of the RCS file and hope that the operation didn't corrupt the file along the way.

Sure, separating the tagging data out is much neater; but what I was looking for here was a simple speed up which didn't require anything extra and would be fully compatible with existing tools.

And you are finding that existing tools torture the assumptions you are able to make about the CVS repository.

FWIW: (In my personal experience) using a SAN solution for your repository storage allows you much better throughput for all write operations in the general case as the SAN can guarantee the writes are okay before the disk actually does it.

Optimizing for tagging does not seem very useful to me as we typically do not drop that many tags on our repository.

You may wish to consider looking at Meta-CVS as I believe that Kaz keeps a lot of the branching information outside of the RCS files already. See http://users.footprints.net/~kaz/mcvs.html for more details on Meta-CVS.

If I was changing to another tool then I'd have a much larger set of tools to consider (e.g. subversion) but I'd rather stick with plain CVS if I can - I've got clients on lots of (weird) OSs that work via pserver and an infinite number of scripts built around CVS.

Indeed. Part of the difficulty with CVS development has been worrying about legacy software assumptions.

-- Mark
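For readers trying to picture the in-place overwrite being debated above, here is a minimal Python sketch for a POSIX system. It is not taken from any CVS patch: the function name overwrite_padding is hypothetical, and the caller is assumed to already know the byte offset of the whitespace padding. It uses a single pwrite() call, which, as Mark points out, is not guaranteed to be atomic on every filesystem, so the corruption risk on interruption remains.

```python
import os


def overwrite_padding(path: str, offset: int, data: bytes) -> bool:
    """Overwrite len(data) bytes at offset, but only if they are all whitespace.

    Returns False, leaving the file untouched, when the region is not pure
    padding; the caller would then fall back to the normal full rewrite.
    """
    fd = os.open(path, os.O_RDWR)
    try:
        current = os.pread(fd, len(data), offset)
        if len(current) != len(data) or current.strip() != b"":
            return False
        # One positioned write; the file length never changes. A crash or
        # short write here can still leave the file inconsistent.
        if os.pwrite(fd, data, offset) != len(data):
            raise OSError("short write; file may now be inconsistent")
        os.fsync(fd)
        return True
    finally:
        os.close(fd)
```

Note that nothing in this sketch protects a concurrent reader from seeing a half-written tag, which is the objection raised about tools that read the ,v file directly.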
Re: Idea for reducing disk IO on tagging operations
* Paul Sander ([EMAIL PROTECTED]) wrote:

Hi Paul, Thanks for the reply,

Everything that Mark says is true. I'll add that some shops optimize their read operations under certain conditions, and such optimizations would break if the RCS files are updated in-place. What happens is that, if the version of every file can be identified in advance (using version number, tag, or branch/timestamp pair) then they can invoke RCS directly to fetch file versions, read metadata, and so on. This sidesteps CVS' overhead and can increase performance by as

So are these tricks *never* performed by cvs itself? i.e. would my trick (if I can solve the interrupted write case) be completely safe with any use of cvs as long as you didn't access the files externally?

Dave
-Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K | Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC HPPA | In Hex /
\ _|_ http://www.treblig.org |___/
Re: Idea for reducing disk IO on tagging operations
* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

OK, if I create a dummy ,foo.c, before modifying (or create a hardlink with that name to foo.c,v ?) would that be sufficient?

I would say that it is likely necessary, but may not be sufficient.

Hmm ok.

Or perhaps create the ,foo.c, as I normally would - but if I can use this overwrite trick on the original then I just delete the ,foo.c, file.

I am unclear how this lets you perform a speedup.

I only create the ,foo.c, file - I don't write anything into it; the existence of the file is enough to act as the RCS lock; if I can do my in-place modification then I delete this file after doing it, if not then I proceed as normal and just write the ,foo.c, file and do the rename as you normally would.

Is the problem that things are allowed to read the original foo.c,v while you are creating the new version? I am given to understand that many of the ancillary tools that surround CVS make use of being able to read a consistent ,v file at all times.

This is very tricky; I don't think in our case we use any such tools (we might have a cvs/web thing for browsing it, but this is probably not critical); and as long as I can guarantee what I do is safe as far as CVS itself is concerned I think I'd be prepared to go for it as a configurable mechanism.

So the issue is what happens if the interrupt occurs as I'm overwriting the white space to add a tag; hmm yes;

Correct. Depending on the filesystem kind and the level of I/O, your rewrite could impact up to three fileblocks and the directory data.

is it possible to guard against this by using a single call to write(2) for that?

Not for all possible filesystem types.

Is that the problem you are thinking of?

Yes. Even worse things can happen in this regard if the filesystem is a 'stateless' one such as an NFS mounted directory (we keep advising folks against using them, but I know for a fact that they are still used).

OK, my conscience will let me carefully ignore NFS issues given the pain it causes me elsewhere (and I make my mechanism switchable). What happens if I only used the overwrite mechanism if none of the characters being modified crossed a 512 (e.g.) byte boundary offset in the file? Since the spaces were actually written in a previous operation we can assume that the space is allocated and no allocation operation is going to happen at this point (mumble filesystem journalling mumble!).

Sure, separating the tagging data out is much neater; but what I was looking for here was a simple speed up which didn't require anything extra and would be fully compatible with existing tools.

And you are finding that existing tools torture the assumptions you are able to make about the CVS repository.

Nod; it is quite painful!

FWIW: (In my personal experience) using a SAN solution for your repository storage allows you much better throughput for all write operations in the general case as the SAN can guarantee the writes are okay before the disk actually does it.

But when you throw a GB of writes at them in a short time from a tag across our whole repository they aren't going to be happy - they are going to want to get rid of that backlog of write data ASAP.

Optimizing for tagging does not seem very useful to me as we typically do not drop that many tags on our repository.

In the company I work for we are very tag heavy, but more importantly it is the tagging that gets in people's way and places the strain on the write bandwidth of the discs/RAID.

Dave
-Open up your eyes, open up your mind, open up your code ---
/ Dr. David Alan Gilbert | Running GNU/Linux on Alpha,68K | Happy \
\ gro.gilbert @ treblig.org | MIPS,x86,ARM,SPARC,PPC HPPA | In Hex /
\ _|_ http://www.treblig.org |___/
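The 512-byte boundary test Dave proposes in the message above comes down to a little integer arithmetic. A minimal Python sketch follows; the function name stays_within_block is hypothetical, and 512 is only the example block size mentioned in the message, not a value any filesystem guarantees.

```python
def stays_within_block(offset: int, length: int, block_size: int = 512) -> bool:
    """True if bytes [offset, offset + length) all lie in one block."""
    if length <= 0:
        return True
    first_block = offset // block_size
    last_block = (offset + length - 1) // block_size
    return first_block == last_block
```

For example, a 12-byte tag written at offset 500 ends at byte 511 and stays inside the first block, while a 13-byte tag at the same offset spills into the second block, so under this rule it would fall back to the full rewrite.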
Re: Idea for reducing disk IO on tagging operations
On Mar 20, 2005, at 3:54 PM, [EMAIL PROTECTED] wrote:

* Mark D. Baushke ([EMAIL PROTECTED]) wrote:

Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

OK, if I create a dummy ,foo.c, before modifying (or create a hardlink with that name to foo.c,v ?) would that be sufficient?

I would say that it is likely necessary, but may not be sufficient.

Hmm ok.

Or perhaps create the ,foo.c, as I normally would - but if I can use this overwrite trick on the original then I just delete the ,foo.c, file.

I am unclear how this lets you perform a speedup.

I only create the ,foo.c, file - I don't write anything into it; the existence of the file is enough to act as the RCS lock; if I can do my in-place modification then I delete this file after doing it, if not then I proceed as normal and just write the ,foo.c, file and do the rename as you normally would.

You're forgetting something: The RCS commands will complete read-only operations on RCS files even in the presence of the comma files owned by other processes. Your update protocol introduces race conditions in which the RCS file is not self-consistent at all times.

There's also the interrupt issue: Killing an update before it completes leaves the RCS file corrupt. You'd have to build in some kind of crash recovery. But RCS already has that by way of the comma file, which can simply be deleted. Other crash recovery algorithms usually involve transaction logs that can be reversed and replayed, or the creation of backup copies. None of these are more efficient than the existing RCS update protocol.

So the issue is what happens if the interrupt occurs as I'm overwriting the white space to add a tag; hmm yes;

Correct. Depending on the filesystem kind and the level of I/O, your rewrite could impact up to three fileblocks and the directory data.

is it possible to guard against this by using a single call to write(2) for that?

Not for all possible filesystem types.

You'd have to guarantee that the write is atomic and flushes results completely to disk, even in the presence of things like power failures. It's hard to make this guarantee given all the buffering that goes on below the write(2) API.

Optimizing for tagging does not seem very useful to me as we typically do not drop that many tags on our repository.

In the company I work for we are very tag heavy, but more importantly it is the tagging that gets in people's way and places the strain on the write bandwidth of the discs/RAID.

I once built a successful system that tracked desirable configurations by building lists of file/version pairs, then committing and tagging the lists. The lists were built by polling the Entries files in workspaces (and making sure there were no uncommitted changes). This was fast and efficient, and it opens you up to use the optimization I mentioned earlier. And if you rely on floating tags, such lists could track the history of the tags as well.

In addition, an algebra can be easily written to manipulate such lists. Combine this with a way to link these lists with your defect tracking system, and you have the tools to build a very good change control system.

--
Paul Sander | Lets stick to the new mistakes and get rid of the old
[EMAIL PROTECTED] | ones -- William Brown
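The file/version lists Paul describes in the message above could be built along these lines. This is only a Python sketch under stated assumptions, not his system: the names parse_entries and changed are hypothetical, it relies on the usual CVS/Entries line layout (/name/revision/timestamp/options/tagdate), and it does not perform the check for uncommitted changes that he mentions.

```python
from typing import Dict, Optional, Tuple


def parse_entries(text: str) -> Dict[str, str]:
    """Map file name -> revision from the text of a CVS/Entries file."""
    versions: Dict[str, str] = {}
    for line in text.splitlines():
        # File entries start with "/"; directory entries start with "D"
        # and are skipped here.
        if not line.startswith("/"):
            continue
        fields = line.split("/")
        if len(fields) < 3:
            continue  # malformed line, ignore
        versions[fields[1]] = fields[2]
    return versions


def changed(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """One possible 'algebra' operation: files whose revision differs."""
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }
```

Because a configuration is just a mapping from file name to revision, other operations such as union or intersection of configurations reduce to ordinary dictionary and set manipulation.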
Re: Idea for reducing disk IO on tagging operations
Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

* Paul Sander ([EMAIL PROTECTED]) wrote:

Hi Paul, Thanks for the reply,

Everything that Mark says is true. I'll add that some shops optimize their read operations under certain conditions, and such optimizations would break if the RCS files are updated in-place. What happens is that, if the version of every file can be identified in advance (using version number, tag, or branch/timestamp pair) then they can invoke RCS directly to fetch file versions, read metadata, and so on. This sidesteps CVS' overhead and can increase performance by as

So are these tricks *never* performed by cvs itself?

Never? Hmmm... well, the CVS from cvshome.org will not read a foo.c,v file while the CVS read-lock or a CVS write-lock is owned by another process.

The real problem is dealing with filesystem errors while RCS is updating the ,v file. I would not trust that the RCS write manipulations will always fail in a safe manner.

i.e. would my trick (if I can solve the interrupted write case) be completely safe with any use of cvs as long as you didn't access the files externally?

I am not able to say that it would ever be 'completely safe' to do as you suggest. You would need to greatly harden the failure paths of CVS to ensure that the file being modified is not just discarded in the event of a filesystem error by CVS itself. I would not wish to attempt to do it myself.

-- Mark
Re: Idea for reducing disk IO on tagging operations
Paul Sander [EMAIL PROTECTED] writes:

I only create the ,foo.c, file - I don't write anything into it; the existence of the file is enough to act as the RCS lock; if I can do my in-place modification then I delete this file after doing it, if not then I proceed as normal and just write the ,foo.c, file and do the rename as you normally would.

You're forgetting something: The RCS commands will complete read-only operations on RCS files even in the presence of the comma files owned by other processes. Your update protocol introduces race conditions in which the RCS file is not self-consistent at all times.

Actually, if you look closely, I believe that CVS will not do read-only RCS operations if a CVS write-lock exists for the directory. Of course, ViewCVS and CVSweb do it all the time as do many of the other add-ons.

There's also the interrupt issue: Killing an update before it completes leaves the RCS file corrupt. You'd have to build in some kind of crash recovery. But RCS already has that by way of the comma file, which can simply be deleted. Other crash recovery algorithms usually involve transaction logs that can be reversed and replayed, or the creation of backup copies. None of these are more efficient than the existing RCS update protocol.

Agreed. This is a very big deal.

Dr. David Alan Gilbert [EMAIL PROTECTED] writes:

FWIW: (In my personal experience) using a SAN solution for your repository storage allows you much better throughput for all write operations in the general case as the SAN can guarantee the writes are okay before the disk actually does it.

But when you throw a GB of writes at them in a short time from a tag across our whole repository they aren't going to be happy - they are going to want to get rid of that backlog of write data ASAP.

I believe you will find that the performance knee for a commercial SAN that is well provisioned happens when you hit 2GB of sustained writes. You are more likely to run into problems with bandwidth to the Fibre Channel mesh first.

For us, I seem to recall that the actual bottleneck is the creation of the /tmp/cvs-server$$ trees for a 'cvs tag' operation. So, your results will also depend on how shallow or deep your module hierarchy runs.

Optimizing for tagging does not seem very useful to me as we typically do not drop that many tags on our repository.

In the company I work for we are very tag heavy, but more importantly it is the tagging that gets in people's way and places the strain on the write bandwidth of the discs/RAID.

Sure, a conventional RAID can be very expensive to rewrite all of the files.

It is certainly possible that a close look at CVS performance bottlenecks may find some places where improvements in throughput could be gained. However, I am not at all certain that your particular suggestion would be the best use of optimization time.

Enjoy!
-- Mark