On Tue, Dec 17, 2013 at 7:47 PM, Eric S. Raymond <e...@thyrsus.com> wrote:
> I'm working with Alan Barret now on trying to convert the NetBSD
> repositories. They break cvs-fast-export through sheer bulk of
> metadata, by running the machine out of core. This is exactly
> the kind of huge case that you're talking about.
>
> Alan and I are going to take a good hard whack at modifying cvs-fast-export
> to make this work. Because there really aren't any feasible alternatives.
> The analysis code in cvsps was never good enough. cvs2git, being written
> in Python, would hit the core limit faster than anything written in C.

Depends on how it organizes its data structures. Have you actually
tried running cvs2git on it? I'm not saying you are wrong, but I had
similar problems with my custom converter (also written in Python), and
solved them by adding multiple passes/phases instead of trying to do
too much work in fewer passes. In the end I ended up storing the
largest inter-phase data structures outside of Python (sqlite in my
case) to save memory. Obviously it cost a lot in runtime, but it meant
that I could actually chew through our largest CVS modules without
running out of memory.

> It is certainly the case that a sufficiently large CVS repo will break
> anything, like a star with a mass over the Chandrasekhar limit becoming a
> black hole :-)

:) True, although it's not the sheer size of the files themselves that
is the actual problem. Most of those bytes are (deltified) file data,
which you can pretty much stream through and convert to a corresponding
fast-export stream of blob objects. The code for that should be fairly
straightforward (and should also be eminently parallelizable, given
enough cores and available I/O), resulting in a table mapping CVS
file:revision pairs to corresponding Git blob SHA1s, and an
accompanying (set of) packfile(s) holding said blobs.

The hard part comes when trying to correlate the metadata for all the
per-file revisions, and distill that into a consistent sequence/DAG of
changesets/commits across the entire CVS repo. And then, of course,
trying to fit all the branches and tags into that DAG of commits is
what really drives you mad... ;-)

> The question is how common such supermassive cases are. My own guess is that
> the *BSD repos and a handful of the oldest GNU projects are pretty much the
> whole set; everybody else converted to Subversion within the last decade.

You may be right. At least for the open-source cases. I suspect there's
still a considerable number of huge CVS repos within companies'
walls...
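
(To make the blob-table idea above a bit more concrete, here is a rough
sketch, not code from my converter or from cvs2git. It assumes the
revision contents have already been un-deltified by some earlier pass,
and the table layout and function names are made up for illustration.
It computes the Git blob SHA1 for each CVS file:revision and keeps the
mapping in sqlite rather than in Python objects.)

```python
import hashlib
import sqlite3


def git_blob_sha1(data: bytes) -> str:
    """Return the object id Git assigns to a blob with this content."""
    # Git hashes the header "blob <size>\0" followed by the raw content.
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def open_blob_table(db_path=":memory:"):
    """Open (or create) the hypothetical file:revision -> SHA1 table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blobs ("
        " path TEXT NOT NULL,"
        " revision TEXT NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " PRIMARY KEY (path, revision))"
    )
    return conn


def record_revisions(conn, revisions):
    """Store one row per (path, revision, content) triple.

    `revisions` may be a generator, so contents are streamed through
    one at a time and never all held in memory together.
    """
    with conn:  # commit once for the whole batch
        conn.executemany(
            "INSERT OR REPLACE INTO blobs VALUES (?, ?, ?)",
            ((path, rev, git_blob_sha1(data)) for path, rev, data in revisions),
        )


def lookup(conn, path, revision):
    """Return the blob SHA1 for a file:revision pair, or None if unknown."""
    row = conn.execute(
        "SELECT sha1 FROM blobs WHERE path = ? AND revision = ?",
        (path, revision),
    ).fetchone()
    return row[0] if row else None
```

(A later pass could then call lookup() while building commits, with a
file path passed to open_blob_table() instead of the in-memory default.
Actually writing the blobs into a packfile is left out here.)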

> I find the very idea of writing anything that encourages
> non-history-correct conversions disturbing and want no part of it.
>
> Which matters, because right now the set of people working on CVS lifters
> begins with me and ends with Michael Rafferty (cvs2git),

s/Rafferty/Haggerty/?

> who seems even
> less interested in incremental conversion than I am. Unless somebody
> comes out of nowhere and wants to own that problem, it's not going
> to get solved.

Agreed. It would be nice to have something to point to for people that
want something similar to git-svn for CVS, but without a motivated
owner, it won't happen.

...Johan

-- 
Johan Herland, <jo...@herland.net>
www.herland.net