On Thu, Dec 19, 2013 at 10:31 AM, Michael Haggerty <mhag...@alum.mit.edu> wrote:
> On 12/19/2013 02:11 AM, Johan Herland wrote:
>> On Thu, Dec 19, 2013 at 12:44 AM, Michael Haggerty <mhag...@alum.mit.edu> wrote:
>>> A correct incremental converter could be done (as long as the CVS users
>>> don't literally change history retroactively) but it would be a lot of work.
>> Although I agree with that sentence as it is stated, I also believe
>> that the parenthesized condition rules out a _majority_ of CVS repos of
>> non-trivial size/history. So even though a correct incremental
>> converter could be built, it would be pretty much useless if it did
>> not gracefully handle rewritten history. And in the face of rewritten
>> history it becomes pretty much impossible to define what a "correct"
>> conversion should even look like (not to mention the difficulty of
>> actually implementing that converter...).
> A correct conversion would, conceptually, take a diff between the old
> CVS history and the new CVS history (I'm talking about the history as a
> whole, not a diff between two changesets), figure out what had changed,
> and then figure out what Git commits to make to effect the same
> conceptual changes in Git-land.
> This means that the final Git history would have to depend not only on
> the current entirety of the CVS history, but also on what the CVS
> history *was* during previous incremental imports and how the tool chose
> to represent that history in Git in the previous rounds.
> There is a tradeoff here. The smarter the tool is, the fewer
> restrictions would have to be made on what people can do in CVS. For
> example, it wouldn't be unreasonable to impose a rule that people are
> not allowed to move files within the CVS repository (e.g., to fake
> move-file-with-history) after the CVS <-> Git bridge is in use. (Abuses
> of the history that occurred *before* the first incremental conversion,
> on the other hand, wouldn't be a problem.) If the user of the
> incremental tool has *no* influence on how his colleagues use CVS, then
> the tool would have to be very smart and/or the user might
> sometimes be forced to do another from-scratch conversion.
Agreed, but I find it quite ugly how the git history will end up
different depending on _when_ the incremental conversion is run. It
means that it will be impossible for two users to create the same Git
repo (matching SHA1s), unless they carefully synchronize all of their
conversion runs (at which point it's much simpler to run a single
conversion and then have both users fetch the result).
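To illustrate why the SHA1s diverge, here is a minimal Python sketch. It relies only on the fact that Git names an object by hashing a "<type> <size>\0" header plus the object body. The importer identity and the helper names are made up for the example; no existing converter is being quoted.

```python
import hashlib


def git_object_id(kind: str, body: bytes) -> str:
    # Git's object name is sha1(b"<kind> <size>" + NUL + body).
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def synthetic_commit(tree: str, parent: str, when: int, message: str) -> bytes:
    # Raw commit body as an importer might write it for a synthetic commit.
    # The identity below is hypothetical; "when" is the date the importer
    # picked, which depends on when the incremental run happened.
    ident = f"cvs-import <cvs-import@example.invalid> {when} +0000"
    lines = [
        f"tree {tree}",
        f"parent {parent}",
        f"author {ident}",
        f"committer {ident}",
        "",
        message,
        "",
    ]
    return "\n".join(lines).encode()
```

Two users who import the same CVS state but whose tools pick different dates for a synthetic commit get different ids for that commit, and therefore for every commit built on top of it.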
There is a continuum here in incremental converters:
At one end - given that you're always going to lose _some_ history -
you can go "screw it! let's not care about history at all!", and do
the fastest possible conversion: check out the current CVS version;
diff that against the previous CVS version; apply the diff to your Git
repo as a single commit. I suspect quite a lot of users would be happy
with this solution - at least as a temporary measure while they wait
for their surrounding organization to do a proper migration off CVS.
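As a rough sketch of that history-free end of the continuum (an illustration, not an existing tool): take a snapshot of the CVS checkout before and after updating it, and turn the difference into one Git commit. The function names are invented here, and skipping only the "CVS" and ".git" directories is an assumption about the checkout layout.

```python
import hashlib
import os


def snapshot(root):
    """Map each file path under root (relative to root) to a SHA-1 of its content."""
    result = {}
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip version-control metadata directories (assumed layout).
        dirnames[:] = [d for d in dirnames if d not in ("CVS", ".git")]
        for name in filenames:
            full = os.path.join(dirpath, name)
            with open(full, "rb") as fh:
                result[os.path.relpath(full, root)] = hashlib.sha1(fh.read()).hexdigest()
    return result


def tree_delta(old, new):
    """Return (added, modified, deleted) paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    modified = sorted(p for p in new if p in old and old[p] != new[p])
    deleted = sorted(p for p in old if p not in new)
    return added, modified, deleted
```

The three lists are what would be copied into (or removed from) the Git work tree before committing everything as a single commit, e.g. with "git add -A" followed by "git commit".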
At the other end - you can realize that the CVS storage format on the
server is simply too lossy, and you can write a proxy or monitor that
intercepts CVS operations on the server, and replicates those in a
companion Git repo as soon as they occur in CVS. Whether you write a
CVS server monitor that detects changes to the CVS server files in
real time (using e.g. inotify or similar), or you write a CVS server
proxy that intercepts CVS commands from the user (also forwarding them
to the _real_ CVS server) is an implementation detail[*]. The
important thing is that you end up with a real-time stream of
changes that can be converted to corresponding changes in a Git repo.
That should give you the closest possible picture of what really happens
in a CVS repo, even better than what CVS stores in its on-disk format.
This would allow an organization to provide a (read-only) Git mirror
of their CVS repo.
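A sketch of what "converting the stream" could look like, under loud assumptions: the CvsEvent record is hypothetical (CVS itself emits nothing like it; a proxy or monitor would have to produce it), the e-mail domain is a placeholder, and paths are assumed to be simple enough not to need quoting. The output format is the one read by "git fast-import".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CvsEvent:
    """One observed CVS operation (hypothetical record from a proxy/monitor)."""
    branch: str
    path: str
    author: str
    timestamp: int             # seconds since the epoch, UTC
    message: str
    content: Optional[bytes]   # new file content, or None if the file was removed


def _data(payload: bytes) -> bytes:
    # fast-import's exact-byte-count form: "data <count>", LF, raw bytes, LF.
    return b"data %d\n%s\n" % (len(payload), payload)


def to_fast_import(event: CvsEvent) -> bytes:
    """Render one event as a fast-import commit stanza."""
    # Placeholder identity: the domain is an assumption for the example.
    ident = f"{event.author} <{event.author}@cvs.invalid> {event.timestamp} +0000"
    out = [
        f"commit refs/heads/{event.branch}\n".encode(),
        f"committer {ident}\n".encode(),
        _data(event.message.encode()),
    ]
    if event.content is None:
        out.append(f"D {event.path}\n".encode())
    else:
        out.append(f"M 100644 inline {event.path}\n".encode())
        out.append(_data(event.content))
    out.append(b"\n")
    return b"".join(out)
```

The stanza has no "from" line, so the sketch assumes all events are fed to one long-running fast-import session; a fresh session would presumably have to name the existing branch tip explicitly.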
What we have been discussing in this thread (various strategies for
fixing up broken history in Git) can be considered intermediate points
between the two extremes presented above: You try to recreate as much
history as possible, but realize that you sometimes need to simply
synthesize some fake history in order to make everything fit together.
>> Here are just a couple of things a CVS user can do (and that happened
>> fairly regularly at my previous $dayjob) that would make life
>> difficult for an incremental converter (and that also make stable
>> output from a non-incremental converter hard to achieve in practice):
>> - A user "deletes" $file from $branch by simply removing the $branch
>> symbol on $file (cvs tag -B -d $branch $file). CVS stores no record of
>> this. Many non-incremental importers will see $file as never having
>> existed on $branch. An incremental importer starting from a previously
>> converted state, must somehow deal with that previous state no longer
>> existing from the POV of CVS.
> No problem; the tool could just add a synthetic commit "git rm"ming the
> file from the branch. It wouldn't know *when* the file was deleted, so
> it would have to pick a plausible date between the time of the last
> incremental conversion and the one that discovers that the branch tag
> has been removed from the file. The resulting Git history would contain
> more complete information than CVS's history.
A server proxy/monitor analyzing CVS operations in real time would
know _exactly_ when the file was removed...
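For comparison, the purely incremental approach quoted above (without a monitor) could be sketched like this. The function name and the choice of the midpoint as the "plausible date" are assumptions for the example; any date between the two runs would be equally defensible.

```python
def synthetic_removals(previous_files, current_files, last_import, this_import):
    """Plan synthetic deletions for files that vanished from a branch.

    previous_files / current_files: sets of paths on the branch as seen by
    the previous and the current import run.
    last_import / this_import: Unix timestamps of those two runs.
    Returns a list of (path, guessed_timestamp) pairs.
    """
    if this_import < last_import:
        raise ValueError("this_import must not be earlier than last_import")
    # The real deletion time is unknown; the midpoint between the two runs
    # is one arbitrary but plausible guess.
    guessed_when = last_import + (this_import - last_import) // 2
    return [(path, guessed_when) for path in sorted(previous_files - current_files)]
```

Each returned pair would become one "git rm" in a synthetic commit dated at the guessed timestamp.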
>> - A user moves a release tag on a few files to include a late bugfix
>> into an upcoming release (cvs tag -F -r $new_rev $tag $file). There
>> might be no single point in time where the tagged state existed in the
>> repo, it has become a "Frankentag". You could claim user error here,
>> and that such shortcuts should not happen, but that doesn't really
>> prevent it from ever happening. Recreating the tree state of the
>> Frankentag in Git is easy, but what kind of history do you construct
>> to lead up to that tree?
> Frankentags (tags that include file versions that didn't occur
> contemporaneously) can occur even with one-time CVS->Git conversions.
> The only way to handle them is to create a Git branch representing the
> tag and base it at a plausible Git commit, and then (on the branch)
> issue a fixup commit that makes the contents of the branch equal to the
> contents of the CVS branch. This is a problem that cvs2git already handles.
> A hypothetical incremental importer would have to notice the changes in
> the branch contents between the previous conversion and the current one,
> and create commits on the branch to bring it in line with the current
> contents. This is no uglier than what a one-shot conversion already has
> to do.
True, but analyzing CVS operations in real time, you might be able to
recreate the moving (and adding/deleting) of tags as file edits (and
adds/deletes) in the corresponding Git branch.
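The "plausible base plus fixup commit" idea quoted above can be sketched as follows. This is not cvs2git's actual heuristic, only an assumed stand-in: pick the commit whose tree needs the smallest fixup. Trees are modelled as dicts from path to CVS revision string, and all names are invented for the example.

```python
def fixup_ops(base_tree, tag_tree):
    """List the operations that turn base_tree into tag_tree.

    Both trees map path -> CVS revision string.
    """
    ops = []
    for path in sorted(set(base_tree) | set(tag_tree)):
        if path not in tag_tree:
            ops.append(("delete", path, None))
        elif base_tree.get(path) != tag_tree[path]:
            ops.append(("set", path, tag_tree[path]))
    return ops


def pick_base(commits, tag_tree):
    """Pick the commit needing the smallest fixup to reach tag_tree.

    commits: list of (commit_id, tree) in history order.
    Ties go to the earliest commit, because min() keeps the first minimum.
    """
    return min(commits, key=lambda c: len(fixup_ops(c[1], tag_tree)))
```

For a Frankentag, no commit yields an empty list of operations; the remaining operations are exactly the content of the fixup commit on the branch that represents the tag.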
>> - A modularized project develops code on HEAD, and makes regular
>> releases of each module by tagging the files in the module dir with
>> "$modulename-$version". Afterwards a project-wide "stable" tag is
>> moved on that subset of files to include the new module release into
>> the "stable" tag. ("stable" is conceptually a branch, but the CVS
>> mechanism used here is still the tag, since CVS branches cannot
>> "follow" each other like in Git). This is pretty much the same
>> Frankentag scenario as above, except that in this case it might be
>> considered Best Practice (it was at our $dayjob), and not a
>> shortcut/user error made by a single user.
> Same problem and same solution as above, as far as I can see.
>> (None of these examples even involve "cvs admin", which allows you
>> to do some truly scary and demented things to your CVS history...)
> Even some of these might be permitted. For example:
> * Obsoleting already-converted revisions: it's a pretty stupid thing to
> do in most cases and the tool could just ignore such events, retaining
> the history in Git. If the revisions were obsoleted because they
> contained proprietary information or something, then you've got a bigger
> problem on your hands but one that you would have even if you were using
> pure Git.
> * Retroactive changes to log messages: would probably have to be ignored
> or handled via notes.
> * Changes to the "default branch" (another brain-dead CVS feature
> related to vendor branches): I'd have to think about it. But handling
> vendor branches is already difficult for a one-time converter because
> CVS retains too little info (but cvs2git does it except in the most
> ambiguous cases). An incremental importer would have *more* information
> than a one-shot importer, because it would have a hope of catching the
> change to the default branch at roughly the time it occurred.
Agreed, but if you want correct metadata (_when_ did these changes
happen, _who_ performed them), then you need to actually monitor the
CVS command stream (or CVS server files) in real time...
>> My point here is that people will use whatever available tools they
>> have to solve whatever problems they are currently having. And when
>> CVS is your tool, you will sooner or later end up with a "solution"
>> that irrevocably rewrites your CVS history.
> Yes, but I maintain that an incremental importer could keep a Git
> history that is consistent with the CVS history in the sense that:
> 1. the result of checking out any branch or tag, right after a run of
> the importer, gives the same results as checking the same branch or tag
> out of CVS.
> 2. the Git history from one run is added to (never rewritten) by the
> next run.
Yes, and even my simplest/fastest possible converter described above
can meet those criteria. After that, it really becomes a question of
_how_much_ CVS history you want to retain in your incremental import.
I have described the two extremes above. Interestingly, _both_ of
those extremes would look quite different from the
whole-history-gone-incremental converters represented by cvs2git and
cvs-fast-export, and _both_ of the extremes would probably also
provide a converted result quite a bit faster than anything in between
(one by virtue of depending on a single "cvs update" command, and the
other by monitoring the CVS server and performing the conversion to
Git in real time).
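The second criterion quoted above (history is added to, never rewritten) is easy to check mechanically. A minimal sketch, assuming the commit graph is available as a dict from commit id to parent ids and the refs of each run as dicts from ref name to tip; the function names are invented here:

```python
def is_ancestor(parents, ancestor, descendant):
    """True if ancestor is reachable from descendant (or equal to it).

    parents maps commit id -> list of parent commit ids.
    """
    stack, seen = [descendant], set()
    while stack:
        commit = stack.pop()
        if commit == ancestor:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return False


def history_only_added(parents, old_refs, new_refs):
    """Check that every ref from the previous run still exists and that its
    old tip is an ancestor of (or equal to) its new tip."""
    return all(
        name in new_refs and is_ancestor(parents, old_tip, new_refs[name])
        for name, old_tip in old_refs.items()
    )
```

On a real repository the per-ref test corresponds to "git merge-base --is-ancestor <old-tip> <new-tip>".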
[*]: That said, I suspect git-cvsserver would be a good starting point
for implementing a CVS server proxy, if someone is actually interested
in looking at this...
Johan Herland, <jo...@herland.net>