Below are several more issues I found in the reposurgeon-6a conversion when comparing it against the gcc-reparent conversion.

I am sure these and whatever other problems I may find in the reposurgeon conversion can be fixed in time.  However, I don't see why we should bother.  My conversion has been available since summer 2019; I made it ready in time for GCC Cauldron 2019, and it hasn't changed in any significant way since then.  With the "Missed merges" problem (see below) I don't see how the reposurgeon conversion can be considered "ready".  Also, I expected a diligent developer to compare the new conversion (aka reposurgeon's) against the existing conversion (aka gcc-pretty / gcc-reparent) before declaring the new conversion "better" or even "ready".  The data I'm seeing in the differences between my conversion and reposurgeon's shows that the gcc-reparent conversion is /better/.

I suggest that the GCC community adopt either the gcc-pretty or the gcc-reparent conversion.  I welcome Richard E. to modify his summary scripts to work with the svn-git scripts, which should be straightforward, and I'm ready to help.

Meanwhile, I'm going to add additional root commits to my gcc-reparent conversion to bring in the "missing" branches (the ones that don't share history with trunk@1) and restart daily updates of the gcc-reparent conversion.

Finally, with the comparison data I have, I consider statements about git-svn's poor quality to be very misleading.  Git-svn may have had serious bugs years ago when Eric R. evaluated it and started his work on reposurgeon.  But a lot of development has happened and many problems have been fixed since then.  At the moment it is reposurgeon that is producing conversions with obscure mistakes in repository metadata.

=== Missed merges ===

Reposurgeon misses merges from trunk on 130+ branches.  I've spot-checked ARM/hard_vfp_branch and redhat/gcc-9-branch and, indeed, rather mundane merges were omitted.  Below is the analysis for ARM/hard_vfp_branch.

$ git log --stat refs/remotes/gcc-reposurgeon-6a/ARM/hard_vfp_branch~4
----
commit ef92c24b042965dfef982349cd5994a2e0ff5fde
Author: Richard Earnshaw <rearn...@gcc.gnu.org>
Date:   Mon Jul 20 08:15:51 2009 +0000

    Merge trunk through to r149768

    Legacy-ID: 149804

 COPYING.RUNTIME | 73 +
 ChangeLog | 270 +-
 MAINTAINERS | 19 +-
 <MANY OTHER FILES>
----

At the same time, for the svn-git scripts we have:

$ git log --stat refs/remotes/gcc-reparent/ARM/hard_vfp_branch~4
----
commit ce7d5c8df673a7a561c29f095869f20567a7c598
Merge: 4970119c20da 3a69b1e566a7
Author: Richard Earnshaw <rearn...@arm.com>
Date:   Mon Jul 20 08:15:51 2009 +0000

    Merge trunk through to r149768

    git-svn-id: https://gcc.gnu.org/svn/gcc/branches/ARM/hard_vfp_branch@149804 138bc75d-0d04-0410-961f-82ee72b054a4
----

... which agrees with

$ svn propget svn:mergeinfo file:///home/maxim.kuvyrkov/tmpfs-stuff/svnrepo/branches/ARM/hard_vfp_branch@149804
/trunk:142588-149768

=== Bad author entries ===

The reposurgeon-6a conversion has authors "12:46:56 1998 Jim Wilson" and "2005-03-18 Kazu Hirata".  It is rather obvious that a person's name is unlikely to start with a digit.

=== Missed authors ===

The reposurgeon-6a conversion misses many authors.  Below is a list of the missing people whose names start with "A":

Akos Kiss
Anders Bertelrud
Andrew Pochinsky
Anton Hartl
Arthur Norman
Aymeric Vincent

=== Conservative author entries ===

The reposurgeon-6a conversion uses default "@gcc.gnu.org" emails for many commits where the svn-git conversion manages to extract a valid email from commit data.  This happens for hundreds of author entries.

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On Dec 26, 2019, at 7:11 PM, Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
>
>
>> On Dec 26, 2019, at 2:16 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>>
>> On Thu, Dec 26, 2019 at 11:04:29AM +0000, Joseph Myers wrote:
>> Is there some easy way (e.g. file in the conversion scripts) to correct
>> spelling and other mistakes in the commit authors?
>> E.g.
>> there are misspelled surnames, etc. (e.g. looking at my name, I see
>> Jakub Jakub Jelinek (1):
>> Jakub Jeilnek (1):
>> Jelinek (1):
>> entries next to the expected one with most of the commits.
>> For the misspellings, wonder if e.g. we couldn't compute edit distances from
>> other names and if we have one with many commits and then one with very few
>> with small edit distance from those, flag it for human review.
>
> This is close to what svn-git-author.sh script is doing in gcc-pretty and
> gcc-reparent conversions.  It ignores 1-3 character differences in
> author/committer names and email addresses.  I've audited results for all
> branches and didn't spot any mistakes.
>
> In other news, I'm working on comparison of gcc-pretty, gcc-reparent and
> gcc-reposurgeon-5a repos among themselves.  Below are current notes for
> comparison of gcc-pretty/trunk and gcc-reposurgeon-5a/trunk.
>
> == Merges on trunk ==
>
> Reposurgeon creates merge entries on trunk when changes from a branch are
> merged into trunk.  This brings entire development history from the branch to
> trunk, which is both good and bad.  The good part is that we get more
> visibility into how the code evolved.  The bad part is that we get many
> "noisy" commits from merged branch (e.g., "Merge in trunk" every few
> revisions) and that our SVN branches are work-in-progress quality, not ready
> for review/commit quality.  It's common for files to be re-written in large
> chunks on branches.
>
> Also, reposurgeon's commit logs don't have information on SVN path from which
> the change came, so there is no easy way to determine that a given commit is
> from a merged branch, not an original trunk commit.  Git-svn, on the other
> hand, provides "git-svn-id: <path>@<revision>" tags in its commit logs.
>
> My conversion follows current GCC development policy that trunk history
> should be linear.  Branch merges to trunk are squashed.
> Merges between non-trunk branches are handled as specified by svn:mergeinfo
> SVN properties.
>
> == Differences in trees ==
>
> Git trees (aka filesystem content) match between pretty/trunk and
> reposurgeon-5a/trunk from current tip and up to svn's r130805.
> Here is SVN log of that revision (restoration of deleted trunk):
> ------------------------------------------------------------------------
> r130805 | dberlin | 2007-12-13 01:53:37 +0000 (Thu, 13 Dec 2007)
> Changed paths:
>    A /trunk (from /trunk:130802)
> ------------------------------------------------------------------------
>
> Reposurgeon conversion has:
> -------------
> commit 7e6f2a96e89d96c2418482788f94155d87791f0a
> Author: Daniel Berlin <dber...@gcc.gnu.org>
> Date:   Thu Dec 13 01:53:37 2007 +0000
>
>     Readd trunk
>
>     Legacy-ID: 130805
>
>  .gitignore | 17 -----------------
>  1 file changed, 17 deletions(-)
> -------------
> and my conversion has:
> -------------
> commit fb128f3970789ce094c798945b4fa20eceb84cc7
> Author: Daniel Berlin <dber...@dbrelin.org>
> Date:   Thu Dec 13 01:53:37 2007 +0000
>
>     Readd trunk
>
>
>     git-svn-id: https://gcc.gnu.org/svn/gcc/trunk@130805 138bc75d-0d04-0410-961f-82ee72b054a4
> -------------
>
> It appears that .gitignore has been added in r1 by reposurgeon and then
> deleted at r130805.  In SVN repository .gitignore was added in r195087.  I
> speculate that addition of .gitignore at r1 is expected, but its deletion at
> r130805 is highly suspicious.
>
> == Committer entries ==
>
> Reposurgeon uses $u...@gcc.gnu.org for committer email addresses even when it
> correctly detects author name from ChangeLog.
>
> reposurgeon-5a:
> r278995 Martin Liska <mli...@suse.cz> Martin Liska <mar...@gcc.gnu.org>
> r278994 Jozef Lawrynowicz <joze...@mittosystems.com> Jozef Lawrynowicz <joz...@gcc.gnu.org>
> r278993 Frederik Harwath <frede...@codesourcery.com> Frederik Harwath <frede...@gcc.gnu.org>
> r278992 Georg-Johann Lay <a...@gjlay.de> Georg-Johann Lay <g...@gcc.gnu.org>
> r278991 Richard Biener <rguent...@suse.de> Richard Biener <rgue...@gcc.gnu.org>
>
> pretty:
> r278995 Martin Liska <mli...@suse.cz> Martin Liska <mli...@suse.cz>
> r278994 Jozef Lawrynowicz <joze...@mittosystems.com> Jozef Lawrynowicz <joze...@mittosystems.com>
> r278993 Frederik Harwath <frede...@codesourcery.com> Frederik Harwath <frede...@codesourcery.com>
> r278992 Georg-Johann Lay <a...@gjlay.de> Georg-Johann Lay <a...@gjlay.de>
> r278991 Richard Biener <rguent...@suse.de> Richard Biener <rguent...@suse.de>
>
> == Bad summary line ==
>
> While looking around r138087, below caught my eye.  Is the contents of
> summary line as expected?
>
> commit cc2726884d56995c514d8171cc4a03657851657e
> Author: Chris Fairles <chris.fair...@gmail.com>
> Date:   Wed Jul 23 14:49:00 2008 +0000
>
>     acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define GLIBCXX_LIBS.
>
>     2008-07-23 Chris Fairles <chris.fair...@gmail.com>
>
>         * acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define GLIBCXX_LIBS.
>         Holds the lib that defines clock_gettime (-lrt or -lposix4).
>         * src/Makefile.am: Use it.
>         * configure: Regenerate.
>         * configure.in: Likewise.
>         * Makefile.in: Likewise.
>         * src/Makefile.in: Likewise.
>         * libsup++/Makefile.in: Likewise.
>         * po/Makefile.in: Likewise.
>         * doc/Makefile.in: Likewise.
>
>     Legacy-ID: 138087
>
>
> --
> Maxim Kuvyrkov
> https://www.linaro.org
>
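
For anyone who wants to repeat two of the checks from the top of this message (bad author entries and missed merges), here is a minimal shell sketch.  The function names flag_digit_authors and count_merges are hypothetical helpers made up for illustration; they are not part of the svn-git scripts or of reposurgeon.  The sketch assumes both conversions have been fetched into one repository under the remote names used in this message.

```shell
#!/bin/sh
# Hypothetical helper: print author names read from stdin that start with a
# digit, e.g. "2005-03-18 Kazu Hirata".  Succeeds even when nothing matches.
flag_digit_authors() {
    grep -E '^[0-9]' || true
}

# Hypothetical helper: count merge commits reachable from the given ref.
count_merges() {
    git rev-list --merges --count "$1"
}

# Example usage (needs a repository with both conversions fetched):
#   git log --format='%an' refs/remotes/gcc-reposurgeon-6a/ARM/hard_vfp_branch \
#       | sort -u | flag_digit_authors
#   count_merges refs/remotes/gcc-reposurgeon-6a/ARM/hard_vfp_branch
#   count_merges refs/remotes/gcc-reparent/ARM/hard_vfp_branch
```

A lower merge count for the same branch in one conversion is only a hint that merges were dropped; the individual commits still need to be inspected, as in the ARM/hard_vfp_branch analysis.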