Below are several more issues I found in the reposurgeon-6a conversion when 
comparing it against the gcc-reparent conversion.

I am sure that these, and whatever other problems I may find in the reposurgeon 
conversion, can be fixed in time.  However, I don't see why we should bother.  
My conversion has been available since summer 2019; I made it ready in time for 
GCC Cauldron 2019, and it hasn't changed in any significant way since then.

With the "Missed merges" problem (see below) I don't see how the reposurgeon 
conversion can be considered "ready".  Also, I expected a diligent developer to 
compare the new conversion (reposurgeon's) against the existing conversions 
(gcc-pretty / gcc-reparent) before declaring the new conversion "better" or even 
"ready".  The data I'm seeing in the differences between my conversion and 
reposurgeon's shows that the gcc-reparent conversion is /better/.

I suggest that the GCC community adopt either the gcc-pretty or the gcc-reparent 
conversion.  I welcome Richard E. to modify his summary scripts to work with the 
svn-git scripts, which should be straightforward, and I'm ready to help.

Meanwhile, I'm going to add additional root commits to my gcc-reparent 
conversion to bring in "missing" branches (the ones that don't share history 
with trunk@1) and restart daily updates of the gcc-reparent conversion.

Finally, with the comparison data I have, I consider statements about git-svn's 
poor quality to be very misleading.  Git-svn may have had serious bugs years 
ago when Eric R. evaluated it and started his work on reposurgeon.  But a lot 
of development has happened and many problems have been fixed since then.  At 
the moment it is reposurgeon that is producing conversions with obscure 
mistakes in repository metadata.


=== Missed merges ===

Reposurgeon misses merges from trunk on 130+ branches.  I've spot-checked 
ARM/hard_vfp_branch and redhat/gcc-9-branch and, indeed, rather mundane merges 
were omitted.  Below is the analysis for ARM/hard_vfp_branch.

$ git log --stat refs/remotes/gcc-reposurgeon-6a/ARM/hard_vfp_branch~4
----
commit ef92c24b042965dfef982349cd5994a2e0ff5fde
Author: Richard Earnshaw <rearn...@gcc.gnu.org>
Date:   Mon Jul 20 08:15:51 2009 +0000

    Merge trunk through to r149768
    
    Legacy-ID: 149804

 COPYING.RUNTIME                                     |    73 +
 ChangeLog                                           |   270 +-
 MAINTAINERS                                         |    19 +-
<MANY OTHER FILES>
----

At the same time, for the svn-git scripts we have:

$ git log --stat refs/remotes/gcc-reparent/ARM/hard_vfp_branch~4
----
commit ce7d5c8df673a7a561c29f095869f20567a7c598
Merge: 4970119c20da 3a69b1e566a7
Author: Richard Earnshaw <rearn...@arm.com>
Date:   Mon Jul 20 08:15:51 2009 +0000

    Merge trunk through to r149768
    
    git-svn-id: https://gcc.gnu.org/svn/gcc/branches/ARM/hard_vfp_branch@149804 138bc75d-0d04-0410-961f-82ee72b054a4
----

... which agrees with
$ svn propget svn:mergeinfo file:///home/maxim.kuvyrkov/tmpfs-stuff/svnrepo/branches/ARM/hard_vfp_branch@149804
/trunk:142588-149768
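
This kind of check can be automated.  Below is a minimal sketch in Python, not 
part of either conversion: it parses the output of "svn propget svn:mergeinfo" 
and counts the parents shown in a "git log" entry, so that a revision which 
merges from trunk but ends up as a single-parent commit can be flagged.  The 
function names are hypothetical, and the sketch assumes the default "git log" 
output format shown above.

```python
def parse_mergeinfo(text):
    """Parse svn:mergeinfo text into {source_path: [(first_rev, last_rev), ...]}."""
    merged = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "/trunk:142588-149768,150000".
        path, _, revs = line.rpartition(":")
        ranges = []
        for item in revs.split(","):
            item = item.rstrip("*")  # a trailing '*' marks a non-inheritable range
            first, _, last = item.partition("-")
            ranges.append((int(first), int(last or first)))
        merged[path] = ranges
    return merged


def parent_count(log_entry):
    """Count parents of one commit as printed by plain `git log`."""
    for line in log_entry.splitlines():
        # Header lines start at column 0; the commit message is indented.
        if line.startswith("Merge:"):
            return len(line.split()[1:])
    return 1  # no "Merge:" header: an ordinary commit (or a root commit)


def looks_like_missed_merge(mergeinfo_text, log_entry, source="/trunk"):
    """True if SVN records merges from `source` but the commit has one parent."""
    return source in parse_mergeinfo(mergeinfo_text) and parent_count(log_entry) < 2
```

Strictly speaking, a merge should be attributed to the revision where 
svn:mergeinfo changes, so a real check would compare the property against its 
value at the previous revision; the sketch only looks at one revision.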

=== Bad author entries ===

The reposurgeon-6a conversion has authors "12:46:56 1998 Jim Wilson" and 
"2005-03-18 Kazu Hirata".  It is rather obvious that a person's name is unlikely 
to start with a digit.
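
Entries like these are easy to flag mechanically.  A minimal sketch in Python, 
assuming the author names have been dumped one per line (for example with 
"git log --format=%an"); the function name is made up for illustration:

```python
import re

# An author name that starts with a digit is almost certainly a mis-parsed
# date or timestamp from a ChangeLog header.
STARTS_WITH_DIGIT = re.compile(r"^\s*\d")


def suspicious_authors(names):
    """Return the names that begin with a digit, preserving input order."""
    return [name for name in names if STARTS_WITH_DIGIT.match(name)]
```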

=== Missed authors ===

The reposurgeon-6a conversion misses many authors; below is a list of the 
missing people whose names start with "A".

Akos Kiss
Anders Bertelrud
Andrew Pochinsky
Anton Hartl
Arthur Norman
Aymeric Vincent
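
A list like the one above can be produced by a plain set difference between the 
author names of the two conversions.  A minimal sketch in Python, assuming both 
name lists were obtained separately (for example with "git log --format=%an" on 
each repository); the function name and its "prefix" parameter are hypothetical:

```python
def missing_authors(reference_names, candidate_names, prefix=""):
    """Names present in the reference conversion but absent from the candidate.

    `prefix` optionally restricts the result to names starting with it.
    """
    missing = set(reference_names) - set(candidate_names)
    return sorted(name for name in missing if name.startswith(prefix))
```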

=== Conservative author entries ===

The reposurgeon-6a conversion uses default "@gcc.gnu.org" emails for many 
commits where the svn-git conversion manages to extract a valid email from 
commit data.  This happens for hundreds of author entries.
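
One way to count such entries is to line up the two conversions by SVN revision 
and report the commits where one side falls back to the default domain while 
the other has a real address.  This is only a sketch in Python: the input 
dictionaries (revision -> (name, email)) and the function name are assumptions, 
and the addresses in the usage example are made up.

```python
def conservative_entries(reference, candidate, default_domain="gcc.gnu.org"):
    """Find revisions where `candidate` uses the default domain but `reference` does not.

    Both arguments map an SVN revision number to a (name, email) pair.
    Returns a sorted list of (revision, name, reference_email).
    """
    suffix = "@" + default_domain
    found = []
    for rev, (name, email) in candidate.items():
        ref = reference.get(rev)
        if ref is None:
            continue  # revision not present in the reference conversion
        ref_email = ref[1]
        if email.endswith(suffix) and not ref_email.endswith(suffix):
            found.append((rev, name, ref_email))
    return sorted(found)
```

For example, with hypothetical data:

```python
reference = {1: ("A. Hacker", "a.hacker@example.org")}
candidate = {1: ("A. Hacker", "ahacker@gcc.gnu.org")}
print(conservative_entries(reference, candidate))
```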

Regards,

--
Maxim Kuvyrkov
https://www.linaro.org


> On Dec 26, 2019, at 7:11 PM, Maxim Kuvyrkov <maxim.kuvyr...@linaro.org> wrote:
> 
> 
>> On Dec 26, 2019, at 2:16 PM, Jakub Jelinek <ja...@redhat.com> wrote:
>> 
>> On Thu, Dec 26, 2019 at 11:04:29AM +0000, Joseph Myers wrote:
>> Is there some easy way (e.g. file in the conversion scripts) to correct
>> spelling and other mistakes in the commit authors?
>> E.g. there are misspelled surnames, etc. (e.g. looking at my name, I see
>> Jakub Jakub Jelinek (1):
>> Jakub Jeilnek (1):
>> Jelinek (1):
>> entries next to the expected one with most of the commits.
>> For the misspellings, wonder if e.g. we couldn't compute edit distances from
>> other names and if we have one with many commits and then one with very few
>> with small edit distance from those, flag it for human review.
> 
> This is close to what svn-git-author.sh script is doing in gcc-pretty and 
> gcc-reparent conversions.  It ignores 1-3 character differences in 
> author/committer names and email addresses.  I've audited results for all 
> branches and didn't spot any mistakes.
> 
> In other news, I'm working on comparison of gcc-pretty, gcc-reparent and 
> gcc-reposurgeon-5a repos among themselves.  Below are current notes for 
> comparison of gcc-pretty/trunk and gcc-reposurgeon-5a/trunk.
> 
> == Merges on trunk ==
> 
> Reposurgeon creates merge entries on trunk when changes from a branch are 
> merged into trunk.  This brings entire development history from the branch to 
> trunk, which is both good and bad.  The good part is that we get more 
> visibility into how the code evolved.  The bad part is that we get many 
> "noisy" commits from merged branch (e.g., "Merge in trunk" every few 
> revisions) and that our SVN branches are work-in-progress quality, not ready 
> for review/commit quality.  It's common for files to be re-written in large 
> chunks on branches.
> 
> Also, reposurgeon's commit logs don't have information on SVN path from which 
> the change came, so there is no easy way to determine that a given commit is 
> from a merged branch, not an original trunk commit.  Git-svn, on the other 
> hand, provides "git-svn-id: <path>@<revision>" tags in its commit logs.
> 
> My conversion follows current GCC development policy that trunk history 
> should be linear.  Branch merges to trunk are squashed.  Merges between 
> non-trunk branches are handled as specified by svn:mergeinfo SVN properties.
> 
> == Differences in trees ==
> 
> Git trees (aka filesystem content) match between pretty/trunk and 
> reposurgeon-5a/trunk from current tip and up to svn's r130805.
> Here is SVN log of that revision (restoration of deleted trunk):
> ------------------------------------------------------------------------
> r130805 | dberlin | 2007-12-13 01:53:37 +0000 (Thu, 13 Dec 2007)
> Changed paths:
>   A /trunk (from /trunk:130802)
> ------------------------------------------------------------------------
> 
> Reposurgeon conversion has:
> -------------
> commit 7e6f2a96e89d96c2418482788f94155d87791f0a
> Author: Daniel Berlin <dber...@gcc.gnu.org>
> Date:   Thu Dec 13 01:53:37 2007 +0000
> 
>    Readd trunk
> 
>    Legacy-ID: 130805
> 
> .gitignore | 17 -----------------
> 1 file changed, 17 deletions(-)
> -------------
> and my conversion has:
> -------------
> commit fb128f3970789ce094c798945b4fa20eceb84cc7
> Author: Daniel Berlin <dber...@dbrelin.org>
> Date:   Thu Dec 13 01:53:37 2007 +0000
> 
>    Readd trunk
> 
> 
>    git-svn-id: https://gcc.gnu.org/svn/gcc/trunk@130805 
> 138bc75d-0d04-0410-961f-82ee72b054a4
> -------------
> 
> It appears that .gitignore has been added in r1 by reposurgeon and then 
> deleted at r130805.  In SVN repository .gitignore was added in r195087.  I 
> speculate that addition of .gitignore at r1 is expected, but its deletion at 
> r130805 is highly suspicious.
> 
> == Committer entries ==
> 
> Reposurgeon uses $u...@gcc.gnu.org for committer email addresses even when it 
> correctly detects author name from ChangeLog.
> 
> reposurgeon-5a:
> r278995 Martin Liska <mli...@suse.cz> Martin Liska <mar...@gcc.gnu.org>
> r278994 Jozef Lawrynowicz <joze...@mittosystems.com> Jozef Lawrynowicz 
> <joz...@gcc.gnu.org>
> r278993 Frederik Harwath <frede...@codesourcery.com> Frederik Harwath 
> <frede...@gcc.gnu.org>
> r278992 Georg-Johann Lay <a...@gjlay.de> Georg-Johann Lay <g...@gcc.gnu.org>
> r278991 Richard Biener <rguent...@suse.de> Richard Biener 
> <rgue...@gcc.gnu.org>
> 
> pretty:
> r278995 Martin Liska <mli...@suse.cz> Martin Liska <mli...@suse.cz>
> r278994 Jozef Lawrynowicz <joze...@mittosystems.com> Jozef Lawrynowicz 
> <joze...@mittosystems.com>
> r278993 Frederik Harwath <frede...@codesourcery.com> Frederik Harwath 
> <frede...@codesourcery.com>
> r278992 Georg-Johann Lay <a...@gjlay.de> Georg-Johann Lay <a...@gjlay.de>
> r278991 Richard Biener <rguent...@suse.de> Richard Biener <rguent...@suse.de>
> 
> == Bad summary line ==
> 
> While looking around r138087, the commit below caught my eye.  Are the 
> contents of the summary line as expected?
> 
> commit cc2726884d56995c514d8171cc4a03657851657e
> Author: Chris Fairles <chris.fair...@gmail.com>
> Date:   Wed Jul 23 14:49:00 2008 +0000
> 
>    acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define GLIBCXX_LIBS.
> 
>    2008-07-23  Chris Fairles <chris.fair...@gmail.com>
> 
>            * acinclude.m4 ([GLIBCXX_CHECK_CLOCK_GETTIME]): Define 
> GLIBCXX_LIBS.
>            Holds the lib that defines clock_gettime (-lrt or -lposix4).
>            * src/Makefile.am: Use it.
>            * configure: Regenerate.
>            * configure.in: Likewise.
>            * Makefile.in: Likewise.
>            * src/Makefile.in: Likewise.
>            * libsup++/Makefile.in: Likewise.
>            * po/Makefile.in: Likewise.
>            * doc/Makefile.in: Likewise.
> 
>    Legacy-ID: 138087
> 
> 
> --
> Maxim Kuvyrkov
> https://www.linaro.org
> 
