On Mon, 30 Dec 2019, Segher Boessenkool wrote:

> To make it not be super much work, I'd do the second option: better
> heuristics.  Those in Maxim's conversion have been great since over half
> a year, you could borrow some, or peek for inspiration?

Actually, comparing authors between the two conversions shows plenty of 
places where the more aggressive ChangeLog extraction in Maxim's 
conversion has produced worse attributions than reposurgeon (e.g. 
attributing merges to some random author from a ChangeLog modified in the 
merge, rather than to the committer of the merge, or attributing fixes in 
a ChangeLog to the author of a random entry that got fixed), as well as 
places where it's simply failed to extract an author from a ChangeLog that 
reposurgeon has extracted.  So for "great", read "have some good ideas to 
learn from, but plenty of places with problems as well".

I'm working on a more detailed comparison of authors, with some more 
heuristics to help identify the most interesting cases for manual 
inspection (those where it's more likely Maxim's heuristics are finding 
valid authors reposurgeon didn't) and separate those from cases where 
different subjective choices were made (e.g. of how to assign an author 
when one person backports another's patch, or multi-author commits where 
one conversion chose one author as the main one and the other conversion 
chose the other author).

> If you guys want to ever finish, you'll need to drop the quest for
> perfection, because this leads to a) much more work, and b) worse quality
> in the end.

To me, that indicates that using a conversion tool that is conservative in 
its heuristics, and then selectively applying improvements to the extent 
they can be done safely with manual review in a reasonable time, is better 
than applying a conversion tool with more aggressive heuristics.

The issues with the reposurgeon conversion listed in Maxim's last comments 
were of the form "reposurgeon is being conservative in how it generates 
metadata from SVN information".  I think that's a very good basis for 
adding on a limited set of safe improvements to authors and commit 
messages that can be done reasonably soon and then doing the final 
conversion with reposurgeon.

-- 
Joseph S. Myers
jos...@codesourcery.com
