On 19/11/2019 14:56, Jason Merrill wrote:
On Mon, Nov 18, 2019 at 4:38 PM Richard Earnshaw (lists) <
richard.earns...@arm.com> wrote:

On 18/11/2019 20:53, Jason Merrill wrote:
On Mon, Nov 18, 2019 at 2:51 PM Segher Boessenkool <
seg...@kernel.crashing.org> wrote:

On Mon, Nov 18, 2019 at 07:21:22PM +0000, Richard Earnshaw (lists)
wrote:
On 18/11/2019 18:53, Segher Boessenkool wrote:
PR target/92140: clang vs gcc optimizing with adc/sbb
PR fortran/91926: assumed rank optional
PR tree-optimization/91532: [SVE] Redundant predicated store in
gcc.target/aarch64/fmla_2.c
PR tree-optimization/92161: ICE in vect_get_vec_def_for_stmt_copy, at
tree-vect-stmts.c:1687
PR tree-optimization/92162: ICE in vect_create_epilog_for_reduction,
at tree-vect-loop.c:4252
PR c++/92015: internal compiler error: in cxx_eval_array_reference, at
cp/constexpr.c:2568
PR tree-optimization/92173: ICE in optab_for_tree_code, at
optabs-tree.c:81
PR tree-optimization/92173: ICE in optab_for_tree_code, at
optabs-tree.c:81
PR fortran/92174: runtime error: index 15 out of bounds for type
'gfc_expr *[15]

Most of these aren't helpful at all, and none of these are good commit
summaries.  The PR92173 one actually has identical commit messages
btw,
huh.  Ah, the second one (r277288) has the wrong changelog, but in the
actual changelog file as well, not something any tool could fix up (or
have we reached the singularity?)

Identical commits are normally from where the same commit is made to
multiple branches.  It's not uncommon to see this when bugs are fixed.

This is an actual mistake.  The commits are not identical at all, just
the commit messages are (and the changelog entries, too).  Not something
that happens too often, but of course I hit it in the first random thing I
pick :-)

Ultimately the question here is whether something like the above is
more
or less useful than what we have today, which is summary lines of the
form:

<date> <user> <email>

I already said I would prefer things like
   Patch related to PR323
as the patch subject lines.  No one argues that the current state of
affairs is good.  I argue that replacing this with often wrong and
irrelevant information isn't the best we can do.


How about using the first line that isn't a ChangeLog date/author line,
without trying to rewrite/augment it?

Jason


It would certainly be another way of doing it.  Sometimes it would
produce almost the same as an unadulterated PR; sometimes it would
produce something more meaningful and sometimes it would be pretty
useless.  It probably would hit more cases than my current script in
that it wouldn't require the commit to mention a PR in it.

The main problem is that the first line is often incomplete, and much of
it is also wasted with elements like the full path to a file that is
quite deep in the tree.  Let's take a quick example (the first I found in
the dump I have).

1998-12-17  Vladimir N. Makarov  <vmaka...@cygnus.com>
         * config/i960/i960.md (extendqihi2): Fix typo (usage ',' instead of
         ';').
1998-12-17  Michael Tiemann  <tiem...@axon.cygnus.com>
         * i960.md (extend*, zero_extend*): Don't generate rtl that looks
         like (subreg:SI (reg:SI N) 0), because it's wrong, and it hides
         optimizations from the combiner.

Firstly, this example misses a blank line between the author and the
change message itself, which makes distinguishing between this and the
multiple authors case harder.  Secondly, the entry really spans two
lines and cutting it off at the end of the first line would be, well, a
bit odd.  We could try to piece things together more, by combining lines
until we find a sentence end ( \.$ or \.\s\s ), and we could also strip
off more of the path components to just leave the name of the file
committed.  For example,

i960.md (extendqihi2): Fix typo (usage ',' instead of ';').

That might work better, but obviously it's harder to handle and thus
more likely to create erroneous summaries.
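For illustration, the line-joining and path-stripping heuristic described above might look something like this (a rough sketch with made-up helper names, not code from the attached script):

```python
import re

# Join ChangeLog lines until we hit a sentence end (a '.' at the end of
# the accumulated text, or '.' followed by two spaces), as suggested above.
def first_sentence(lines):
    text = ""
    for line in lines:
        text += (" " if text else "") + line.strip()
        if re.search(r"\.$", text) or re.search(r"\.\s\s", text):
            break
    return text

# Strip the leading "* " marker and any path components, leaving just
# the file name at the start of the entry.
def strip_path(entry):
    return re.sub(r"^\*\s*(\S+/)?", "", entry)

entry = ["* config/i960/i960.md (extendqihi2): Fix typo (usage ','",
         "instead of ';')."]
print(strip_path(first_sentence(entry)))
# -> i960.md (extendqihi2): Fix typo (usage ',' instead of ';').
```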


Yep. I don't think we need to worry about getting optimal one-line
summaries for ancient commits; something reasonably unique should be plenty.


Attached is the latest version of my script. I used (very nearly) this to produce a conversion over the weekend and I've uploaded that here:

https://gitlab.com/rearnsha/gcc-trial-20191130

Note that I might blow this away at any time. IT IS NOT A FINAL CONVERSION.

Some other things to note:
- there are a number of known issues with the version of reposurgeon used for this that are being worked on
  - emptycommit-* tags - my control script was out-of-date
  - *deleted* branches - this is being worked on
  - weird dependencies around merges - this is being worked on
  - author attributions are sometimes incorrect - reported

The main difference between the attached script and the one I used for this conversion is that ChangeLog entries containing '::' inside a function list are now handled correctly, so a number of cases that were previously missed are now picked up.

Choices I made:
- When a PR is used to derive the summary, I prefix this with 're' (as in the Latin 'in re').
- long change hunks produce poor summaries.  To reduce the overhead:
  - path names are removed, leaving just the final file name
  - multiple files are replaced with [...] after the first filename
  - similarly, multiple function names are replaced with [...]
  - very long comments are truncated, preferably at the strongest
    punctuation mark, but sometimes after key words, such as 'if',
    'when', 'unless' and a few more.  Ultimately, if the line is
    still too long, we just break after an arbitrary space.
- Where possible, useful summary lines that appear after an author
  attribution are hoisted as the summary.
- certain key words in otherwise not very useful summary lines are
  also spotted and used to add [revert] or [backport] annotations to
  the summary.
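As a rough illustration of the truncation rule in the last bullet (a simplified sketch with a made-up helper name, not the actual find_best_breakpoint logic from the script):

```python
# Simplified sketch: break a long change description at the strongest
# punctuation mark before an 80-column limit, falling back to the last
# space.  The real script also special-cases '):' and '::' and handles
# break words like 'if'/'when'/'unless'.
CHANGEMAX = 80

def truncate(change):
    if len(change) <= CHANGEMAX:
        return change
    head = change[:CHANGEMAX]
    for mark in (": ", "; ", ", "):  # strongest break first
        pos = head.rfind(mark)
        if pos > 0:
            return head[:pos + 1] + "..."
    return head[:head.rfind(" ")] + "..."

print(truncate("Don't generate rtl that looks like (subreg:SI (reg:SI N) 0), "
               "because it's wrong, and it hides optimizations from the combiner."))
# -> Don't generate rtl that looks like (subreg:SI (reg:SI N) 0),...
```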

No changes are made to the main commit log; if we add a new summary line, the entire original text is kept.

An example of a summary produced by this is for commit r278572, where the original log entry is:


        Backported from mainline
        2019-08-02  Jakub Jelinek  <ja...@redhat.com>

        * quadmath.h (M_Eq, M_LOG2Eq, M_LOG10Eq, M_LN2q, M_LN10q, M_PIq,
        M_PI_2q, M_PI_4q, M_1_PIq, M_2_PIq, M_2_SQRTPIq, M_SQRT2q,
        M_SQRT1_2q): Use two more decimal places.

And the script then generates:

[backport] quadmath.h (M_Eq, [...]): Use two more decimal places.

as the summary.

R.
#!/usr/bin/python2

import requests
import json
import os.path
import codecs
import email.message, email.parser
import re
import signal
import sys

# Files that we read or write.
buglist = "bugids.list"
bugdbfile = "bugs.json"
commits_in = "gcc.commitlog"
commits_out = "gcc.fixedprs"
commits_unfixed = "gcc.unfixed"

bugdb = {}

# Dictionary indexed by bugzilla component.  If the claimed PR component
# matches against the real component, then assume this is probably an OK
# match.  We avoid a lot of false positive warnings this way.
# A special match of "*" means we don't care in this case.  Many bugs
# are filed under, for example, bootstrap or other when in fact the
# real component will ultimately be shown in the commit log.
compalias = {
    "rtl-optimization": ["opt", "optimization", "optimisation", "optimize",
                         "middle-end", "rtl-opt", "rtl_optimization",
                         "rtl-optimizations", "rtl"],
    "tree-optimization": ["opt", "optimization", "optimisation", "optimize",
                          "middle-end", "tree-opt", "tree-optimzation",
                          "graphite", "lto", "tree-optimizations",
                          "tree-optimizer", "ipa", "tree-optimize",
                          "tree-optimizatio"],
    "java": ["gcj", "libgcj", "libjava"],
    "fastjar": ["java"],
    "libgcj": ["java", "libjava"],
    "classpath": ["libgcj", "libjava", "xml", "crypto"],
    "awt": ["classpath", "libjava"],
    "xml": ["classpath", "libjava"],
    "crypto": ["classpath", "libjava"],
    "objc++": ["obj-c++"],
    "c++": ["libstdc++", "diagnostic", "cp"],
    "c": ["diagnostic", "c-family"],
    "libstdc++": ["c++", "libstdc++-v3"],
    "libmudflap": ["mudflap"],
    "preprocessor": ["cpp", "pch"],
    "bootstrap": ["*"],
    "testsuite": ["*"],
    "regression": ["*"],
    "sanitizer": ["sanitize", "ubsan"],
    "other": ["*"],
    "middle-end": ["optimization", "rtl-optimization", "tree-optimization",
                   "tree-opt", "rtl-opt", "ipa", "lto", "middle", "middle-ed"],
    "ipa": ["middle-end", "lto"],
    "lto": ["middle-end", "ipa", "tree-optimization"],
    "target": ["*"],
    "plugins": ["*"],
    "libf2c": ["fortran", "gfortran", "libgfortran", "libfortran"],
    "fortran": ["gfortran", "libgfortran", "libfortran", "forrtran"],
    "libfortran": ["fortran", "gfortran", "libgfortran", "libfofortran"],
    "libgomp": ["fortran"],
    "gcov-profile": ["profile", "gcov"],
    "libffi": ["other"],
    "web": ["documentation"]}

# List of Legacy-IDs where the PR is correct, despite the anomaly
# checks we perform.
whitelist = ["52862", "53125", "57958", "57959", "58705", "51884", "58535",
             "58542",
             "61329", "61770", "61960", "61984", "63783", "65127", "65128",
             "67653", "68236", "68239",
             "70062", "70542", "70625", "70932", "72072", "72259", "72260",
             "73504", "73940", "73940", "77602", "78312", "78313",
             "80793", "82490", "82518", "83609", "84217", "84295", "84531",
             "84532", "84887", "84895", "86396", "86429", "86458", "86459",
             "86460", "86461", "87066", "87189", "87511", "87512", "87632",
             "88922", "89175", "89921",
             "90004", "91308", "91504", "92158", "92585", "92658", "94054",
             "94226", "94317", "94805", "95382", "95383", "95428", "95673",
             "96396", "96626", "97373", "97375", "97537", "98412", "98926",
             "99480", "99514",
             "100206", "100207", "100209", "100371", "100851", "100961",
             "101264", "101273", "102155", "102348", "102539", "102567",
             "102799", "102920", "103268", "103279", "103308", "104948",
             "104949", "104988", "105052", "105274", "106269", "106582",
             "107816", "108045", "109166", "109167", "109168", "109169",
             "109643", "109644", "109818",
             "110131", "110132", "110627", "110975", "111818", "111947",
             "111952", "112296", "112298", "112611", "112612", "112819",
             "113095", "113119", "113121", "113137", "113627", "113628",
             "113661", "113662",
             "113701", "113702", "113703", "114482", "115200",
             "116613", "116633", "116633", "117824",
             "117852", "118025", "118034", "118142",
             "118742", "118743", "119248", "119485", "119486",
             "120497", "120498", "120499", "120523", "121334", "121338",
             "121949", "122640", "122641", "122700", "122743", "122751",
             "122798", "124362", "125025",
             "125108", "125844", "126675", "127812", "127688", "127689",
             "128272", "128573", "128649", "129871",
             "130016", "131781", "132577", "132587",
             "141646", "142724", "142725", "142833", "144976",
             "152550", "157199", "157784",
             "160456", "160476", "163777",
             "176083", "176167", "176623", "176624",
             "182943", "187842",
             "190100", "193620", "194102", "195610", "199972",
             "201377", "201378", "202642", "202937", "205753", "209250",
             "210999", "211000", "211001", "211002", "214734", "214958",
             "215345", "215386", "215804", "215806", "215807",
             "221207", "221958",
             "233033",
             "242543", "242795",
             "259105",
             "268434",
             "276571", "276987", "276993"
]

# Indexed by Legacy-ID.  Each fixup may contain either the correct PR
# for this commit or an alternative summary string.  If the latter, this
# will form the entire summary for the commit.
fixups = {
    "15985": {"ignore": True},
    "37228": {"summary": "Use memset/memcmp instead of bzero/bcmp."},
    "52264": {"PR": "5373"}, # PR ok, but malformed.
    "57986": {"PR": "7792"},
    "52818": {"PR": "6479"},
    "58876": {"PR": "8481"}, # PR ok, but hard to parse.
    "65177": {"PR": "8808"},
    "65181": {"PR": "8808"},
    "68117": {"PR": "10712"},
    "68118": {"PR": "10712"},
    "71157": {"PR": "11867"},
    "71159": {"PR": "11867"},
    "71856": {"PR": "11415"},
    "75457": {"PR": "12815"},
    "81771": {"PR": "15204"},
    "82028": {"summary": "Multiple fixes: PRs 14692, 15274 and 15463"},
    "84508": {"PR": "16455"},
    "84522": {"PR": "15754"},
    "85966": {"PR": "16935"},
    "86431.1": {"summary": "Revert earlier fix for PR 14029"},
    "86431.2": {"summary": "Revert earlier fix for PR 14029"},
    "98287": {"PR": "17472"},
    "93620": {"PR": "19009"},
    "94658": {"PR": "19736"},
    "94889": {"PR": "19634"},
    "96560": {"PR": "20490"},
    "107451": {"PR": "24236"},
    "110063": {"summary": "PR25024, PR20881, PR23308, PR25538 and PR25710 - Procedure references"},
    "114667": {"PR": "27648"},
    "115858": {"PR": "28452"},
    "122127": {"PR": "30795"},
    "122980": {"PR": "31025"},
    "141612": {"PR": "32519"},
    "146134": {"PR": "38668"},
    "146593": {"PR": "39632"},
    "146650": {"PR": "39632"},
    "151112": {"PR": "28039"},
    "158393": {"PR": "43741"},
    "171924": {"PR": "48401"},
    "174228": {"PR": "46005"},
    "174235": {"PR": "46005"},
    "179230": {"PR": "45012"},
    "188077": {"PR": "52700"},
    "188078": {"PR": "52700"},
    "188079": {"PR": "52700"},
    "206073": {"PR": "35545"},
    "206074": {"PR": "35545"},
    "224527": {"PR": "56766"},
    "268301": {"PR": "89020"},
    "249711": {"summary": "Revert: backport PRs 80382, 80966"},
    "274921": {"summary": "[backport] Enable math functions linking with static library for LTO"},
    "276627": {"PR": "47054"}
    }

full_author = re.compile (r"\s*((Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+([A-Z]{3}\s+)?\d{4}|\d{4}-\d{2}-\d{2})\s+.+\s[(<][\w.]+@[\w.-]+[>)]")

def read_bugdb(dbfile):
    try:
        with open(dbfile, "r") as f:
            bugdb.update(json.load(f))
            print(len(bugdb))
            f.close()
    except:
        return

def write_bugdb(dbfile):
    try:
        os.remove(dbfile + ".bak")
    except:
        pass
    try:
        os.rename(dbfile, dbfile + ".bak")
    except:
        pass
    try:
        with open(dbfile, "w") as f:
            json.dump(bugdb, f)
            f.close()
    except:
        print("Failed to write " + dbfile)
        exit()

def fetch_sublist(sublist):
    url = 'https://gcc.gnu.org/bugzilla/rest.cgi/bug?id='
    url += ','.join([elem for elem in sublist])
    url += '&include_fields=id,summary,component,resolution'

    resp = requests.get(url)

    if resp.status_code != 200:
        print(resp.status_code)
    return resp.json()

def process(jdata):
    buglist = jdata['bugs']
    for bug in buglist:
        bugdb[str(bug['id'])] = {"summary": bug['summary'],
                                 "component": bug['component'],
                                 "resolution": bug['resolution']}

    print(len(bugdb))

def fetchall(bugids):
    f = open(bugids, "r")
    allbugids = f.read().splitlines()
    sublist = []
    for bugid in allbugids:
        if str(bugid) not in bugdb:
            #print("missing: " + bugid)
            sublist.append(bugid)
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    for fixup in fixups.values():
        if "PR" in fixup and fixup['PR'] not in bugdb:
            sublist.append(fixup['PR'])
            if len(sublist) == 200:
                process(fetch_sublist(sublist))
                sublist = []
    if sublist:
        process(fetch_sublist(sublist))

# This is lifted from the python version of reposurgeon, since we are
# parsing objects it created.

# Since we assume python2
rspolicy = None

class RepoSurgeonEmail(email.message.Message, object):
    "Specialized email message with a distinguishing starter."
    Divider = 78 * "-"
    __hash__ = None
    def __init__(self, **kwargs):
        if rspolicy:
            # This activates the Python 3 email header handling hack
            kwargs['policy'] = rspolicy
        email.message.Message.__init__(self, **kwargs)
        self.set_unixfrom(RepoSurgeonEmail.Divider)
    @staticmethod
    def readmsg(fp):
        msg = ''
        firstline = fp.readline()
        if not firstline:
            return None
        elif not firstline.startswith(RepoSurgeonEmail.Divider):
            msg = firstline
        while True:
            line = fp.readline()
            if not line:
                break
            # Not impossible for this to get spoofed, but *highly* unlikely
            if line == "\n." + RepoSurgeonEmail.Divider + "\n":
                line = "\n" + RepoSurgeonEmail.Divider + "\n"
            if line.startswith(RepoSurgeonEmail.Divider):
                break
            msg += line
        return msg
    def payload(self):
        "Return message body - works because not multipart."
        return self.get_payload()
    def __str__(self):
        out = super(RepoSurgeonEmail, self).as_string(unixfrom=True)
        out = out.replace("\n" + RepoSurgeonEmail.Divider + "\n",
                          "\n." + RepoSurgeonEmail.Divider + "\n")
        out = out.replace("\n>From", "\nFrom")
        return out

# Close states that suggest that the bug is unlikely to be a correct
# target of a fix.
badclose = ["INVALID"]
# Would be nice to add ["WORKSFORME", "WONTFIX"], but too many false
# positives on early PRs.

# Fuzzy match a claimed bugcat against the component in the PR.  This avoids
# a number of likely false negatives without over-relaxing the desire to
# detect possibly inaccurate PR numbers.
def component_mismatch(bugcat, bugid):
    # We're not bothered about case
    bugcat = bugcat.lower()
    component = bugdb[bugid]['component']
    # It's extremely unlikely that any bug classified as spam should match
    # a PR.  It's also pretty unlikely that a bug closed as INVALID would
    # do so either.  So flag up both cases.
    if (component == "spam" or bugdb[bugid]['resolution'] in badclose):
        return True
    # These blanket components lead to far too many false positives, so
    # just ignore them.  This is essentially the reverse map of the "*"
    # match later on.
    if (bugcat in ["boostrap", "bootstrap", "target", "plugins", "testsuite",
                   "other", "ice", "gcc", "translation", "regression", "build",
                   "diagnostic", "wrong-code"] or
        # These are nonsense categories, since they refer to
        # components outside of GCC.
        bugcat in ["binutils", "gas", "ld"] or
        bugcat == component):
        return False
    if component not in compalias:
        return True
    if ("*" in compalias[component] or
        bugcat in compalias[component]):
        return False
    return True

def summary_from_pr(msg, payload, lines):
    # Deal first with a first line that is a PR and a summary of its own;
    # we don't want to replace that.
    if (re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\b", lines[0])
        and not re.match(r"^\s*PR(\s+([^/\s]+)/|\s*#?)\d+\s*[-:.]?\s*$",
                         lines[0])):
        return None

    y = re.search(r"(^|\s+|[/\(\[])PR(\s+([^/\s]+)/|\s*#?)(\d+)\b",
                  payload)
    if not y:
        return None
    bugcat = y.group(3)
    bugid = y.group(4)
    if bugid not in bugdb:
        #print("PR " + bugid + " not found in database")
        return None
    summary = bugdb[bugid]['summary'].encode("utf-8")
    # Early on there were multiple GNATS databases for bugs;
    # later they were merged into a single database, but the
    # PR numbers were changed.  Don't trust bugs below 1000
    # unless the component matches the one mentioned in the
    # log (if it even has one).
    if (int(bugid) < 1000 and
        (bugcat == None or bugcat != bugdb[bugid]['component'])):
        print("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
              + "): Ignoring PR " + bugid
              + " - no component or component mismatch")
        return None
    elif (bugcat != None and msg['Legacy-ID'] not in whitelist
          and component_mismatch(bugcat, bugid)):
        problem = ("E" + msg['Event-Number'] + "(r" + msg['Legacy-ID']
                   + "): '" + bugcat + "/" + bugid + "'")
        if (bugdb[bugid]['resolution'] in badclose):
            problem += (" closed as "
                        + bugdb[bugid]['resolution'].encode("utf-8"))
            summary += (" [checkme: "
                        + bugdb[bugid]['resolution'].encode("utf-8")
                        + " SVN r" + msg['Legacy-ID'] + "]")
        else:
            problem += (" mismatch PR "
                        + bugdb[bugid]['component'].encode("utf-8"))
            summary += (" [checkme: " + bugcat + " SVN r"
                        + msg['Legacy-ID'] + "]")
        print(problem)

    z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
    if z:
        #print(summary)
        summary = z.group(1)
    summary = ("re PR " + bugdb[bugid]['component'].encode("utf-8")
               + "/" + bugid + " (" + summary + ")")
    return summary

# Similar to strip_uninteresting_lines, but used when we have a single
# line that looks potentially interesting (because it begins with
# a letter and ends with a period) and we want to decide whether we
# really want it as the final summary.
def interesting_line(line):
    if re.match(r"^\s*([Ff]ix(e[ds])? (for )?)?PR(\s+[^/\s]+/|\s*#?)(\d+)[\.:]?\s*$",
                line):
        return False
    if re.match(r"\s*([Ff]ix|[Pp]atch) from", line):
        return False
    if re.match(r"\s*([Mm]erged?|[Bb]ack[-\s]?port(ed|s)?) (\w+ )?from ",
                line):
        return False
    if re.match(r"\s*[Ss]ynchronize with", line):
        return False
    if re.match(r"\s*[Ff]rom ", line):
        return False
    if re.match(r"\s*[Ii]n\s\S+$", line):
        return False
    if re.match(r"\s*(Reviewed|Signed-off|Acked|Reported)-by:", line):
        return False
    if re.match(r"\s*Reviewed-on:", line):
        return False
    return True

def strip_uninteresting_lines(lines):
    backport = False
    revert = False
    keepgoing = True
    # Strip any blank lines, or any line containing a single word, unless
    # that word is revert.  Mostly these make uninteresting summaries.
    # Also strip lines that begin "* From ...:", this was an early way
    # of attributing a contributor who did not make a commit themselves.
    while lines:
        if re.match(r"^\s*$", lines[0]):
            pass
        elif full_author.match(lines[0]):
            # Strip any author lines
            while (len(lines) > 1
                   and re.match(r"\s+\w.*\s<[\w.]+@[\w.-]+>", lines[1])):
                lines = lines[1:]
        elif re.match(r"\s*PR\d+", lines[0]):
            keepgoing = False
        elif re.match(r"\s*\*\s*From\s.*:$", lines[0]):
            pass
        elif re.match(r"\W*([Rr]evert(ed)?|[Bb]ack[-\s]*out)(\.|.*:)?$",
                      lines[0]):
            revert = True
        elif re.match(r"\W*([Bb]ack[-\s]*port(|s|ed)|[Mm]erged?)\b",
                      lines[0]):
            backport = True
        elif re.match(r"\s*(\S+\s*|)$", lines[0]):
            pass
        else:
            keepgoing = False
        if not keepgoing:
            break
        lines = lines[1:]

    return lines, backport, revert

changemax = 80
def find_best_breakpoint(change):
    # Try, in turn, looking for ':', ';', ',' and ' - '.  The idea is
    # to find the strongest natural break within the message.
    # Avoid breaking at '):'; this is more likely to be a change hunk that
    # was malformed in some way and produces poor breaks.  Similarly for
    # '::', which is often an OO function separator.
    x = re.match(r"^(.*?[^\)\]:]:)\s", change)
    if not x:
        x = re.match(r"^(.*?;)\s", change)
    if not x:
        x = re.match(r"^(.*?,)\s", change)
    if not x:
        x = re.match(r"^(.*?\s-(\s|$))\s", change)
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]

    # Consider words that create subclauses; breaks after these are more
    # natural, but not if they are too early in the line.
    x = re.match(r"^(.*?\b(if|when|unless))(\s|$)",
                 change[changemax / 2:changemax])
    if x:
        return change[:(changemax / 2)] + x.group(1)

    x = re.match(r"^(.*\s)", change[:changemax])
    if x:
        return x.group(1)[0:len(x.group(1)) - 1]

    return change[:changemax]

badlines=[r"\s*$",
          r"\s*\*(\s|.*:(\s|$))",
          r"\s*\((\S+,\s+)*\S+\):",
          r"\s*/?(gcc|ada|testsuite|boehm-gc)(/[-\+\w])*/?:?$",
          r"\s*/?(brig|c|[cC]\+\+(filt)?|cfamily|ch)(/[-\+\w])*/?:?$",
          r"\s*/?(config|contrib|cp|doc|f|F77|fastjar)(/[-\+\w])*/?:?$",
          r"\s*/?([Ff]ortran|gcp|gfortran|gnattools)(/[-\+\w])*/?:?$",
          r"\s*/?(go|gotools|include|intl|java)(/[-\+\w])*/?:?$",
          r"\s*/?lib(ada|asan|atomic|backtrace|banshee)(/[-\+\w])*/?:?$",
          r"\s*/?lib(cc1|cpp|ffi|f2c|gcc|fortran|go|gomp)(/[-\+\w])*/?:?$",
          r"\s*/?lib(hsail-rt|iberty|io|itm|java|mudflap)(/[-\+\w])*/?:?$",
          r"\s*/?lib(objc|offloadmic|phobos|quadmath)(/[-\+\w])*/?:?$",
          r"\s*/?lib(sanitizer|ssp|stdc\+\+|vtv)(/[-\+\w])*/?:?$",
          r"\s*/?(lto|lto-plugin|maintainer-scripts|objc)(/[-\+\w])*/?:?$",
          r"\s*/?(src|zlib)(/[-\+\w])*/?:?$",
          r"\s\[gcc[^\]]*\]$",
          r"\(top level\)$"]

# Return the first sentence, and any remaining text.
def find_first_sentence(lines):
    # Never merge a PR number with additional lines
    if re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)", lines[0]):
        return lines[0], lines[1:]
    sentence = lines[0]
    x = re.match(r"\s*(.*?[\.\?!])(\s+(.*)|$)", sentence)
    if x:
        if x.group(3):
            return x.group(1), [x.group(3)] + lines[1:]
        return x.group(1), lines[1:]
    lines = lines[1:]
    while lines:
        for badline in badlines:
            if re.match(badline, lines[0]):
                return sentence, lines
        x = re.match(r"\s*(.*?\.)(\s+(.*)|$)", lines[0])
        if x:
            if x.group(3):
                return sentence + " " + x.group(1), [x.group(3)] + lines[1:]
            return sentence + " " + x.group(1), lines[1:]
        sentence += " " + re.match(r"\s*(\S.*)$", lines[0]).group(1)
        lines = lines[1:]
    return sentence, lines

def trim_excess_ws(mystr):
    # Strip any leading white space.
    mystr = re.sub(r"^\s+", "", mystr)
    # Canonicalize all white space into single space characters
    mystr = re.sub(r"\s{2,}", " ", mystr)
    # And any trailing white space
    mystr = re.sub(r"\s+$", "", mystr)
    return mystr

def simplify_funclist(funcstr):
    funcstr = trim_excess_ws(funcstr)
    # Sanity check
    if funcstr[0] != "(" or funcstr[-1] != ")":
        return funcstr
    funcstr = funcstr[1:-1]
    firstfunc = re.match(r"^([^\(]+?(\([^\)]*\))?),\s(.*)", funcstr)
    if firstfunc and len(firstfunc.group(3)) > 10:
        return " (" + firstfunc.group(1) + ", [...])"
    return " (" + funcstr + ")"

def simplify_filelist(filestr):
    filestr = trim_excess_ws(filestr)
    x = re.match(r"(\S+/)?(.+?)((,)?\s+(.*)|$)", filestr)
    if not x:
        return filestr
    sep = ", " if x.group(4) else " "
    if not x.group(5):
        return x.group(2)
    y = re.match(r"(\S+/)?(.+?)(,?\s+(.*)|$)", x.group(5))
    # If the first two 'files' do not contain a '.', then we might not have
    # a real file list.  Better to just return the original string.
    if sep == " " and not (re.match(r"\S+\.\S", x.group(2))
                           or re.match(r"\S+\.\S", y.group(2))):
        return filestr
    if y.group(4) or len(y.group(2)) > 10:
        return x.group(2) + sep + "[...]"
    return x.group(2) + sep + y.group(2)

def summary_from_change(msg, sentence):
    # Pattern to match a commit change entry in GNU ChangeLog style.  There
    # are some variants of this, but they are rare enough that we don't
    # worry too much about not handling them.
    # This RE is just too long to comprehend when written as a single
    # string...
    re_files = r"([-=\.\w\d_/\*\?\+,\s\[\]\{\}\$]+)" # 1 group
    re_funcs = r"(\s*\(.*?\))?"           # 1 group
    re_cases = r"\s*(\[[^\]]+\])?"        # 1 group (unused)
    re_case2 = r"(<[^>]+>)?"              # 1 group (unused)
    re_descr = r"\s*:(\s+(.*))?$"         # 2 groups
    x = re.match(r"\s*\*\s*" + re_files + re_funcs + re_cases + re_case2
                 + re_descr,
                 sentence)
    if not x:
        # A relatively common typing error is to leave out the closing
        # ')' from a function list.  Try again for that option.  We
        # use a simpler match pattern for this case, leaving out the
        # cases variants.
        x = re.match(r"\s*\*\s*" + re_files + r"(\s*\([^\)]*?)" + re_descr,
                     sentence)
        if not x:
            print ("Failed to parse: " + sentence)
            return None
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2) + ")")
        change = x.group(4)
    else:
        filename = simplify_filelist(x.group(1))
        funcs = simplify_funclist(x.group(2)) if x.group(2) else ""
        change = x.group(6) if x.group(6) else ""
    # Just use the basename for any filename.

    if len(change) < changemax + 5:
        return trim_excess_ws(filename + funcs + ": " + change)
    while (len(change) > changemax):
        change = find_best_breakpoint(change)
    change += "..."
    return trim_excess_ws(filename + funcs + ": " + change)

def find_best_summary(msg, lines):
    sentence, _ = find_first_sentence(lines)
    # ChangeLog entries look like "* file[ (func-list)]: change."
    if re.match(r"\s*\*\s*.+:\s", sentence):
        return summary_from_change(msg, sentence)

    return None

def already_good_summary(lines):
    while lines and lines[0] == "":
        lines = lines[1:]
    # No non-blank lines.  Nothing we can do to make this good, so best
    # to stop now.
    if not lines:
        return True
    # Anything created by cvs2svn should be left alone
    if re.search(r"by cvs2svn", lines[0]):
        return True
    # Single word lines, committer tags or simple PR tags don't make a
    # good summary
    if (full_author.match(lines[0])
        or re.match(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)\s*$", lines[0])
        or re.match(r"\s*\S+\s*$", lines[0])):
        return False
    # We likely can't make a single-line commit message better.
    if len(lines) == 1:
        return True
    if re.match(r"\s*\*", lines[0]):
        return False
    # A 'real' sentence, i.e. more than one word and ending with a
    # punctuation mark.  We don't need a blank line in this case as
    # reposurgeon's gitify will handle this for us.
    if re.match(r"\s*[-\w\[\(\{\]\)\}_,.:\']\S*\s+\S.*[.:;,?!]$", lines[0]):
        return interesting_line(lines[0])
    if not interesting_line(lines[0]):
        return False
    # If the second line is blank, treat this as good.
    if re.match(r"\s*$", lines[1]):
        return True
    # It might be a single-line summary that is wrapped onto more than
    # one line.  Handle this case as it's unlikely we can improve it.
    # If the line looks overly long, then we may want to produce a shorter
    # version, so punt.
    sentence, remnant = find_first_sentence(lines)
    if (len(sentence) < 120
        and remnant
        and re.match(r"\s*$", remnant[0])):
        return True
    return False

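As an aside, the bare-PR-tag pattern used in already_good_summary can be exercised on its own.  A minimal standalone sketch (not part of the script itself; the sample strings are invented) of what it accepts and rejects:

```python
import re

# The same pattern already_good_summary uses to reject bare PR tags:
# a lone reference such as "PR target/92140", "PR 92140" or "PR#92140"
# is too thin to serve as a commit summary on its own.
pr_tag = re.compile(r"\s*PR(\s+[^/\s]+/|\s*#?)(\d+)\s*$")

assert pr_tag.match("PR target/92140")
assert pr_tag.match("PR 92140")
assert pr_tag.match("PR#92140")
# A full sentence mentioning a PR is not a bare tag, so it survives.
assert not pr_tag.match("Fix PR 92140 in the combiner")
```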
def summary_from_sentence(sentence, remnant):
    # Ignore anything that looks like a change hunk.
    if re.match(r"\s*\*", sentence):
        return None
    if not interesting_line(sentence):
        return None
    sentence = trim_excess_ws(sentence)
    length = len(sentence)
    if length < 120 and not remnant:
        return None
    elif length < 120:
        return sentence

    return find_best_breakpoint(sentence) + "..."

def processmsg(msg):
    payload = msg.get_payload()
    summary = None
    # bugid is reported in the error path at the end of this function,
    # which can run even when the multiple-changes branch (the only
    # other place that sets it) never does, so initialise it here.
    bugid = ""

    # Ignore events tagged with "emptycommit-*", these will be deleted
    # later on.
    if 'Tag-Name' in msg and re.match(r"^emptycommit-", msg['Tag-Name']):
        return None
    # Look to see if this event has a specific fixup.  Use it if so.
    if msg['Legacy-ID'] in fixups:
        fixup = fixups[msg['Legacy-ID']]
        if 'ignore' in fixup:
            return None
        if "PR" in fixup:
            summary = bugdb[fixup['PR']]['summary'].encode("utf-8")
            z = re.match(r"\s*\[[^\]]*[Rr]egression\s*\]\s*:?\s*(.*)", summary)
            if z:
                summary = z.group(1)
            summary = ("re PR "
                       + bugdb[fixup['PR']]['component'].encode("utf-8")
                       + "/" + fixup['PR'] + " (" + summary + ")")
        elif "summary" in fixup:
            summary = fixup['summary']
        else:
            print("Invalid fixup for Legacy-ID " + msg['Legacy-ID'])
            exit()

    lines = payload.splitlines()
    # Don't mess with something that looks like a good summary.
    if not summary and already_good_summary(lines):
        return None

    while lines and re.match(r"\s*$", lines[-1]):
        lines = lines[:-1]

    # Look to see if this event is a backport of multiple PRs.  Just
    # set the summary to a quick list if so.
    if not summary:
        allprs = re.findall(r"\b\s*PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", payload)
        allprnos = list(dict.fromkeys([int(pr[2]) for pr in allprs]))
        allprnos.sort()
        if ((len(allprnos) > 3 and
             re.search(r"\b[Bb]ack[- ]?port(s|ed)?\b", payload))
            or (len(allprnos) > 1 and
                re.search(r"\b[Bb]ack[- ]?ports", payload))):
            prs = allprnos if len(allprnos) < 10 else allprnos[:10]
            summary = "Backport PRs " + str(prs[0])
            for pr in prs[1:]:
                summary += ", " + str(pr)
            if len(allprnos) > 10:
                summary += " and more"

    backport = False
    revert = False
    if not summary:
        while lines and lines[0] == "":
            lines = lines[1:]
        if not lines:
            return None
        multicommit = False
        x = full_author.match(lines[0])
        author_line = lines[0]
        if x:
            if len(lines) > 2:
                is_backport = re.search(r"\b[Bb]ack[- ]?port", payload)
                for line in lines[2:]:
                    if full_author.match(line):
                        # If this looks like a backport, allow one
                        # mismatch: we often have both the author of
                        # the original commit and the author of the
                        # backport, so it's rare to get a full match
                        # in that case.
                        if is_backport and author_line != line:
                            is_backport = None
                            author_line = line
                        elif author_line != line and lines[0] != line:
                            multicommit = True
                            break
        if multicommit:
            summary = "[multiple changes]"
            bugid = ""
            if author_line != lines[0]:
                backport = True

    if not summary:
        lines, backport, revert = strip_uninteresting_lines(lines)
        if not lines:
            return None
        sentence, remnant = find_first_sentence(lines)
        summary = summary_from_sentence(sentence, remnant)

    if not summary:
        summary = summary_from_pr(msg, payload, lines)

    if not summary:
        summary = find_best_summary(msg, lines)

    if not summary:
        return None

    preamble = ""
    if revert:
        preamble = "[revert]"
    if backport:
        preamble += "[backport]"
    if preamble != "":
        if summary[0] == "[":
            summary = preamble + summary
        else:
            summary = preamble + " " + summary

    newmsg = RepoSurgeonEmail()

    # All messages should have an event number, so this will hopefully
    # raise an exception if it fails.
    newmsg['Event-Number'] = msg['Event-Number']
    if "Check-Text" in msg:
        newmsg['Check-Text'] = msg['Check-Text']
    # We must copy the tag name if it exists.
    if "Tag-Name" in msg:
        newmsg['Tag-Name'] = msg['Tag-Name']
    try:
        newmsg.set_payload(summary + "\n\n" + payload)
        return str(newmsg)
    except Exception:
        print("Unicode??? Cannot format message for Event "
              + msg['Event-Number'] + " "
              + (summary if bugid == "" else ("PR " + bugid)))
        return None

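The findall/dict.fromkeys/sort idiom that processmsg uses for the multi-PR backport case can be illustrated standalone (the sample commit text below is invented for the example):

```python
import re

# Extract PR numbers with the same pattern processmsg uses,
# de-duplicate while preserving first-seen order (dict.fromkeys keeps
# insertion order on modern Python), then sort numerically.
sample = "Backported PR target/92140, PR 92161 and PR target/92140."
allprs = re.findall(r"\b\s*PR(\s+([^/\s]+)/|\s*#?)(\d+)\b", sample)
prnos = sorted(dict.fromkeys(int(pr[2]) for pr in allprs))
assert prnos == [92140, 92161]
```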
def main():
    scrape_bugzilla = True
    for arg in sys.argv[1:]:
        if arg == "--noscrape":
            scrape_bugzilla = False
        else:
            print("Unknown option: " + arg)
            exit(1)

    read_bugdb(bugdbfile)
    if scrape_bugzilla:
        try:
            fetchall(buglist)
        except KeyboardInterrupt:
            write_bugdb(bugdbfile)
            exit()

        write_bugdb(bugdbfile)

    commitlog = open(commits_in, "r")
    newlog = open(commits_out, "w")
    unfixed = open(commits_unfixed, "w")

    while True:
        msg = RepoSurgeonEmail.readmsg(commitlog)
        if not msg:
            break
        newmsg = processmsg(email.message_from_string(msg))
        if newmsg:
            newlog.write(newmsg)
        else:
            unfixed.write(msg)

    newlog.close()
    commitlog.close()
    unfixed.close()

if __name__ == '__main__':
    main()
