[ https://issues.apache.org/jira/browse/COMDEV-161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14907977#comment-14907977 ]
Sebb commented on COMDEV-161:
-----------------------------

Furthermore, the RE matches anywhere in the string, not just at the start of a line.

> mailglomper.py may count a message multiple times
> -------------------------------------------------
>
>                 Key: COMDEV-161
>                 URL: https://issues.apache.org/jira/browse/COMDEV-161
>             Project: Community Development
>          Issue Type: Bug
>          Components: Reporter Tool
>            Reporter: Sebb
>
> The mailglomper.py script counts messages by matching /Date: (.*)/.
> It is looking to match header lines of the form:
> Date: Thu, 01 May 2008 05:06:51 +0000
> However such lines are not guaranteed to be unique within a message.
> In particular SVN commit messages have a "Date:" line which matches, and the
> parsed timestamp will be much the same as the header date. For example:
> Author: cml
> Date: Wed Sep 16 19:06:03 2015
> New Revision: 1703436
> The mailbox format currently used by the ASF guarantees that each message is
> prefixed with a line in the format:
> From u...@example.com Thu May 01 05:10:32 2008
> [Lines in the message body starting "From " are prefixed as ">From "; the
> prefix is removed when messages are extracted]
> Only lines starting "From " are guaranteed not to occur in message bodies.
> The problem is trivial to fix, but it will change the generated statistics,
> particularly for mailboxes that receive SVN commit messages (Git commits use
> a different prefix for the timestamp). SVN mails will generally be counted
> twice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
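
To illustrate the double counting described in the issue, here is a minimal sketch in Python. It is not the mailglomper.py code: the sample mailbox text, the address in it, and the function names count_by_date and count_by_from are all made up for the example, and it assumes the input is scanned line by line. It compares the unanchored /Date: (.*)/ match with counting the "From " separator lines.

```python
import re

# Hypothetical mbox fragment: a single message whose body is an SVN commit
# notice. The sender address and timestamps are invented for illustration.
SAMPLE = (
    "From someone@example.org Wed Sep 16 19:06:10 2015\n"
    "Date: Wed, 16 Sep 2015 19:06:03 +0000\n"
    "Subject: svn commit: r1703436\n"
    "\n"
    "Author: cml\n"
    "Date: Wed Sep 16 19:06:03 2015\n"
    "New Revision: 1703436\n"
    "\n"
    ">From here on the body continues\n"
)


def count_by_date(text):
    """Count lines matching the unanchored pattern quoted in the issue."""
    # re.search matches anywhere in the line, so body lines count too.
    return sum(1 for line in text.splitlines() if re.search(r"Date: (.*)", line))


def count_by_from(text):
    """Count mbox separator lines, i.e. lines that start with 'From '."""
    # Body lines are escaped as '>From ', so they do not start with 'From '.
    return sum(1 for line in text.splitlines() if line.startswith("From "))


print(count_by_date(SAMPLE))  # 2: the header Date: and the SVN body Date:
print(count_by_from(SAMPLE))  # 1: only the separator line
```

With this sample the Date-based count reports two messages where the mailbox holds one, while the "From " count is unaffected by the commit body. The timestamp would then have to be taken from the separator line or from the first Date: header after it, which is a design choice the issue leaves open.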