Sebb created COMDEV-163:
---------------------------
Summary: mailglomper.py takes ages to run
Key: COMDEV-163
URL: https://issues.apache.org/jira/browse/COMDEV-163
Project: Community Development
Issue Type: Bug
Reporter: Sebb
mailglomper takes a very long time to run (several hours).
This is mainly because it has to download the last 7 mailboxes for each mailing
list; some of these mailboxes can be quite large.
Most of this is wasted processing because only the mailbox for the current
month is ever updated; once a new month starts, emails are added to the new
mailbox only and the earlier mailboxes are not updated further.
It would be more efficient to cache the counts/times for the previous months
and use those instead of re-reading them. If the cache entry is missing, then
the file is read.
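A minimal sketch of that lookup, assuming the cache is a plain dict keyed by (list, year, month). The function name get_daily_counts and the read_mailbox callback are hypothetical and are not taken from the existing mailglomper code. The current month is always re-read and never cached, since it is the only mailbox that still changes.

```python
from datetime import date


def get_daily_counts(cache, list_name, year, month, today, read_mailbox):
    """Return the per-day counts for one monthly mailbox.

    cache        -- dict keyed by (list_name, year, month); hypothetical layout
    read_mailbox -- callable doing the expensive download/parse; hypothetical
    """
    key = (list_name, year, month)
    is_current = (year, month) == (today.year, today.month)

    # Past months never change, so a cache hit can be returned directly.
    if not is_current and key in cache:
        return cache[key]

    # Cache miss (or current month): fall back to reading the mailbox.
    counts = read_mailbox(list_name, year, month)
    if not is_current:
        cache[key] = counts
    return counts
```

With this shape, a run on an already populated cache would only download one mailbox per list instead of seven.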
How much information needs to be cached for each mailbox?
For exact compatibility with the current code, it would be necessary to store
the counts for each day, but if this results in too much storage, then it would
be possible to store just the weekly counts. This would not affect the historic
weekly stats.
However, the running quarterly stats currently allocate each email to the
quarterly buckets on a daily rather than weekly basis, so some precision would
be lost if only the weekly merged counts were available for past months.
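To illustrate the precision loss with made-up numbers: the sketch below assumes weeks start on Monday and that a running window simply counts every bucket dated on or after a cutoff day. Both are assumptions for the example, not a description of the current code. When the cutoff falls mid-week, the weekly bucket can only be counted or dropped as a whole.

```python
from datetime import date, timedelta


def week_start(day):
    """Monday of the week containing `day` (assumed week convention)."""
    return day - timedelta(days=day.weekday())


def merge_to_weeks(daily):
    """Collapse {date: count} into {week_start_date: count}."""
    weekly = {}
    for day, count in daily.items():
        start = week_start(day)
        weekly[start] = weekly.get(start, 0) + count
    return weekly


def count_since(buckets, cutoff):
    """Sum all buckets whose date is on or after the cutoff."""
    return sum(count for day, count in buckets.items() if day >= cutoff)


# Made-up counts for a week that straddles the window cutoff.
daily = {date(2015, 3, 30): 2, date(2015, 3, 31): 3, date(2015, 4, 1): 4}
weekly = merge_to_weeks(daily)
cutoff = date(2015, 4, 1)

exact = count_since(daily, cutoff)    # only the email(s) from 1 April
coarse = count_since(weekly, cutoff)  # whole week is dated 30 March
```

Here the daily data attributes 4 emails to the window, while the weekly data attributes none, because the single weekly bucket is dated before the cutoff.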
The cache itself would need managing to ensure that the oldest entries were
dropped, otherwise it would grow very large.
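One possible way to manage this, sketched under the assumption that the cache is a dict keyed by (list, year, month) and that only the last 7 months are ever needed. The names prune_cache and months_back are hypothetical.

```python
def months_back(year, month, n):
    """Return the (year, month) that lies n months before the given month."""
    index = year * 12 + (month - 1) - n
    return index // 12, index % 12 + 1


def prune_cache(cache, year, month, keep=7):
    """Delete entries older than the last `keep` months; return how many."""
    oldest = months_back(year, month, keep - 1)
    stale = [key for key in cache if (key[1], key[2]) < oldest]
    for key in stale:
        del cache[key]
    return len(stale)
```

Running the prune once per run, after the stats have been generated, would keep the cache bounded at roughly 7 entries per mailing list.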
Note: since contributions to the weekly buckets may come from more than one
month, it's likely not feasible to use the existing data. This is because the
current month is processed multiple times, so its data needs to be replaced
each time. If its first week overlaps with the last week of the previous month,
that would result in lost data. This problem might even affect daily
accumulations; it depends on exactly when the mailboxes are flipped. Having a
separate cache entry for each monthly mailbox would also make it easier to
manage the cache. The downside is that it would require more storage, but the
cost of re-reading the historic mailboxes every day is relatively large.
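A rough sketch of why per-mailbox entries avoid the overlap problem, using made-up counts and assuming Monday-based weeks. The weekly totals are recomputed from the per-month entries on every run, so replacing the current month's entry never discards what the previous month contributed to a shared week. The name weekly_totals is hypothetical.

```python
from datetime import date, timedelta


def weekly_totals(per_month):
    """Sum the daily counts of every monthly entry into Monday-based weeks."""
    totals = {}
    for daily in per_month.values():
        for day, count in daily.items():
            start = day - timedelta(days=day.weekday())
            totals[start] = totals.get(start, 0) + count
    return totals


# One entry per monthly mailbox; March is final, April is still growing.
per_month = {
    (2015, 3): {date(2015, 3, 30): 2, date(2015, 3, 31): 3},
    (2015, 4): {date(2015, 4, 1): 4},
}
before = weekly_totals(per_month)

# A later run re-reads April and replaces its entry wholesale.
per_month[(2015, 4)] = {date(2015, 4, 1): 4, date(2015, 4, 2): 1}
after = weekly_totals(per_month)
```

The week starting 30 March keeps March's 5 emails in both runs; only April's share changes.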
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)