[
https://issues.apache.org/jira/browse/COMDEV-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebb updated COMDEV-163:
------------------------
Component/s: Reporter Tool
> mailglomper.py takes ages to run
> --------------------------------
>
> Key: COMDEV-163
> URL: https://issues.apache.org/jira/browse/COMDEV-163
> Project: Community Development
> Issue Type: Bug
> Components: Reporter Tool
> Reporter: Sebb
>
> mailglomper takes a very long time to run (several hours).
> This is mainly because it has to download the last 7 mailboxes for each
> mailing list; some of these mailboxes can be quite large.
> Most of this is wasted processing because only the mailbox for the current
> month is ever updated; once a new month starts, emails are added to the new
> mailbox only and the earlier mailboxes are not updated further.
> It would be more efficient to cache the counts/times for the previous months
> and use those instead of re-reading them. If the cache entry is missing, then
> the file is read.
> How much information needs to be cached for each mailbox?
> For exact compatibility with the current code, it would be necessary to store
> the counts for each day, but if this results in too much storage, then it
> would be possible to store just the weekly counts. This would not affect the
> historic weekly stats.
> However, the running quarterly stats currently allocate emails to the
> quarterly buckets on a daily rather than weekly basis, so some precision would
> be lost if only the weekly merged counts were available for past months.
> The cache itself would need managing to ensure that the oldest entries were
> dropped, otherwise it would grow very large.
> Note: since contributions to the weekly buckets may come from more than one
> month, it's likely not feasible to use the existing data. This is because the
> current month is processed multiple times, so its data needs to be replaced
> each time. If its first week overlaps with the last week of the previous
> month, that would result in lost data. This problem might even affect daily
> accumulations; it depends on exactly when the mailboxes are flipped. Having a
> separate cache entry for each monthly mailbox would also make it easier to
> manage the cache. The downside is that it would require more storage, but the
> cost of re-reading the historic mailboxes every day is relatively large.
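
A minimal sketch of the per-mailbox cache described above, not taken from mailglomper.py itself. It assumes daily counts are stored per mailing list and month, that the current month is never cached (so a finished month is first cached on a later run, once it is no longer the current month), and that entries outside the 7-month window are dropped. The names last_months, daily_counts and read_mailbox are hypothetical; read_mailbox stands in for whatever code downloads and parses one monthly mailbox.

```python
from collections import Counter
from datetime import date


def last_months(today, n=7):
    """Return "YYYYMM" keys for the last n months, newest first."""
    keys = []
    year, month = today.year, today.month
    for _ in range(n):
        keys.append("%04d%02d" % (year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys


def daily_counts(cache, mlist, today, read_mailbox, months=7):
    """Merge the per-day email counts for the last `months` mailboxes.

    cache        -- dict mapping "list/YYYYMM" to {"YYYY-MM-DD": count}
    read_mailbox -- hypothetical callable (mlist, "YYYYMM") returning
                    {"YYYY-MM-DD": count} for that monthly mailbox
    """
    wanted = last_months(today, months)
    current = wanted[0]
    merged = Counter()
    for month in wanted:
        key = "%s/%s" % (mlist, month)
        if month == current:
            # The current mailbox is still growing: always re-read it
            # and never cache it, so no partial month ends up stored.
            counts = read_mailbox(mlist, month)
        else:
            if key not in cache:
                # Cache miss: read the historic mailbox once.
                cache[key] = read_mailbox(mlist, month)
            counts = cache[key]
        merged.update(counts)
    # Drop this list's entries that have fallen out of the window,
    # so the cache does not grow without bound.
    keep = set("%s/%s" % (mlist, m) for m in wanted)
    prefix = mlist + "/"
    for key in [k for k in cache if k.startswith(prefix) and k not in keep]:
        del cache[key]
    return dict(merged)
```

Because each month is a separate entry holding daily counts, the weekly and quarterly buckets can be rebuilt from the merged result on every run, and replacing the current month's data cannot disturb a week that straddles a month boundary. The cache is a plain dict of strings and integers, so it could be persisted between runs, for example as JSON. Whether a finished month is complete when it is first cached still depends on exactly when the mailboxes are flipped.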