[ https://issues.apache.org/jira/browse/COMDEV-163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebb updated COMDEV-163:
------------------------
    Component/s: Reporter Tool

> mailglomper.py takes ages to run
> --------------------------------
>
>                 Key: COMDEV-163
>                 URL: https://issues.apache.org/jira/browse/COMDEV-163
>             Project: Community Development
>          Issue Type: Bug
>          Components: Reporter Tool
>            Reporter: Sebb
>
> mailglomper takes a very long time to run (several hours).
> This is mainly because it has to download the last 7 mailboxes for each 
> mailing list; some of these mailboxes can be quite large.
> Most of this is wasted processing because only the mailbox for the current 
> month is ever updated; once a new month starts, emails are added to the new 
> mailbox only and the earlier mailboxes are not updated further.
> It would be more efficient to cache the counts/times for the previous months 
> and use those instead of re-reading them. If the cache entry is missing, then 
> the file is read.
> How much information needs to be cached for each mailbox?
> For exact compatibility with the current code, it would be necessary to store 
> the counts for each day, but if this results in too much storage, then it 
> would be possible to store just the weekly counts. This would not affect the 
> historic weekly stats.
> However, the running quarterly stats currently allocate the email to the 
> quarterly buckets on a daily rather than weekly basis, so some precision would 
> be lost if only the weekly merged counts were available for past months.
> The cache itself would need managing to ensure that the oldest entries were 
> dropped, otherwise it would grow very large.
> Note: since contributions to the weekly buckets may come from more than one 
> month, it's likely not feasible to use the existing data. This is because the 
> current month is processed multiple times, so its data needs to be replaced 
> each time. If its first week overlaps with the last week of the previous 
> month, that would result in lost data. This problem might even affect daily 
> accumulations; it depends on exactly when the mailboxes are flipped. Having 
> separate cache entries for each monthly mailbox would also make it easier to 
> manage the cache. The downside is that it would require more storage, but the 
> cost of re-reading the historic mailboxes every day is relatively large.
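
The per-mailbox cache proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not mailglomper's actual code: the names month_keys, daily_counts and read_mailbox are hypothetical, the cache is assumed to be a plain dict keyed by list name and then by "YYYYMM" mailbox, and each entry is assumed to hold per-day counts. The window of 7 months comes from the description; the current month is always re-read, earlier months are read only on a cache miss, and entries outside the window are dropped.

```python
from collections import Counter
from datetime import date

KEEP_MONTHS = 7  # the issue says the last 7 mailboxes are read for each list


def month_keys(today, n=KEEP_MONTHS):
    """Return 'YYYYMM' keys for the current month and the n-1 before it, newest first."""
    keys = []
    year, month = today.year, today.month
    for _ in range(n):
        keys.append("%04d%02d" % (year, month))
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return keys


def daily_counts(cache, ml, today, read_mailbox):
    """Merge per-day email counts for one mailing list.

    cache:        dict of list name -> {month key -> {ISO date -> count}}
    read_mailbox: hypothetical callable(ml, month_key) -> {ISO date -> count}
                  that downloads and parses one monthly mailbox
    """
    wanted = month_keys(today)
    per_list = cache.setdefault(ml, {})
    total = Counter()
    for i, key in enumerate(wanted):
        if i == 0 or key not in per_list:
            # The current month is always re-read and its entry replaced;
            # past months are read only when the cache entry is missing.
            per_list[key] = read_mailbox(ml, key)
        total.update(per_list[key])
    # Drop entries that have fallen out of the window so the cache stays bounded.
    for key in list(per_list):
        if key not in wanted:
            del per_list[key]
    return dict(total)
```

Because each monthly mailbox has its own entry, replacing the current month's data never touches days that belong to the previous month, and weekly or quarterly buckets can be rebuilt from the merged daily counts. One caveat the sketch does not handle: an entry cached during the last run of a month may miss email that arrived after that run, so the month that has just ended may need one further re-read after the mailbox flips.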



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
