Sebb created COMDEV-163:
---------------------------

             Summary: mailglomper.py takes ages to run
                 Key: COMDEV-163
                 URL: https://issues.apache.org/jira/browse/COMDEV-163
             Project: Community Development
          Issue Type: Bug
            Reporter: Sebb


mailglomper takes a very long time to run (several hours).

This is mainly because it has to download the last 7 mailboxes for each mailing 
list; some of these mailboxes can be quite large.

Most of this is wasted processing because only the mailbox for the current 
month is ever updated; once a new month starts, emails are added to the new 
mailbox only and the earlier mailboxes are not updated further.

It would be more efficient to cache the counts/times for the previous months 
and use those instead of re-reading them. If the cache entry is missing, then 
the file is read.
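
A minimal sketch of that lookup, not taken from mailglomper itself. The
names month_counts and read_mailbox, the "list/YYYYMM" key format and the
dict-based cache are all assumptions made for illustration; read_mailbox
stands in for whatever code currently downloads and parses one mailbox.

```python
def month_counts(cache, list_name, month, current_month, read_mailbox):
    """Return {ISO day: message count} for one monthly mailbox.

    `cache` is a plain dict (hypothetical; it could be loaded from and
    saved to a JSON file). `read_mailbox(list_name, month)` is a
    hypothetical callable that fetches the mailbox and counts its mails.
    """
    key = f"{list_name}/{month}"
    if month != current_month and key in cache:
        # Earlier month already cached: reuse the counts, skip the download.
        return cache[key]
    counts = read_mailbox(list_name, month)
    if month != current_month:
        # Only earlier months are stored; the current month still changes.
        cache[key] = counts
    return counts
```

With this shape the current month is re-read on every run, while each
earlier month is read at most once for as long as its entry stays cached.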

How much information needs to be cached for each mailbox?
For exact compatibility with the current code, it would be necessary to store 
the counts for each day, but if this results in too much storage, then it would 
be possible to store just the weekly counts. This would not affect the historic 
weekly stats.

However, the running quarterly stats currently allocate the email to the 
quarterly buckets on a daily rather than weekly basis, so some precision would 
be lost if only the weekly merged counts were available for past months.
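
To illustrate the loss of precision, here is a small sketch that merges
daily counts into weekly ones. The function name and the choice of Monday
as the start of the week are assumptions; the issue does not say how
mailglomper defines its weeks.

```python
from collections import Counter
from datetime import date, timedelta


def weekly_from_daily(daily):
    """Merge {date: count} into {first day of week: count}.

    Assumes weeks start on Monday (illustrative choice only).
    """
    weekly = Counter()
    for day, n in daily.items():
        week_start = day - timedelta(days=day.weekday())
        weekly[week_start] += n
    return dict(weekly)


# 30 Sep 2014 falls in Q3 and 1 Oct 2014 in Q4, but both land in the
# week starting Monday 29 Sep 2014.
merged = weekly_from_daily({date(2014, 9, 30): 2, date(2014, 10, 1): 5})
```

Once the two days are merged into a single weekly count, the split between
the two quarters can no longer be recovered, whereas weekly counts can
always be rebuilt from daily ones.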

The cache itself would need managing to ensure that the oldest entries are 
dropped; otherwise it would grow very large.
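
One possible way to do that pruning, sketched under assumptions: cache keys
are taken to look like "list/YYYYMM", and only the months that are still
being reported on (the last 7, as described above) are kept. The function
names are hypothetical.

```python
from datetime import date


def last_months(today, n=7):
    """Return the last `n` months as 'YYYYMM' strings, newest first."""
    year, month = today.year, today.month
    months = []
    for _ in range(n):
        months.append(f"{year:04d}{month:02d}")
        month -= 1
        if month == 0:
            year, month = year - 1, 12
    return months


def prune_cache(cache, keep_months):
    """Delete entries whose month part is not in `keep_months`.

    Assumes keys of the form 'list/YYYYMM' (hypothetical format).
    """
    keep = set(keep_months)
    for key in list(cache):
        if key.rsplit("/", 1)[1] not in keep:
            del cache[key]
    return cache
```

Calling prune_cache(cache, last_months(date.today())) at the end of each
run would keep the cache bounded by the number of lists times 7.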

Note: since contributions to the weekly buckets may come from more than one 
month, it's likely not feasible to use the existing data. This is because the 
current month is processed multiple times, so its data needs to be replaced 
each time. If its first week overlaps with the last week of the previous month, 
that would result in lost data. This problem might even affect daily 
accumulations; it depends on exactly when the mailboxes are flipped. Having 
separate cache entries for each monthly mailbox would also make it easier to 
manage the cache. The downside is that it would require more storage, but the 
cost of re-reading the historic mailboxes every day is relatively large.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
