The latest outage lasted 5 days. We gave up trying to negotiate with the down 
server and got someone to physically reboot it. Because we still have data in 
MyISAM tables, this comes with a potential for a few issues, not the least of 
which is that it can take days to rebuild the MyISAM indexes after a hard 
reboot (luckily that did not happen, and we seem to be back online).

When I joined the project, one of the initial goals was to move away from 
MyISAM to InnoDB (or, possibly, another DB entirely). My efforts to do that 
continually run into problems:

* Some parts of the data _will not_ convert to InnoDB as-is due to differences 
between MyISAM and InnoDB.
* The program I wrote to convert that data into a format that can exist in 
InnoDB will take months to complete.
* Relatedly, I have no reason to expect that moving all that data to a 
different database would take any less time.
* The only reason we need two servers dedicated solely to the database is the 
database's size.

These issues all have a common root: There is a lot of data. I might say too 
much data.

CPAN Testers has accepted 100+ million test reports since it came online. Some 
of these reports are for distributions no longer available on CPAN. Reports are 
still being submitted for abandoned modules not updated in decades for 
out-of-support Perl versions. Every development release of the Perl interpreter 
gets tested against some (most? all?) of CPAN on multiple platforms. This adds 
up to thousands of reports per day, and if the database were up I could check 
what percentage of them are ever seen by human eyes (my guess is 5-10%).

Even if the data is not seen by humans, it's useful in the aggregate: 
Regression analysis needs as much data as possible to form its hypotheses and 
suggestions. Nor does old data have to be useless: Old versions of modules can 
still be installable from CPAN, and folks are still running old versions of 
Perl.

That said, timely data is more useful than untimely data. Do we need reports 
submitted in 2006? Data for modules only available on BackPAN isn't actionable, 
so do we need to keep that information?

In the end, irrelevant data is worse than useless: It is actively detrimental 
to the site's stability (as I mentioned above). For that reason, I propose to 
implement the following data retention policies:

1. Full text reports will be kept for a maximum of 5 years.
2. Report summaries will be kept for all distributions installable from CPAN 
or, if no longer installable from CPAN, for 5 years.
        * This means that someone will still know if a distribution 
        passes/fails, but if an author wants to know why, they'll have to 
        reproduce it themselves.
3. Along with (2), release summaries for distributions not installable from 
CPAN and older than 5 years will be removed.
        * This ensures that the release summaries can be rebuilt from the 
        report summaries, and that there isn't a strange difference in numbers 
        between the CPAN Testers website and consumers of the release data.

So, this means that for all distributions available on CPAN, we will still know 
pass/fail/na/unknown and which Perls and platforms. For the first five years 
after the report's submission, one can view the entire text of the report. If 
the distribution is still on CPAN, the full text report will be deleted 5 years 
after it was submitted, but the summary information will remain. If the 
distribution is removed from CPAN, all reports and all summary information 
older than 5 years will be deleted.
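
As a rough illustration only (this is not the actual purge tooling, and the 
names `retention`, `years_ago`, and `RETENTION_YEARS` are hypothetical), the 
three rules could be expressed as a single decision per report, given its 
submission date and whether its distribution is still installable from CPAN:

```python
from datetime import date

RETENTION_YEARS = 5  # the cutoff proposed in the policy


def years_ago(today: date, years: int) -> date:
    """Return the date `years` before `today` (Feb 29 falls back to Feb 28)."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        return today.replace(year=today.year - years, day=28)


def retention(submitted: date, on_cpan: bool, today: date) -> dict:
    """Decide what survives for one report under the proposed policy."""
    recent = submitted >= years_ago(today, RETENTION_YEARS)
    return {
        "full_text": recent,           # rule 1: text is kept at most 5 years
        "summary": on_cpan or recent,  # rules 2 and 3: summaries outlive the
                                       # text only while the dist is on CPAN
    }
```

For example, a report from 2010 evaluated in 2020 would lose its full text 
either way, and would keep its summary only if `on_cpan` is true.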

Purging report text older than 5 years will reduce the database by about half. 
For the 1TB database we have now, that reduces it to a svelte 500GB. If we 
purge more, we gain more, though report submissions have been increasing over 
the years:

+-----------+----------+----------+----------+----------+---------+
| total     | 5y       | 4y       | 3y       | 2y       | 1y      |
+-----------+----------+----------+----------+----------+---------+
| 107822513 | 62514949 | 48597230 | 35256342 | 21516482 | 9889931 |
+-----------+----------+----------+----------+----------+---------+
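
A back-of-the-envelope check on these numbers can be sketched as follows. It 
assumes each "Ny" column counts the reports submitted within the last N years 
(the table does not say so explicitly), and it measures shares by report 
count, not by bytes, so it only approximates the effect on database size. The 
name `purgeable_share` is made up for this sketch.

```python
# Counts copied from the table above.
TOTAL = 107_822_513
# Assumption: each "Ny" column is the number of reports from the last N years.
WITHIN_LAST = {
    5: 62_514_949,
    4: 48_597_230,
    3: 35_256_342,
    2: 21_516_482,
    1: 9_889_931,
}


def purgeable_share(cutoff_years: int) -> float:
    """Fraction of all reports older than the cutoff (by count, not bytes)."""
    return (TOTAL - WITHIN_LAST[cutoff_years]) / TOTAL


for years in sorted(WITHIN_LAST, reverse=True):
    print(f"older than {years}y: {purgeable_share(years):.1%} of reports")
```

Under that reading, a 5-year cutoff covers roughly 42% of all reports by 
count, and shorter cutoffs cover progressively more.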

So, questions for those affected:

* Do you look at text reports older than 5 years? 3 years? 1 year?
* Are test summaries useful to you without the full text of the report?
* Are pass/fail counts older than 5 years useful to you? 3 years? 1 year?

I'd like to implement this sooner rather than later so I can build some faster 
recovery systems, but I'll leave discussion open for at least a week while I 
develop the tools I need to do this anyway.

Doug Bell
d...@preaction.me


