The latest outage lasted 5 days. We gave up trying to negotiate with the down
server and got someone to physically reboot it. Because we still have data in
MyISAM tables, this comes with a potential for a few issues, not the least of
which is that it can take days to rebuild the MyISAM indexes after a hard reboot
(luckily that did not happen, and we seem to be back online).
When I joined the project, one of the initial goals was to move away from
MyISAM to InnoDB (or, possibly, another DB entirely). My efforts to do that
continually run into problems:
* Some parts of the data _will not_ convert to InnoDB as-is due to differences
between MyISAM and InnoDB.
* The program I wrote to convert that data into a format that InnoDB can store
will take months to complete.
* Relatedly, I have no reason to suspect moving all that data to a different
database would take any less time.
* The only reason we need these two servers dedicated solely to the database is
the database's size.
These issues all have a common root: There is a lot of data. I might say too
much data.
CPAN Testers has accepted 100+ million test reports since it came online. Some
of these reports are for distributions no longer available on CPAN. Reports are
still being submitted for abandoned modules not updated in decades for
out-of-support Perl versions. Every development release of the Perl interpreter
gets tested against some (most? all?) of CPAN on multiple platforms. This adds
up to thousands of reports per day, and if the database were up I could check
what percentage of them are ever seen by human eyes (but my guess is 5-10%).
Even if the data is not seen by humans, it's useful in the aggregate:
Regression analysis requires as much data as possible to make its hypotheses
and suggestions. Being old does not make the data useless, either: Old
versions of modules can still be installable from CPAN, and folks are still
running old versions of Perl.
That said, timely data is more useful than untimely data. Do we need reports
submitted in 2006? Data for modules only available on BackPAN isn't actionable,
so do we need to keep that information?
In the end, irrelevant data is worse than useless: it is actively detrimental
to the site's stability (as I mentioned above). For that reason, I propose to
implement the following data retention policies:
1. Full text reports will be kept a maximum of 5 years
2. Report summaries will be kept for all distributions installable from CPAN
or, if the distribution is no longer installable from CPAN, for 5 years
    * This means that someone will still know if a distribution
    passes/fails, but if an author wants to know why, they'll have to
    reproduce it themselves
3. Along with (2), release summaries for distributions not installable from
CPAN and older than 5 years will be removed
    * This ensures that the release summaries can be rebuilt from the
    report summaries, and that there isn't a strange difference in numbers
    between the CPAN Testers website and consumers of the release data
So, this means that for all distributions available on CPAN, we will still know
the result (pass/fail/na/unknown) and which Perls and platforms it was reported
on. For the first five years
after the report's submission, one can view the entire text of the report. If
the distribution is still on CPAN, the full text report will be deleted 5 years
after it was submitted, but the summary information will remain. If the
distribution is removed from CPAN, all reports and all summary information
older than 5 years will be deleted.
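
To make the rules above concrete, here is a minimal sketch of the decision for a single report. It is an illustration only, not the tooling I will actually write: the names `Retention`, `years_ago`, and `retention_for` are hypothetical, and it assumes a report exactly 5 years old is still kept (the policy text does not pin down that boundary).

```python
from dataclasses import dataclass
from datetime import date

RETENTION_YEARS = 5  # retention window from the proposal above


@dataclass
class Retention:
    keep_full_text: bool  # full text of the report
    keep_summary: bool    # report summary (and the release summary built from it)


def years_ago(today: date, years: int) -> date:
    """Return the date `years` before `today` (Feb 29 falls back to Feb 28)."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        return today.replace(year=today.year - years, day=28)


def retention_for(submitted: date, on_cpan: bool, today: date) -> Retention:
    """Decide what to keep for one report under the proposed policy."""
    # Assumption: the cutoff day itself still counts as "within 5 years".
    recent = submitted >= years_ago(today, RETENTION_YEARS)
    return Retention(
        keep_full_text=recent,            # rule 1: full text for at most 5 years
        keep_summary=recent or on_cpan,   # rules 2 and 3: summaries while on CPAN, else 5 years
    )
```

For example, a report submitted in 2006 for a distribution still on CPAN would lose its full text but keep its summary, while the same report for a BackPAN-only distribution would be removed entirely.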
Purging report text older than 5 years will reduce the database by about half.
For the 1TB database we have now, that reduces it to a svelte 500GB. If we
purge more, we gain more, though report submissions have been increasing over
the years:
+-----------+----------+----------+----------+----------+---------+
| total     | 5y       | 4y       | 3y       | 2y       | 1y      |
+-----------+----------+----------+----------+----------+---------+
| 107822513 | 62514949 | 48597230 | 35256342 | 21516482 | 9889931 |
+-----------+----------+----------+----------+----------+---------+
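
The trade-off in the table can be worked out with a few lines of arithmetic. This sketch assumes each "Ny" column counts the reports submitted within the last N years, so everything outside that window would be purged; the names `TOTAL`, `SUBMITTED_WITHIN`, and `purged` are made up for the example.

```python
# Report counts copied from the table above.
TOTAL = 107_822_513
SUBMITTED_WITHIN = {  # assumed meaning: reports submitted in the last N years
    5: 62_514_949,
    4: 48_597_230,
    3: 35_256_342,
    2: 21_516_482,
    1: 9_889_931,
}


def purged(years: int) -> tuple[int, float]:
    """Return (count, fraction of total) of reports older than `years` years."""
    count = TOTAL - SUBMITTED_WITHIN[years]
    return count, count / TOTAL


if __name__ == "__main__":
    for years in sorted(SUBMITTED_WITHIN, reverse=True):
        count, fraction = purged(years)
        print(f"keep {years}y: purge {count:>11,} reports ({fraction:.1%})")
```

Under that reading, a 5-year cutoff purges about 42% of reports by count. That is in the same range as the "about half" estimate for disk space, though the space saved depends on report size rather than report count.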
So, questions for those affected:
* Do you look at text reports older than 5 years? 3 years? 1 year?
* Are test summaries useful to you without the full text of the report?
* Are pass/fail counts older than 5 years useful to you? 3 years? 1 year?
I'd like to implement this sooner rather than later so I can build some faster
recovery systems, but I'll leave discussion open at least a week while I
develop the tools I need to do this anyway.
Doug Bell