Perhaps dump it all out to sql, xz it up, then restore it later if someone really wants it?
Or split up the sql dump and process it on a number of machines?

Dean

On 2019-10-18 04:33, Doug Bell wrote:
> The latest outage lasted 5 days. We gave up trying to negotiate with the down
> server and got someone to physically reboot it. Because we still have data in
> MyISAM tables, this comes with a potential for a few issues, not the least of
> which is it can take days to rebuild the MyISAM indexes after a hard reboot
> (luckily that did not happen, and we seem to be back online).
>
> When I joined the project, one of the initial goals was to move away from
> MyISAM on to InnoDB (or, possibly, another DB entirely). My efforts to do
> that continually run in to problems:
>
> * Some parts of the data _will not_ convert to InnoDB as-is due to
>   differences between MyISAM and InnoDB.
> * The program I wrote to modify that data to a different format which can
>   exist in InnoDB will take months to complete.
> * Relatedly, I have no reason to suspect moving all that data to a different
>   database would take any less time.
> * The only reason we need these two servers specifically and solely dedicated
>   to the database is because of the database's size
>
> These issues all have a common root: There is a lot of data. I might say too
> much data.
>
> CPAN Testers has accepted 100+ million test reports since it came online.
> Some of these reports are for distributions no longer available on CPAN.
> Reports are still being submitted for abandoned modules not updated in
> decades for out-of-support Perl versions. Every development release of the
> Perl interpreter gets tested against some (most? all?) of CPAN on multiple
> platforms. This adds up to thousands of reports per day, and if the database
> was up I could check what percentage of them are ever visited by human eyes
> (but my guess is 5-10%).
>
> Even if the data is not seen by humans, it's useful in the aggregate:
> Regression analysis requires as much data as possible to make its hypotheses
> and suggestions.
> Even if the data is old does not mean it's useless: Old
> versions of modules can still be installable from CPAN, and folks are still
> running old versions of Perl.
>
> That said, timely data is more useful than untimely data. Do we need reports
> submitted in 2006? Data for modules only available on BackPAN isn't
> actionable, so do we need to keep that information?
>
> In the end, irrelevant data is worse than useless, it is actively detrimental
> to the site's stability (as I mentioned above). For that reason, I propose to
> implement the following data retention policies:
>
> 1. Full text reports will be kept a maximum of 5 years
> 2. Report summaries will be kept for all distributions installable from CPAN,
>    or if no longer installable from CPAN, 5 years
>    * This means that someone will still know if a distribution passes/fails, but
>      if an author wants to know why they'll have to reproduce it themselves
> 3. Along with (2), release summaries for distributions not installable from
>    CPAN and older than 5 years will be removed
>    * This ensures that the release summaries can be rebuilt from the report
>      summaries, and that there isn't a strange difference in numbers between the
>      CPAN Testers website and consumers of the release data
>
> So, this means that for all distributions available on CPAN, we will still
> know pass/fail/na/unknown and which Perls and platforms. For the first five
> years after the report's submission, one can view the entire text of the
> report. If the distribution is still on CPAN, the full text report will be
> deleted 5 years after it was submitted, but the summary information will
> remain. If the distribution is removed from CPAN, all reports and all summary
> information older than 5 years will be deleted.
>
> Purging report text older than 5 years will reduce the database by about
> half. For the 1TB database we have now, that reduces it to a svelte 500GB.
> If we purge more, we gain more, though report submissions have been increasing
> over the years:
>
> +-----------+----------+----------+----------+----------+---------+
> | total     | 5y       | 4y       | 3y       | 2y       | 1y      |
> +-----------+----------+----------+----------+----------+---------+
> | 107822513 | 62514949 | 48597230 | 35256342 | 21516482 | 9889931 |
> +-----------+----------+----------+----------+----------+---------+
>
> So, questions for those affected:
>
> * Do you look at text reports older than 5 years? 3 years? 1 year?
> * Are test summaries useful to you without the full text of the report?
> * Are pass/fail counts older than 5 years useful to you? 3 years? 1 year?
>
> I'd like to implement this sooner rather than later so I can build some
> faster recovery systems, but I'll leave discussion open at least a week while
> I develop the tools I need to do this anyway.
>
> Doug Bell
> d...@preaction.me
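
For anyone who wants to check their reading of the proposed retention rules, here is a minimal Python sketch of the decision they seem to describe. It is an illustration only, not code from CPAN Testers: the names `Retention`, `retention_for`, `years_before` and the `on_cpan` flag are hypothetical, and treating a report submitted exactly five years ago as still inside the window is an assumption the proposal does not settle.

```python
from dataclasses import dataclass
from datetime import date

# Retention window taken from the proposal quoted above.
RETENTION_YEARS = 5


@dataclass(frozen=True)
class Retention:
    """What would be kept for a single report (hypothetical structure)."""
    keep_full_text: bool
    keep_summary: bool


def years_before(today: date, years: int) -> date:
    """Return the date `years` years before `today`, clamping Feb 29 to Feb 28."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # `today` is Feb 29 and the target year is not a leap year.
        return today.replace(year=today.year - years, day=28)


def retention_for(submitted: date, on_cpan: bool, today: date) -> Retention:
    """Apply the proposed rules to one report.

    Rule 1: full text is kept for at most RETENTION_YEARS.
    Rules 2 and 3: the summary is kept while the distribution is installable
    from CPAN; otherwise it is kept for RETENTION_YEARS only.
    """
    cutoff = years_before(today, RETENTION_YEARS)
    recent = submitted >= cutoff  # assumption: the boundary day is still kept
    return Retention(keep_full_text=recent, keep_summary=recent or on_cpan)
```

Under this reading, a 2006 report for a distribution still on CPAN keeps its summary but loses its full text, while the same report for a BackPAN-only distribution loses both.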