CPAN Testers:
As part of my research into Magpie, we came up against a disk space
hurdle. CPT currently ingests ~25,000 tests per day. From a sample of
about 40,000 tests I determined that the average test is 9,129 bytes of
text. Stored uncompressed, that's about 223MB per day (81GB per year).
That's not sustainable, so we need to look at compression. Average
compressed size per test:
* gzip -9 = 3198 bytes
* zstd -12 = 3124 bytes
* brotli -9 = 2699 bytes
Brotli is the clear winner for compressing smallish chunks of text,
which is not surprising since that was one of its primary design goals.
Compressing with Brotli gets us down to about 66MB per day (24GB per
year), which is more reasonable.
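For anyone who wants to rerun the arithmetic, here is a minimal sketch of the per-day and per-year estimate. It assumes a flat 25,000 tests per day and decimal units (1MB = 1,000,000 bytes), so the output lands close to, but not exactly on, the rounded figures above. The name `storage_estimate` is just for illustration.

```python
# Approximate ingest rate quoted in the post.
TESTS_PER_DAY = 25_000

# Average bytes per test for each option, as measured in the post.
AVG_BYTES = {
    "uncompressed": 9_129,
    "gzip -9": 3_198,
    "zstd -12": 3_124,
    "brotli -9": 2_699,
}


def storage_estimate(avg_bytes, tests_per_day=TESTS_PER_DAY):
    """Return (MB per day, GB per year) using decimal units."""
    bytes_per_day = avg_bytes * tests_per_day
    return bytes_per_day / 1e6, bytes_per_day * 365 / 1e9


for name, size in AVG_BYTES.items():
    mb_day, gb_year = storage_estimate(size)
    ratio = AVG_BYTES["uncompressed"] / size
    print(f"{name:>12}: {mb_day:6.1f} MB/day  {gb_year:5.1f} GB/year  {ratio:.2f}x")
```

Plugging a different average size into `storage_estimate` gives the same projection for any other compression option.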
Digging further, I came across Zstandard dictionaries. They fit our use
case perfectly: compressing many small but very similar (JSON, XML,
etc.) files. I dumped the last 50,000 text test results from CPT and
created a custom 128KB dictionary file. Using that *CPT tuned*
dictionary I was able to get the average size on disk of a test result
down to 1087 bytes (27MB per day or 10GB per year).
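To show why a shared dictionary helps so much on small, similar documents, here is a rough sketch of the idea. Note that it is NOT Zstandard: it uses the preset-dictionary feature of Python's standard zlib module as a stand-in, with a hand-picked "dictionary" made of boilerplate from the sample report header in this post. A real Zstandard dictionary is trained from many samples and will behave differently; the report text and the helper names below are assumptions for illustration only.

```python
import zlib

# Boilerplate copied from the sample report header in this post. A real
# dictionary would be built from many reports, not hand-picked like this.
SHARED = (
    b"This distribution has been tested as part of the CPAN Testers\n"
    b"project, supporting the Perl programming language. See\n"
    b"http://wiki.cpantesters.org/ for more information\n"
)


def compress(report, zdict=None):
    """Deflate `report`, optionally priming the compressor with `zdict`."""
    if zdict:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(report) + c.flush()


def decompress(blob, zdict=None):
    """Inverse of compress(); the same dictionary must be supplied."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


# Hypothetical tiny report: a couple of header lines plus the boilerplate.
report = (
    b"Subject: NA Random-Simple-0.24 5.10.1 FreeBSD\n"
    b"Date: 2025-03-31T17:20:02Z\n"
) + SHARED

plain = compress(report)
primed = compress(report, SHARED)
print(len(report), len(plain), len(primed))
```

The practical catch is the same for zlib and Zstandard: whatever dictionary was used to compress a record has to be kept around, unchanged, to decompress it later.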
As we move forward with reworking the DB side of CPT we should
definitely consider Zstandard dictionaries. They are well tested,
relatively easy to use, and well supported by Perl
(<https://metacpan.org/pod/Compress::Stream::Zstd::CompressionDictionary>)
and other tools.
High-speed, database-grade cloud storage is not cheap. The more we can
reduce the raw storage we need, the better. Lower storage usage means
faster replication and quicker backups. Have you ever tried backing up
1TB of data in the cloud? Spoiler alert: it's not easy.
-- Scottchiefbaker
P.S. For bonus points, what if we reworked what we store? Do we need to
store "Thank you for uploading your work to CPAN..."? Do we need to
store the opening boilerplate paragraph?
From: metabase:user:314402c4-2aae-11df-837a-5e0a49663a4f
Subject: NA Random-Simple-0.24 5.10.1 FreeBSD
Date: 2025-03-31T17:20:02Z
This distribution has been tested as part of the CPAN Testers
project, supporting the Perl programming language. See
http://wiki.cpantesters.org/ for more information or email
questions to cpan-testers-disc...@perl.org
P.P.S. Raw numbers for reference:
perlmagpie> SELECT avg(octet_length(txt_zstd)), count(guid), grade
            FROM test_results INNER JOIN test USING (GUID)
            GROUP BY grade ORDER BY 1 asc LIMIT 30;
+-----------------------+-------+---------+
| avg                   | count | grade   |
|-----------------------+-------+---------|
| 837.1807610993657505  | 1892  | NA      |
| 862.9752690411719781  | 72015 | PASS    |
| 1286.9555979297194225 | 3671  | UNKNOWN |
| 1515.2728811352688452 | 15362 | FAIL    |
+-----------------------+-------+---------+
SELECT 4
Time: 0.223s
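If you want a single overall number from the per-grade rows above, weight each average by its row count. A small sketch, with the averages copied from the table and rounded to two decimals (the name `weighted_average` is just for illustration). This query covers more rows than the 50,000 results mentioned earlier, so the overall figure need not match the one quoted in the body.

```python
# (grade, average compressed bytes, row count) from the query output above.
ROWS = [
    ("NA", 837.18, 1892),
    ("PASS", 862.98, 72015),
    ("UNKNOWN", 1286.96, 3671),
    ("FAIL", 1515.27, 15362),
]


def weighted_average(rows):
    """Average bytes per test across all grades, weighted by row count."""
    total_bytes = sum(avg * count for _, avg, count in rows)
    total_rows = sum(count for _, _, count in rows)
    return total_bytes / total_rows


print(f"{weighted_average(ROWS):.1f} bytes per test on average")
```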