CPAN Testers:

As part of my research into Magpie I ran into a disk space hurdle. CPAN Testers (CPT) currently ingests ~25,000 tests per day. From a sample of about 40,000 tests, I determined that the average test is 9,129 bytes of text. Stored uncompressed, that's 223MB per day (81GB per year). That's clearly not sustainable, so we need to look at compression. Average compressed size per test:

 * gzip -9 = 3198 bytes
 * zstd -12 = 3124 bytes
 * brotli -9 = 2699 bytes

Brotli is the clear winner for compressing smallish chunks of text. Not surprising as that was one of the primary goals when it was designed. Compressing with Brotli gets us down to 66MB per day (24GB per year) which is more reasonable for sure.
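For anyone who wants to rerun the storage math with different inputs, here is a minimal Python sketch. It assumes decimal units (1MB = 1,000,000 bytes), a 365-day year, and exactly 25,000 tests per day; the function name `storage_estimate` is made up for illustration. With exactly 25,000 tests per day the uncompressed figure comes out at about 228MB per day, a little above the 223MB quoted, so the figures in this post were presumably computed from a daily count slightly under 25,000.

```python
def storage_estimate(avg_bytes, tests_per_day=25_000, days_per_year=365):
    """Return (MB per day, GB per year) for a given average test size.

    Uses decimal units; tests_per_day is an assumed round number.
    """
    bytes_per_day = avg_bytes * tests_per_day
    mb_per_day = bytes_per_day / 1_000_000
    gb_per_year = bytes_per_day * days_per_year / 1_000_000_000
    return mb_per_day, gb_per_year


# Average sizes in bytes, taken from the measurements in this post.
averages = {
    "uncompressed": 9129,
    "gzip -9": 3198,
    "zstd -12": 3124,
    "brotli -9": 2699,
}

for name, size in averages.items():
    mb, gb = storage_estimate(size)
    print(f"{name:>12}: {mb:6.1f} MB/day, {gb:5.1f} GB/year")
```

Pass a different `tests_per_day` to see how sensitive the yearly total is to ingest volume.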

Doing some research I came across Zstandard dictionaries. They fit our use case perfectly: compressing many small but very similar files (JSON, XML, etc.). I dumped the last 50,000 text test results from CPT and created a custom 128KB dictionary file. Using that *CPT tuned* dictionary I was able to get the average on-disk size of a test result down to 1087 bytes (27MB per day or 10GB per year).
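To illustrate why a shared dictionary helps so much on small, similar inputs, here is a rough Python sketch. Note the substitutions: it uses the standard library's zlib preset-dictionary support (`zdict`) as a stand-in, not Zstandard, and the "dictionary" is a hand-built blob of report boilerplate rather than one trained on 50,000 samples. The helper names `compress` and `decompress` are hypothetical. The principle is the same: text the compressor has already seen in the dictionary costs almost nothing to encode.

```python
import zlib

# Boilerplate that appears in every report (from the sample report quoted
# in the P.S. of this post). A real dictionary would be trained, not hand-built.
BOILERPLATE = (
    b"This distribution has been tested as part of the CPAN Testers\n"
    b"project, supporting the Perl programming language.  See\n"
    b"http://wiki.cpantesters.org/ for more information or email\n"
)


def compress(report: bytes, zdict: bytes = b"") -> bytes:
    """Deflate a report, optionally priming the compressor with a dictionary."""
    if zdict:
        c = zlib.compressobj(level=9, zdict=zdict)
    else:
        c = zlib.compressobj(level=9)
    return c.compress(report) + c.flush()


def decompress(blob: bytes, zdict: bytes = b"") -> bytes:
    """Inflate a report; the same dictionary must be supplied if one was used."""
    d = zlib.decompressobj(zdict=zdict) if zdict else zlib.decompressobj()
    return d.decompress(blob) + d.flush()


report = BOILERPLATE + b"Subject: NA Random-Simple-0.24 5.10.1 FreeBSD\n"

plain = compress(report)
with_dict = compress(report, BOILERPLATE)

# Round-trip check: the dictionary is needed again at read time.
assert decompress(with_dict, BOILERPLATE) == report

print(f"raw={len(report)} plain={len(plain)} with_dict={len(with_dict)}")
```

One design consequence worth keeping in mind: the dictionary becomes part of the storage format. Every reader needs the exact dictionary that was used at write time, so it has to be versioned and backed up alongside the data.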

As we move forward with reworking the DB side of CPT we should definitely consider Zstandard dictionaries. They are well tested, relatively easy to use, and well supported by Perl (<https://metacpan.org/pod/Compress::Stream::Zstd::CompressionDictionary>) and other tools.

High-speed, database-grade cloud storage is not cheap. The more we can do to decrease the amount of raw storage we need, the better. Lower storage usage means faster replication and quicker backups. Have you ever tried backing up 1TB of data in the cloud? Spoiler alert: it's not easy.

-- Scottchiefbaker

P.S. For bonus points, what if we reworked what we store? Do we need to store "Thank you for uploading your work to CPAN..."? Do we need to store the opening boilerplate paragraph? For example:

From: metabase:user:314402c4-2aae-11df-837a-5e0a49663a4f
Subject: NA Random-Simple-0.24 5.10.1 FreeBSD
Date: 2025-03-31T17:20:02Z

This distribution has been tested as part of the CPAN Testers
project, supporting the Perl programming language.  See
http://wiki.cpantesters.org/ for more information or email
questions to cpan-testers-disc...@perl.org

P.P.S. Raw numbers for reference:

perlmagpie> SELECT avg(octet_length(txt_zstd)), count(guid), grade FROM test_results INNER JOIN test USING (GUID) GROUP BY grade ORDER BY 1 asc LIMIT 30;
+-----------------------+-------+---------+
| avg                   | count | grade   |
|-----------------------+-------+---------|
| 837.1807610993657505  | 1892  | NA      |
| 862.9752690411719781  | 72015 | PASS    |
| 1286.9555979297194225 | 3671  | UNKNOWN |
| 1515.2728811352688452 | 15362 | FAIL    |
+-----------------------+-------+---------+
SELECT 4
Time: 0.223s
