I don't know a ton about SquashFS but some reading on Wikipedia says
it's a read-only filesystem. How would CPT use SquashFS? Storing report
data?
-- Scottchiefbaker
On 5/8/25 9:04 AM, Doug Bell wrote:
Yeah, a long, long time ago I was hoping Zstd compression +
dictionaries would solve the problem. I think I had designed some
overly complex systems for doing it, though, and so never got
around to setting it up.
I did some tests w/ squashfs and got some good results as well. This
option appeals to me for its transparency: The Zstd + dictionary
approach means special tools for looking at the data, but squashfs
would work w/ a standard CLI toolkit. Those results are below.
I'm collecting (heh) up a design spec for this
<https://github.com/orgs/cpan-testers/discussions/24> in the CPAN
Testers Discussions under a new Proposal category. And then once we
isolate this problem, the rest of the problems seem almost trivial ;)
# The count of all reports
cpantesters@cpantesters4:~$ find reports-dir/_meta/timestamp -type f | xargs cat | wc -l
44987
# The total size on-disk (I'm assuming w/ extra tail blocks)
cpantesters@cpantesters4:~$ du -sh reports-dir/
614M reports-dir/
# LZ4 squashfs
Exportable Squashfs 4.0 filesystem, lz4 compressed, data block size 131072
compressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 127003.50 Kbytes (124.03 Mbytes)
30.41% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 889805 bytes (868.95 Kbytes)
39.78% of uncompressed inode table size (2237004 bytes)
Directory table size 823072 bytes (803.78 Kbytes)
36.71% of uncompressed directory table size (2242366 bytes)
# XZ squashfs (best compression)
Exportable Squashfs 4.0 filesystem, xz compressed, data block size 131072
compressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 92831.63 Kbytes (90.66 Mbytes)
22.23% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 479692 bytes (468.45 Kbytes)
21.44% of uncompressed inode table size (2237004 bytes)
Directory table size 493816 bytes (482.24 Kbytes)
22.02% of uncompressed directory table size (2242366 bytes)
# LZO squashfs
Exportable Squashfs 4.0 filesystem, lzo compressed, data block size 131072
compressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 119522.29 Kbytes (116.72 Mbytes)
28.62% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 827963 bytes (808.56 Kbytes)
37.01% of uncompressed inode table size (2237004 bytes)
Directory table size 743654 bytes (726.22 Kbytes)
33.16% of uncompressed directory table size (2242366 bytes)
# Gzip squashfs
Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072
compressed data, compressed metadata, compressed fragments, compressed xattrs
duplicates are removed
Filesystem size 111798.37 Kbytes (109.18 Mbytes)
26.77% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 627493 bytes (612.79 Kbytes)
28.05% of uncompressed inode table size (2237004 bytes)
Directory table size 581621 bytes (567.99 Kbytes)
25.94% of uncompressed directory table size
# Zstd squashfs (needed to move to a Debian 12 box to get this)
Exportable Squashfs 4.0 filesystem, zstd compressed
Filesystem size 100603.81 Kbytes (98.25 Mbytes)
24.09% of uncompressed filesystem size (417572.73 Kbytes)
Inode table size 537209 bytes (524.62 Kbytes)
24.01% of uncompressed inode table size (2237004 bytes)
Directory table size 490852 bytes (479.35 Kbytes)
21.89% of uncompressed directory table size (2242366 bytes)
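To compare the runs above side by side, here is a minimal Python sketch that ranks the compressors by the "Filesystem size" each one reported. The numbers are copied from the output above; the dictionary and variable names are only illustrative.

```python
# Filesystem sizes in Kbytes, copied from the squashfs summaries above.
UNCOMPRESSED_KB = 417572.73
SIZES_KB = {
    "lz4": 127003.50,
    "xz": 92831.63,
    "lzo": 119522.29,
    "gzip": 111798.37,
    "zstd": 100603.81,
}

# Sort smallest image first and compute each ratio against the uncompressed size.
ranking = sorted(SIZES_KB.items(), key=lambda item: item[1])
for name, size_kb in ranking:
    ratio = 100 * size_kb / UNCOMPRESSED_KB
    print(f"{name:5s} {size_kb / 1024:8.2f} Mbytes  {ratio:5.2f}% of uncompressed")
```

On these figures xz produces the smallest image, with zstd close behind. The sketch only ranks by size; it says nothing about compression or read speed, which were not measured above.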
Doug Bell
d...@preaction.me
On May 5, 2025, at 6:23 PM, Scott Baker <sc...@perturb.org> wrote:
CPAN Testers:
As part of my research into Magpie we came up against a disk space
hurdle. Currently CPT is ingesting ~25,000 tests per day. After
capturing a sampling of about 40,000 tests I was able to determine
that the average test is 9,129 bytes of text. If we store
uncompressed text that's 223MB per day (81GB per year). Clearly
that's not very sustainable so we need to look at compression.
* gzip -9 = 3198 bytes
* zstd -12 = 3124 bytes
* brotli -9 = 2699 bytes
Brotli is the clear winner for compressing smallish chunks of text.
Not surprising as that was one of the primary goals when it was
designed. Compressing with Brotli gets us down to 66MB per day (24GB
per year) which is more reasonable for sure.
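The storage projections above are simple multiplication, and a short Python sketch makes them easy to re-run with different inputs. It assumes exactly 25,000 tests per day and decimal units (1 MB = 10^6 bytes), so its output can differ by a few percent from the rounded figures quoted above. The function and variable names are illustrative only.

```python
TESTS_PER_DAY = 25_000  # approximate ingest rate quoted above (assumed exact here)

# Average bytes per test for each option, as measured above.
AVG_BYTES = {
    "uncompressed": 9129,
    "gzip -9": 3198,
    "zstd -12": 3124,
    "brotli -9": 2699,
}

def daily_bytes(avg_bytes: int, tests_per_day: int = TESTS_PER_DAY) -> int:
    """Bytes written per day at the given average report size."""
    return avg_bytes * tests_per_day

for name, avg in AVG_BYTES.items():
    per_day = daily_bytes(avg)
    per_year = per_day * 365
    print(f"{name:13s} {per_day / 1e6:7.1f} MB/day  {per_year / 1e9:6.1f} GB/year")
```

Changing `TESTS_PER_DAY` or adding another entry to `AVG_BYTES` is enough to project a different ingest rate or compressor.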
Doing some research I came across Zstandard dictionaries. Zstandard
dictionaries fit our use case perfectly: compressing many small but
very similar (json, xml, etc.) files. I dumped the last 50,000 text
test results from CPT and created a custom 128KB dictionary file.
Using that *CPT tuned* dictionary I was able to get the average size
on disk of a test result down to 1087 bytes (27MB per day or 10GB per
year).
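To illustrate why a shared dictionary helps with many small, similar documents, here is a minimal Python sketch. It does not use Zstandard: it swaps in the preset-dictionary feature of zlib from the Python standard library, which follows the same idea on a much smaller scale. The dictionary contents and the sample report are made up for the example, and the helper names are hypothetical.

```python
import zlib

# Hypothetical dictionary: boilerplate that appears in many reports.
SHARED = (
    b"This distribution has been tested as part of the CPAN Testers "
    b"project, supporting the Perl programming language. "
)

def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress data using a preset dictionary known to both sides."""
    compressor = zlib.compressobj(level=9, zdict=zdict)
    return compressor.compress(data) + compressor.flush()

def decompress_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress data that was compressed with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()

# A small made-up report that starts with the shared boilerplate.
report = SHARED + b"Subject: NA Random-Simple-0.24 5.10.1 FreeBSD\n"

plain = zlib.compress(report, 9)
with_dict = compress_with_dict(report, SHARED)

print(len(report), len(plain), len(with_dict))
assert decompress_with_dict(with_dict, SHARED) == report
```

The dictionary has to be stored once and kept available for every read, which is the trade-off of this approach: the compressed blobs are not readable without it.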
As we move forward with reworking the DB side of CPT we should
definitely consider Zstandard dictionaries. They are well tested,
relatively easy to use, and well supported
<https://metacpan.org/pod/Compress::Stream::Zstd::CompressionDictionary>
by Perl and other tools.
High-speed, database-grade cloud storage is not cheap. Whatever we can
do to decrease the amount of raw storage we need, the better. Lower
storage usage means faster replication and quicker backups. Have you
ever tried backing up 1TB of data in the cloud? Spoiler alert: it's
not easy.
-- Scottchiefbaker
P.S. For bonus points, what if we reworked what we store? Do we need
to store "Thank you for uploading your work to CPAN..."? Do we need to
store the opening boilerplate paragraph?
From: metabase:user:314402c4-2aae-11df-837a-5e0a49663a4f
Subject: NA Random-Simple-0.24 5.10.1 FreeBSD
Date: 2025-03-31T17:20:02Z
This distribution has been tested as part of the CPAN Testers
project, supporting the Perl programming language. See
http://wiki.cpantesters.org/ for more information or email
questions to cpan-testers-disc...@perl.org
P.P.S. Raw numbers for reference:
perlmagpie> SELECT avg(octet_length(txt_zstd)), count(guid), grade
FROM test_results INNER JOIN test USING (GUID) GROUP BY grade ORDER
BY 1 asc LIMIT 30;
+-----------------------+-------+---------+
| avg                   | count | grade   |
|-----------------------+-------+---------|
| 837.1807610993657505  | 1892  | NA      |
| 862.9752690411719781  | 72015 | PASS    |
| 1286.9555979297194225 | 3671  | UNKNOWN |
| 1515.2728811352688452 | 15362 | FAIL    |
+-----------------------+-------+---------+
SELECT 4
Time: 0.223s