On 3/13/2022 21:06, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> 
> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
> print sum }'
> 418517
> $ du -sh --apparent-size .
> 413K    .
> $ du -sh .
> 556K    .
> 
> During normal operations, portage has to read each of these 35+
> files/package individually.
> 
> I suggest that we change the VDB format to a commonly used format that
> can be quickly read by portage and any other tools. Combining these
> 35+ files into a single file with a commonly used format should:
> 
> - speed up vdb access
> - improve disk usage
> - allow external tools to access VDB data more easily
> 
> I've attached a program that prints the VDB contents of a specified
> package in different formats: json, toml, and yaml (and also Python
> PrettyPrinter, just because). I think it's important to keep the VDB
> format as plain-text for ease of manipulation, so I have not
> considered anything like sqlite.
> 
> I expected to prefer toml, but I actually find it to be rather gross looking.

Agreed, the toml output is rather "cluttered" looking.


> I recommend json and think it is the best choice because:
> 
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
> be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.
> 
> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?
> 
> Thoughts?

I think json is the best format for storing the data on-disk.  It's intended
to be a data serialization format to convert data from a non-specific memory
format to a storable on-disk format and back again, so this is a perfect use
for it.

That said, I actually do like the yaml output as well, but I think the
better use-case for that would be a secondary tool, perhaps alongside the
portage-utils 'q' commands (qpkg, qfile, qlist, etc), that reads the
JSON-formatted VDB data and exports it as yaml for review.  Something
like 'qvdb --yaml sys-libs/glibc-2.35-r2' to dump the VDB data to stdout
(and maybe do other tasks, but that's a discussion for another thread).

As far as support for the old format goes, I think one year is too short.
Two years is preferable, and I would not be opposed to as long as three
years.  Adoption could probably be helped by turning this vdb.py script into
something more functional that walks the current VDB, converts it to the
chosen format, and writes the result to an alternate location that a user
can then transplant into /var/db/pkg after verifying it.
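
A rough sketch of what that conversion step might look like, purely for
illustration.  It does not reflect the attached vdb.py; the function names,
the one-JSON-file-per-package layout, and the ".b64" key suffix for binary
members are all assumptions made up for this example.

```python
import base64
import json
from pathlib import Path


def read_vdb_entry(pkg_dir):
    """Collect every regular file in one VDB package directory into a dict."""
    entry = {}
    for path in sorted(Path(pkg_dir).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            # Text members: keep as strings, dropping the trailing newline.
            entry[path.name] = raw.decode("utf-8").rstrip("\n")
        except UnicodeDecodeError:
            # Binary members (e.g. environment.bz2) cannot go into JSON as-is.
            # Hypothetical convention: base64 under "<name>.b64".
            entry[path.name + ".b64"] = base64.b64encode(raw).decode("ascii")
    return entry


def convert_vdb(vdb_root, out_root):
    """Walk <vdb_root>/<category>/<package> and write one JSON file each.

    Output goes to <out_root>/<category>/<package>.json so the original
    tree is left untouched until the user has verified the result.
    """
    written = []
    for pkg_dir in sorted(Path(vdb_root).glob("*/*")):
        if not pkg_dir.is_dir():
            continue
        target = Path(out_root) / pkg_dir.parent.name / (pkg_dir.name + ".json")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(
            json.dumps(read_vdb_entry(pkg_dir), sort_keys=True),
            encoding="utf-8",
        )
        written.append(target)
    return written
```

Calling convert_vdb("/var/db/pkg", "/tmp/vdb-json") would leave the live VDB
alone and put the converted tree somewhere it can be inspected first.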

One other thought -- I think there should be a tuning knob in make.conf to
enable or disable compression of the new VDB format.  The specific compression
format I leave up for debate (I'd say go with zstd, though), but if I am
running on a filesystem that supports native compression (e.g., ZFS), I'd
want to turn VDB compression off and let ZFS handle that at the filesystem
level.  But on another system with, say, XFS, I'd want to turn it on to get
some benefits, especially on older hardware that's going to be more I/O bound.

E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~344 KiB:

# ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
# ls -lh --block-size=1 glibc.json
-rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json

# zstd glibc.json
glibc.json           : 21.70%   (   344 KiB =>   74.7 KiB, glibc.json.zst)

# ls -lh --block-size=1 glibc.json.zst
-rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst

(this is on a tmpfs mount)
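
To make the compression knob concrete, here is a minimal sketch of how a
reader/writer pair could honor such a setting.  Everything in it is an
assumption for illustration: the function names are made up, no make.conf
variable for this exists today, and gzip from the Python standard library
stands in for zstd only so that the example runs without extra dependencies.

```python
import gzip
import json
from pathlib import Path


def write_entry(entry, path, compress):
    """Write one VDB entry as JSON, gzip-compressed when `compress` is true.

    `compress` would come from the (hypothetical) make.conf setting.
    Returns the path actually written, which gains a ".gz" suffix
    when compression is on.
    """
    data = json.dumps(entry, sort_keys=True).encode("utf-8")
    path = Path(path)
    if compress:
        path = path.with_name(path.name + ".gz")
        data = gzip.compress(data)
    path.write_bytes(data)
    return path


def read_entry(path):
    """Read an entry back, choosing the decoder from the file suffix."""
    path = Path(path)
    data = path.read_bytes()
    if path.suffix == ".gz":
        data = gzip.decompress(data)
    return json.loads(data.decode("utf-8"))
```

Keying the reader off the file suffix rather than the setting means a VDB
with a mix of compressed and uncompressed entries stays readable after the
knob is flipped.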

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic
