The VDB uses a one-file-per-variable format. This is inefficient on many file systems. For example, the 'EAPI' file, which contains a single character, will consume a 4K block on disk.
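To see how much of a package's VDB directory is block-allocation overhead, something like the following rough sketch could be used. It assumes a POSIX system where `st_blocks` is reported in 512-byte units; `disk_usage` is a name made up for this example and is not part of Portage.

```python
from pathlib import Path


def disk_usage(directory):
    """Return (apparent_bytes, allocated_bytes) for the regular files in directory."""
    apparent = 0
    allocated = 0
    for entry in Path(directory).iterdir():
        if not entry.is_file():
            continue
        st = entry.stat()
        # st_size is the byte length of the file's contents.
        apparent += st.st_size
        # Assumption: st_blocks counts 512-byte units (POSIX convention),
        # independent of the file system's own block size.
        allocated += st.st_blocks * 512
    return apparent, allocated
```

Pointing it at a directory such as /var/db/pkg/sys-apps/portage-3.0.30-r1/ should give figures in the same ballpark as the two du invocations, although du also counts the directory itself.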
    $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
    $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
    418517
    $ du -sh --apparent-size .
    413K    .
    $ du -sh .
    556K    .

During normal operations, portage has to read each of these 35+ files/package individually.

I suggest that we change the VDB format to a commonly used format that can be quickly read by portage and any other tools. Combining these 35+ files into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified package in different formats: json, toml, and yaml (and also Python PrettyPrinter, just because). I think it's important to keep the VDB format as plain-text for ease of manipulation, so I have not considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross looking.

    $ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
    444663
    $ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
    385112
    $ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
    273428

toml and yaml are formatted in a human-readable manner, but json is not. Pipe the json output to app-misc/jq to get a better sense of its structure:

    $ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
    ...

Compare with the raw contents of the files:

    $ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
    378658

Yes, the json is actually smaller because it does not contain large amounts of duplicated path strings in CONTENTS (which is 375320 bytes by itself, or 89% of the total size).
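The effect of nesting on CONTENTS can be reproduced in isolation. The sketch below folds flat CONTENTS-style lines into a tree keyed by path component, so each directory name is stored once. The `nest` helper and the sample entries are made up for illustration, and it keeps the trailing fields as an opaque list rather than mirroring the attached script exactly.

```python
import json


def nest(lines):
    """Fold CONTENTS-style lines into a tree keyed by path component."""
    tree = {}
    for line in lines:
        # Assumption: no spaces in paths, as in the sample data below.
        kind, path, *rest = line.split(" ")
        *parents, leaf = path.split("/")[1:]
        node = tree
        for part in parents:
            node = node.setdefault(part, {})
        if kind == "dir":
            node.setdefault(leaf, {})
        else:
            # Keep whatever follows the path (hash, timestamp, ...) as-is.
            node[leaf] = rest
    return tree


# Hypothetical entries; the paths and field values are invented.
prefix = "/usr/lib/python3.10/site-packages/portage"
lines = [f"dir {prefix}"] + [
    f"obj {prefix}/module{i}.py d41d8cd98f00b204e9800998ecf8427e 1650000000"
    for i in range(50)
]

flat_size = len("\n".join(lines))
nested_size = len(json.dumps(nest(lines)))
print(flat_size, nested_size)
```

With 50 files under one deep directory, the nested JSON comes out noticeably smaller than the flat text, since the shared prefix is no longer repeated on every line.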
I recommend json because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not and needs the third-party PyYAML; a read-only toml parser, tomllib, will be in Python 3.11)
- Every programming language has multiple json parsers -- lots of effort has been spent making them extremely fast.

I think we would have a significant time period for the transition. I would include support for the new format in Portage, and ship a tool with portage to switch back and forth between old and new formats on-disk. Maybe after a year, drop the code from Portage to support the old format?

Thoughts?
#!/usr/bin/env python

import argparse
import json
import pprint
import sys
import toml
import yaml

from pathlib import Path


def main(argv):
    pp = pprint.PrettyPrinter(indent=2)

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--json', action='store_true')
    group.add_argument('--toml', action='store_true')
    group.add_argument('--yaml', action='store_true')
    group.add_argument('--pprint', action='store_true')
    parser.add_argument('vdbdir', type=str)

    opts = parser.parse_args(argv[1:])

    vdb = Path(opts.vdbdir)
    if not vdb.is_dir():
        print(f'{vdb} is not a directory', file=sys.stderr)
        sys.exit(1)

    d = {}

    for file in vdb.iterdir():
        # Only the upper-case files are treated as VDB variables; this skips
        # environment.bz2, repository and the saved .ebuild.
        if not file.name.isupper():
            continue

        value = file.read_text().rstrip('\n')

        if file.name == "CONTENTS":
            contents = {}

            for line in value.splitlines(keepends=False):
                # Line formats: "dir <path>", "obj <path> <md5> <mtime>" and
                # "sym <path> -> <target> <mtime>".
                # Paths containing spaces are not handled.
                (kind, *rest) = line.split(sep=' ')
                parts = rest[0].split(sep='/')

                p = contents
                if kind == 'dir':
                    assert len(rest) == 1
                    for part in parts[1:]:
                        p = p.setdefault(part, {})
                else:
                    # setdefault rather than get, so an entry whose parent
                    # directory was not listed first does not yield None.
                    for part in parts[1:-1]:
                        p = p.setdefault(part, {})

                    # The trailing field of obj and sym lines is the mtime.
                    if kind == 'obj':
                        assert len(rest) == 3
                        p[parts[-1]] = {'hash': rest[1], 'mtime': rest[2]}
                    elif kind == 'sym':
                        assert len(rest) == 4
                        p[parts[-1]] = {'target': rest[2], 'mtime': rest[3]}

            d[file.name] = contents
        elif file.name in ('DEFINED_PHASES', 'FEATURES', 'HOMEPAGE',
                           'INHERITED', 'IUSE', 'IUSE_EFFECTIVE', 'LICENSE',
                           'KEYWORDS', 'PKGUSE', 'RESTRICT', 'USE'):
            # Space-separated variables become lists.
            d[file.name] = value.split(' ')
        else:
            d[file.name] = value

    if opts.json:
        json.dump(d, sys.stdout)
    if opts.toml:
        toml.dump(d, sys.stdout)
    if opts.yaml:
        yaml.dump(d, sys.stdout)
    if opts.pprint:
        pp.pprint(d)


if __name__ == '__main__':
    main(sys.argv)
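As an illustration of how an external tool might consume the --json output, here is a minimal sketch that walks the nested CONTENTS mapping and rebuilds the full paths. It assumes the layout produced by the script above (directories are plain dicts, files carry a string 'hash', symlinks a string 'target'); `walk` and the sample data are hypothetical.

```python
import json


def walk(tree, prefix=""):
    """Yield (kind, path) pairs from a nested CONTENTS mapping."""
    for name, node in tree.items():
        path = f"{prefix}/{name}"
        # Checking for a *string* value keeps a directory that merely
        # contains an entry named "hash" or "target" from being mistaken
        # for a file or symlink.
        if isinstance(node.get("hash"), str):
            yield "obj", path
        elif isinstance(node.get("target"), str):
            yield "sym", path
        else:
            yield "dir", path
            yield from walk(node, path)


# Made-up sample in the shape the script's --json output is assumed to have.
doc = json.loads("""
{
  "EAPI": "7",
  "SLOT": "0",
  "CONTENTS": {
    "usr": {
      "bin": {
        "emerge": {"hash": "d41d8cd98f00b204e9800998ecf8427e"},
        "ebuild": {"target": "../lib/portage/bin/ebuild"}
      }
    }
  }
}
""")

for kind, path in walk(doc["CONTENTS"]):
    print(kind, path)
```

A single json.loads call replaces the 35+ individual file reads, and the flat path list can still be recovered when a tool needs it.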