[gentoo-portage-dev] Changing the VDB format

Matt Turner Sun, 13 Mar 2022 18:06:45 -0700

The VDB uses a one-file-per-variable format. This is inefficient on
many file systems. For example, the 'EAPI' file, which contains a
single character, consumes a full 4K block on disk.


$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K    .
$ du -sh .
556K    .
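
The same comparison can be made from Python. This is only a rough
sketch, not part of the attached script: disk_usage is a made-up
helper name, and it assumes a POSIX system where st_blocks is counted
in 512-byte units.

```python
from pathlib import Path


def disk_usage(directory):
    """Return (apparent_bytes, allocated_bytes) for the regular files
    directly inside directory."""
    apparent = 0
    allocated = 0
    for entry in Path(directory).iterdir():
        if not entry.is_file():
            continue
        st = entry.stat()
        # Apparent size: the number of bytes the file claims to hold.
        apparent += st.st_size
        # Allocated size: st_blocks is in 512-byte units on POSIX
        # (an assumption of this sketch), independent of the file
        # system's own block size.
        allocated += st.st_blocks * 512
    return apparent, allocated
```

Calling disk_usage('/var/db/pkg/sys-apps/portage-3.0.30-r1') should
show the same kind of gap that the two du invocations report.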

During normal operations, portage has to read each of these 35+
files per package individually.

I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily
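
To make the access-pattern difference concrete, here is a minimal
sketch of both read paths. The file name vdb.json and the two function
names are hypothetical; nothing in Portage defines them.

```python
import json
from pathlib import Path

# Hypothetical name for the combined file; purely illustrative.
COMBINED_NAME = 'vdb.json'


def read_old(pkgdir):
    """Current layout: one open/read/close per variable."""
    return {f.name: f.read_text().rstrip('\n')
            for f in Path(pkgdir).iterdir()
            if f.is_file() and f.name.isupper()}


def read_new(pkgdir):
    """Proposed layout: a single open/read/close per package."""
    with open(Path(pkgdir) / COMBINED_NAME) as f:
        return json.load(f)
```

An external tool that only wants, say, SLOT would call
read_new(pkgdir)['SLOT'] and never need to know the directory layout.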

I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain-text for ease of manipulation, so I have not
considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller than the raw files. It stores
CONTENTS as a nested tree, so each directory name appears once instead
of being repeated in the full path on every line. CONTENTS is 375320
bytes by itself, or about 89% of the directory's total size.
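
The effect of nesting on size is easy to reproduce. The sketch below
uses three made-up paths and a helper called nest (my name, not
something from the attached script) to compare a flat list of paths
with the equivalent nested dict once both are serialized as json.

```python
import json


def nest(paths):
    """Fold absolute paths into a nested dict, one level per component."""
    tree = {}
    for path in paths:
        node = tree
        for part in path.strip('/').split('/'):
            # Shared prefixes reuse the same dict, so each directory
            # name is stored only once.
            node = node.setdefault(part, {})
    return tree


# Illustrative paths only; not taken from a real CONTENTS file.
paths = [
    '/usr/lib/python3.9/site-packages/portage/__init__.py',
    '/usr/lib/python3.9/site-packages/portage/const.py',
    '/usr/lib/python3.9/site-packages/portage/data.py',
]

flat = json.dumps(paths)
nested = json.dumps(nest(paths))
print(len(flat), len(nested))
```

The more files share a deep prefix, the larger the saving, which is
why a package with many files in a few directories benefits most.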

I recommend json because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml is not, it needs the
third-party PyYAML; toml gets a read-only parser, tomllib, in Python
3.11, but no writer)
- Every programming language has multiple json parsers
-- lots of effort has been spent making them extremely fast.

We would need a significant transition period. I would include
support for the new format in Portage, and ship a tool with Portage to
convert between the old and new formats on disk. Maybe after a year,
drop the code from Portage that supports the old format?
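
As a sketch of the "switch back" half of such a tool, the function
below writes the simple variables from the combined dict back to one
file each. The name write_old_format is made up, and it rests on two
assumptions: list values were produced by splitting on a single space
(as the attached script does), and every original file ended with
exactly one newline. CONTENTS needs its own flattening step and is
skipped here.

```python
from pathlib import Path


def write_old_format(d, outdir):
    """Write each simple variable in d to its own file under outdir."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)
    for name, value in d.items():
        if name == 'CONTENTS':
            # The nested tree would have to be flattened back into
            # dir/obj/sym lines; not shown in this sketch.
            continue
        if isinstance(value, list):
            # Inverse of value.split(' ') in the attached script.
            value = ' '.join(value)
        # Assumption: one trailing newline per file.
        (out / name).write_text(value + '\n')
```

A real tool would also have to preserve the files the script ignores,
such as environment.bz2 and the ebuild itself.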

Thoughts?

#!/usr/bin/env python

import argparse
import json
import pprint
import sys
import toml  # third-party
import yaml  # third-party (PyYAML)

from pathlib import Path


def main(argv):
    pp = pprint.PrettyPrinter(indent=2)

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--json', action='store_true')
    group.add_argument('--toml', action='store_true')
    group.add_argument('--yaml', action='store_true')
    group.add_argument('--pprint', action='store_true')
    parser.add_argument('vdbdir', type=str)

    opts = parser.parse_args(argv[1:])

    vdb = Path(opts.vdbdir)
    if not vdb.is_dir():
        print(f'{vdb} is not a directory', file=sys.stderr)
        sys.exit(1)

    d = {}

    for file in vdb.iterdir():
        if not file.name.isupper():
            # print(f"Ignoring file {file.name}")
            continue

        value = file.read_text().rstrip('\n')

        if file.name == "CONTENTS":
            contents = {}

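            for line in value.splitlines(keepends=False):
                # CONTENTS lines take one of three forms:
                #   dir <path>
                #   obj <path> <md5> <mtime>
                #   sym <path> -> <target> <mtime>
                # Splitting on ' ' assumes no path contains a space.
                (kind, *rest) = line.split(sep=' ')
                parts = rest[0].split(sep='/')
                p = contents

                if kind == 'dir':
                    assert len(rest) == 1

                    for part in parts[1:]:
                        p = p.setdefault(part, {})
                else:
                    # Walk to the parent directory, creating it if its
                    # 'dir' line has not been seen yet.
                    for part in parts[1:-1]:
                        p = p.setdefault(part, {})

                if kind == 'obj':
                    assert len(rest) == 3
                    p[parts[-1]] = {'hash': rest[1], 'mtime': rest[2]}
                elif kind == 'sym':
                    assert len(rest) == 4
                    p[parts[-1]] = {'target': rest[2], 'mtime': rest[3]}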

            d[file.name] = contents

        elif file.name in ('DEFINED_PHASES', 'FEATURES', 'HOMEPAGE',
                           'INHERITED', 'IUSE', 'IUSE_EFFECTIVE', 'LICENSE',
                           'KEYWORDS', 'PKGUSE', 'RESTRICT', 'USE'):
            d[file.name] = value.split(' ')
        else:
            d[file.name] = value

    if opts.json:
        json.dump(d, sys.stdout)
    if opts.toml:
        toml.dump(d, sys.stdout)
    if opts.yaml:
        yaml.dump(d, sys.stdout)
    if opts.pprint:
        pp.pprint(d)


if __name__ == '__main__':
    main(sys.argv)
