Re: [gentoo-portage-dev] Changing the VDB format
On 4/11/2022 15:20, Sid Spry wrote:
> On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
>>
>> I think json is the best format for storing the data on-disk. It's intended
>> to be a data serialization format to convert data from a non-specific memory
>> format to a storable on-disk format and back again, so this is a perfect use
>> for it.
>
> Can we avoid adding another format? I find json very hard to edit by hand;
> it's good at storing lots of data in a quasi-textual format, but is strict
> enough to be obnoxious to work with.

I sympathize with you here on this; json is not a format geared for editing
by hand...

:: looks disapprovingly at net-misc/kea ::

> Can the files not be concatenated? Doing so is similar to the tar suggestion,
> but would keep everything very portage-like. Have the contents assigned to
> variables. I am betting someone tried this at the start but settled on the
> current scheme. Does anyone know why? (This would have to be done in bash
> syntax, I assume.)
>
> Alternatively, I think the tar suggestion is quite elegant. There are
> streaming decompressors you can use from Python. It adds an extra step to
> modify, but that could be handled transparently by a dev mode. In dev mode,
> leave the files after extraction and do not re-extract; for release mode,
> replace the archive with what is on disk.

Out of curiosity, what are you doing that requires manual editing of the VDB
data? That data isn't, in normal scenarios, supposed to be arbitrarily
edited. Throwing out a good optimization for what sounds like a niche corner
case doesn't seem like a great plan. Given json's malleable format, and
especially Matt's example script, converting from json data to another format
that's more conducive to manual editing in rare circumstances, then
converting back, is not impossible.
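Joshua's round-trip idea can be sketched with nothing but the stdlib json module: pretty-print the stored data for hand-editing, then re-serialize it compactly. The field names below are made up for illustration and are not the actual VDB schema.

```python
import json

# Hypothetical round-trip: expand compact VDB json for hand-editing,
# then re-serialize it compactly. Field names are illustrative only.
compact = '{"EAPI":"8","IUSE":["ipc","native-extensions"]}'

# "Edit mode": pretty-print so the file is pleasant to change by hand.
editable = json.dumps(json.loads(compact), indent=2, sort_keys=True)

# "Store mode": parse the edited text and write it back without whitespace.
edited = json.loads(editable)
edited["EAPI"] = "7"  # an example manual edit
stored = json.dumps(edited, separators=(",", ":"), sort_keys=True)
```

The point being that the strictness Sid objects to only has to exist on disk; the editing surface can be whatever a conversion tool chooses to present.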
--
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us. And
our lives slip away, moment by moment, lost in that vast, terrible
in-between." --Emperor Turhan, Centauri Republic
Re: [gentoo-portage-dev] Changing the VDB format
On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies with many file systems. For example, the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>>
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K    .
>> $ du -sh .
>> 556K    .
>>
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>>
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>>
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>>
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>>
>> I expected to prefer toml, but I actually find it to be rather gross
>> looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>> I recommend json and think it is the best choice because:
>>
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>>   be in Python 3.11)
>> - Every programming language has multiple json parsers
>>   -- lots of effort has been spent making them extremely fast.
>>
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>>
>> Thoughts?
>
> I think json is the best format for storing the data on-disk. It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand;
it's good at storing lots of data in a quasi-textual format, but is strict
enough to be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like. Have the contents assigned to
variables. I am betting someone tried this at the start but settled on the
current scheme. Does anyone know why? (This would have to be done in bash
syntax, I assume.)

Alternatively, I think the tar suggestion is quite elegant. There are
streaming decompressors you can use from Python. It adds an extra step to
modify, but that could be handled transparently by a dev mode. In dev mode,
leave the files after extraction and do not re-extract; for release mode,
replace the archive with what is on disk.

Sid.
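The streaming decompressors Sid mentions are available in Python's standard library: tarfile's pipe modes ("r|gz", "r|bz2", "r|xz") decompress strictly sequentially, without seeking, which is what reading a packed VDB entry from a pipe or socket would require. A minimal sketch, with made-up member names standing in for VDB variables:

```python
import io
import tarfile

# Build a small gzip-compressed tar in memory, standing in for a packed
# VDB entry. Member names here are illustrative, not a proposed layout.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    for name, data in (("EAPI", b"8\n"), ("SLOT", b"0\n")):
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))

# Stream mode ("r|gz") reads the archive front to back, decompressing as
# it goes; each member must be consumed while it is the current one.
buf.seek(0)
members = {}
with tarfile.open(fileobj=buf, mode="r|gz") as tf:
    for member in tf:
        members[member.name] = tf.extractfile(member).read()
```

The same loop works unchanged over a non-seekable source such as `sys.stdin.buffer`, which is the property that makes the streaming approach attractive.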
Re: [gentoo-portage-dev] Changing the VDB format
On 3/13/2022 21:06, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies with many file systems. For example, the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
>
> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
> 418517
> $ du -sh --apparent-size .
> 413K    .
> $ du -sh .
> 556K    .
>
> During normal operations, portage has to read each of these 35+
> files/package individually.
>
> I suggest that we change the VDB format to a commonly used format that
> can be quickly read by portage and any other tools. Combining these
> 35+ files into a single file with a commonly used format should:
>
> - speed up vdb access
> - improve disk usage
> - allow external tools to access VDB data more easily
>
> I've attached a program that prints the VDB contents of a specified
> package in different formats: json, toml, and yaml (and also Python
> PrettyPrinter, just because). I think it's important to keep the VDB
> format as plain-text for ease of manipulation, so I have not
> considered anything like sqlite.
>
> I expected to prefer toml, but I actually find it to be rather gross
> looking.

Agreed, the toml output is rather "cluttered" looking.

> I recommend json and think it is the best choice because:
>
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
>   be in Python 3.11)
> - Every programming language has multiple json parsers
>   -- lots of effort has been spent making them extremely fast.
>
> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?
>
> Thoughts?

I think json is the best format for storing the data on-disk. It's intended
to be a data serialization format to convert data from a non-specific memory
format to a storable on-disk format and back again, so this is a perfect use
for it.

That said, I actually do like the yaml output as well, but I think the better
use-case for that would be a secondary tool, maybe part of portage's 'q'
commands (qpkg, qfile, qlist, etc.), to read the JSON-formatted VDB data and
export it as yaml for review. Something like
'qvdb --yaml sys-libs/glibc-2.35-r2' to dump the VDB data to stdout (and
maybe do other tasks, but that's a discussion for another thread).

As far as support for the old format goes, I think one year is too short. Two
years is preferable, though I would not be totally opposed to as long as
three years. Adoption could probably be helped by turning this vdb.py script
into something more functional that can actually walk the current VDB,
convert it to the new chosen format, and write that out to an alternate
location that a user could then transplant into /var/db/pkg after verifying
it.

One other thought -- I think there should be a tuning knob in make.conf to
enable or disable compression of the VDB's new format. The specific
compression format I leave up for debate (I'd say go with zstd, though), but
if I am running on a filesystem that supports native compression (e.g., ZFS),
I'd want to turn VDB compression off and let ZFS handle that at the
filesystem level. But on another system with, say, XFS, I'd want to turn it
on to get some benefits, especially on older hardware that's going to be more
I/O bound.

E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~345KB:

# ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
# ls -lh --block-size=1 glibc.json
-rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json
# zstd glibc.json
glibc.json : 21.70% ( 344 KiB => 74.7 KiB, glibc.json.zst)
# ls -lh --block-size=1 glibc.json.zst
-rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst

(this is on a tmpfs-formatted ramdrive)
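The make.conf knob Joshua proposes amounts to a two-line branch at write time, with format detection at read time so the knob never has to be consulted when loading. A sketch under stated assumptions: "vdb_compress" is a hypothetical setting, and lzma stands in for zstd only because it is in the stdlib (zstd bindings for Python are third-party packages).

```python
import lzma

def store_vdb_blob(payload: bytes, vdb_compress: bool) -> bytes:
    # "vdb_compress" stands in for a hypothetical make.conf knob.
    if vdb_compress:
        return lzma.compress(payload)  # filesystem without native compression
    return payload                     # e.g. ZFS compresses at the fs level

def load_vdb_blob(blob: bytes) -> bytes:
    # xz streams begin with the magic bytes FD 37 7A 58 5A 00, so the
    # on-disk form can be detected rather than configured at read time.
    if blob[:6] == b"\xfd7zXZ\x00":
        return lzma.decompress(blob)
    return blob

data = b'{"EAPI": "8"}' * 100
assert load_vdb_blob(store_vdb_blob(data, True)) == data
assert load_vdb_blob(store_vdb_blob(data, False)) == data
```

zstd frames have their own magic number (28 B5 2F FD), so the same detect-on-read scheme would carry over to Joshua's preferred format.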
Re: [gentoo-portage-dev] Changing the VDB format
On 14/03/2022 13.22, Fabian Groffen wrote:
> Hi,
>
> I've recently been thinking about this too.
>
> On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies with many file systems. For example, the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>>
>> I recommend json and think it is the best choice because:
>> [snip]
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>>   be in Python 3.11)
>> - Every programming language has multiple json parsers
>>   -- lots of effort has been spent making them extremely fast.
>
> I would like to suggest to use "tar".

Your idea sounds very appealing, and I am by no means an expert on the tar
file format, but
https://www.gnu.org/software/tar/manual/html_node/Standard.html states

"""
...an archive consists of a series of file entries terminated by an
end-of-archive entry, which consists of two 512 blocks of zero bytes.
"""

and the Wikipedia entry on 'tar' [1] states

"""
Each file object includes any file data, and is preceded by a 512-byte
header record. The file data is written unaltered except that its length is
rounded up to a multiple of 512 bytes.
"""

and furthermore

"""
The end of an archive is marked by at least two consecutive zero-filled
records.
"""

This sounds like a lot of overhead if no compression is involved. Not sure if
this can be considered a knock-out criterion for tar.

- Flow
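Flow's overhead concern can be checked directly with Python's tarfile module: pack 35 tiny members (the rough file count of one VDB entry, per Matt's numbers) into an uncompressed tar and compare the archive size against the actual content. The member names and payload are illustrative.

```python
import io
import tarfile

# Each tar member costs a 512-byte header plus data rounded up to 512
# bytes; the archive ends with two zero blocks and tarfile pads the whole
# thing to its default record size. Measure that for 35 tiny files.
payload = b"8\n"  # e.g. the one-character-plus-newline EAPI file
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tf:
    for i in range(35):
        info = tarfile.TarInfo(f"VAR{i}")  # illustrative member names
        info.size = len(payload)
        tf.addfile(info, io.BytesIO(payload))

archive_size = len(buf.getvalue())
data_size = 35 * len(payload)  # 70 bytes of actual content
```

On this sketch the archive is tens of kilobytes for 70 bytes of content, which supports Flow's point for the uncompressed case; with many real VDB files being well under 512 bytes, the overhead ratio is smaller but still significant unless compression is applied.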
Re: [gentoo-portage-dev] Changing the VDB format
> "MT" == Matt Turner writes:

MT> For example the 'EAPI' file
MT> that contains a single character will consume a 4K block on disk.

the sort of filesystems one expects to be used for /var all store short
files either directly in the inode or in the directory entry¹.

that said, Fabian’s suggestion of tar(1)ing those files sounds like a winner.

a number of tools and editors allow r/w access to tar(5) files, including to
the individual files therein.

1] or at least unix/linux filesystems all used to

-JimC
--
James Cloos
OpenPGP: 0x997A9F17ED7DAEA6
Re: [gentoo-portage-dev] Changing the VDB format
Hi,

I've recently been thinking about this too.

On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies with many file systems. For example, the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
>
> I recommend json and think it is the best choice because:
[snip]
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
>   be in Python 3.11)
> - Every programming language has multiple json parsers
>   -- lots of effort has been spent making them extremely fast.

I would like to suggest using "tar". The reasoning behind this is a bit
convoluted, but I will try to be as clear and sound as I can:

- "new style" bin-packages use tar too
- a tar file allows keeping all individual files/members, e.g. for legacy
  tools to unpack and look at the VDB that way
- a tar file allows streaming, so a single file read, for efficient
  retrieval
- a single tar file for the entire VDB allows making it "atomic"; one can
  modify tar archives later on to add new vdb entries, or perform updates
  -- again, without in-place modification (e.g. with memory backing) this
  could be done atomically
- a tar file could be used for the (rsync) tree metadata (md5-cache) in the
  same way, e.g. re-using the streaming approach, or unpacking for legacy
  tools
- a tar file could be used for the Packages file, instead of a flat file
  with keys; basically just write VDB entries with some additional keys,
  very similar in practice
- tar files are slightly easier to manage from the command line; tools to
  do so have existed for a long time and are installed (jq isn't pulled in
  by @system these days, I think)
- tar files can easily (optionally) be compressed while retaining streaming
  abilities (for these usages this is very likely to pay off: a much higher
  dictionary benefit for a single tar vs. many files)
- a single tar file is much more efficient to GPG-sign (which would allow
  some securing of the VDB, not sure if useful though)
- going back to the first point, the vdb entry from a binary package could
  simply be dropped into the vdb tar, and vice versa
- going back to metadata, dep-resolving could simply load the entire set of
  available/installed packages into memory with two reads (if the system
  has enough of that -- pretty common these days), which should allow for
  vast speedups, especially on cold(ish) filesystems.

> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?

Here I believe that with the tar format, code could initially be written so
that instead of accessing a file directly, it opens up the tar file, locates
the member it needs, and retrieves that instead. This is a bit naive, but
probably sort of manageable, and it allows having a switch that specifies
which format to write. It's easy to detect automatically which form you
have, so nothing has to change for users unless they actively make a change
for it.

Like you, I think the main reason for doing this should be performance,
basically allowing faster operations. I feel, though, that we should aim to
use a single solution to maintain the number of "trees" that we have:
metadata, vdb, Packages/binpkgs, for they all seem to exhibit similar (IO)
behaviour when being employed.

Thanks,
Fabian

--
Fabian Groffen
Gentoo on a different level
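Fabian's transition idea -- open the tar, locate the member, read just that -- maps onto tarfile's random-access mode. A minimal sketch, assuming a hypothetical per-package archive layout with VDB variable names as member names:

```python
import io
import tarfile

def make_vdb_tar() -> io.BytesIO:
    # Build a stand-in per-package VDB archive in memory; the member
    # names and contents are illustrative, not a proposed layout.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in (("EAPI", b"8\n"), ("SLOT", b"0\n")):
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf

def read_vdb_var(archive: io.BytesIO, var: str) -> str:
    # "r" mode allows random access: locate the member in the archive's
    # index, then read only that member, as Fabian describes.
    with tarfile.open(fileobj=archive, mode="r") as tf:
        return tf.extractfile(tf.getmember(var)).read().decode().rstrip("\n")

eapi = read_vdb_var(make_vdb_tar(), "EAPI")
```

Wrapping the existing one-file-per-variable reads behind a function like this would let old and new on-disk forms coexist during the transition, since the fallback to a plain `open()` is a one-line branch.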
[gentoo-portage-dev] Changing the VDB format
The VDB uses a one-file-per-variable format. This has some inefficiencies
with many file systems. For example, the 'EAPI' file that contains a single
character will consume a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
418517
$ du -sh --apparent-size .
413K    .
$ du -sh .
556K    .

During normal operations, portage has to read each of these 35+
files/package individually.

I suggest that we change the VDB format to a commonly used format that can
be quickly read by portage and any other tools. Combining these 35+ files
into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified package
in different formats: json, toml, and yaml (and also Python PrettyPrinter,
just because). I think it's important to keep the VDB format as plain-text
for ease of manipulation, so I have not considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is not.
Pipe the json output to app-misc/jq to get a better sense of its structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v '\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller, because it does not contain large amounts
of duplicated path strings in CONTENTS (which is 375320 bytes by itself, or
89% of the total size).
I recommend json and think it is the best choice because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (so is yaml, and toml will
  be in Python 3.11)
- Every programming language has multiple json parsers
  -- lots of effort has been spent making them extremely fast.

I think we would have a significant time period for the transition. I think
I would include support for the new format in Portage, and ship a tool with
portage to switch back and forth between old and new formats on-disk. Maybe
after a year, drop the code from Portage to support the old format?

Thoughts?

#!/usr/bin/env python

import argparse
import json
import pprint
import sys

import toml
import yaml

from pathlib import Path


def main(argv):
    pp = pprint.PrettyPrinter(indent=2)

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--json', action='store_true')
    group.add_argument('--toml', action='store_true')
    group.add_argument('--yaml', action='store_true')
    group.add_argument('--pprint', action='store_true')
    parser.add_argument('vdbdir', type=str)
    opts = parser.parse_args(argv[1:])

    vdb = Path(opts.vdbdir)
    if not vdb.is_dir():
        print(f'{vdb} is not a directory')
        sys.exit(-1)

    d = {}
    for file in vdb.iterdir():
        # VDB metadata files are all-uppercase; skip everything else
        # (the ebuild, environment.bz2, etc.).
        if not file.name.isupper():
            continue

        value = file.read_text().rstrip('\n')

        if file.name == "CONTENTS":
            # CONTENTS lines are "dir <path>", "obj <path> <md5> <mtime>",
            # or "sym <path> -> <target> <mtime>". Nest the paths as a
            # tree so common prefixes are not repeated.
            contents = {}
            for line in value.splitlines(keepends=False):
                (type, *rest) = line.split(sep=' ')
                parts = rest[0].split(sep='/')

                p = contents
                if type == 'dir':
                    assert len(rest) == 1
                    for part in parts[1:]:
                        p = p.setdefault(part, {})
                else:
                    for part in parts[1:-1]:
                        p = p.get(part)

                    if type == 'obj':
                        assert len(rest) == 3
                        p[parts[-1]] = {'hash': rest[1], 'mtime': rest[2]}
                    elif type == 'sym':
                        # rest[1] is the literal '->' separator.
                        assert len(rest) == 4
                        p[parts[-1]] = {'target': rest[2], 'mtime': rest[3]}

            d[file.name] = contents
        elif file.name in ('DEFINED_PHASES', 'FEATURES', 'HOMEPAGE',
                           'INHERITED', 'IUSE', 'IUSE_EFFECTIVE', 'LICENSE',
                           'KEYWORDS', 'PKGUSE', 'RESTRICT', 'USE'):
            # Space-separated list variables.
            d[file.name] = value.split(' ')
        else:
            d[file.name] = value

    if opts.json:
        json.dump(d, sys.stdout)
    if opts.toml:
        toml.dump(d, sys.stdout)
    if opts.yaml:
        yaml.dump(d, sys.stdout)
    if opts.pprint:
        pp.pprint(d)


if __name__ == '__main__':
    main(sys.argv)