Hi, I've recently been thinking about this too.
On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
>
> I recommend json and think it is the best choice because:
[snip]
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
>   be in Python 3.11)
> - Every programming language has multiple json parsers
>     -- lots of effort has been spent making them extremely fast.

I would like to suggest using "tar".  The reasoning behind this is a
bit convoluted, but I'll try to be as clear and sound as I can:
- "new style" bin-packages use tar too
- a tar-file keeps all individual files/members, so e.g. legacy tools
  can unpack it and look at the VDB that way
- a tar-file allows streaming, so a single file read, for efficient
  retrieval
- a single tar-file for the entire VDB allows making changes "atomic":
  one can modify tar archives later on to add new vdb entries or
  perform updates -- again, when this is not done in place (e.g. with
  memory backing) it could be done atomically
- a tar-file could be used for (rsync) tree metadata (md5-cache) in the
  same way, e.g. re-use the streaming approach, or unpack for legacy
  tools
- a tar-file could be used for the Packages file, instead of a flat
  file with keys: basically just write VDB entries with some
  additional keys, very similar in practice
- tar-files are slightly easier to manage from the command line; tools
  to do so have existed for a long time and are already installed
  (jq isn't pulled in by @system these days, I think)
- tar-files can easily (optionally) be compressed while retaining
  streaming abilities (for these usages this is very likely to pay
  off), with a much higher dictionary benefit for a single tar vs many
  files
- a single tar-file is much more efficient to GPG-sign (which would
  allow some securing of the VDB, not sure if useful though)
- going back to the first point, a vdb entry from a binary package
  could simply be dropped into the vdb tar, and vice-versa
- going back to metadata, dep-resolving could simply load the entire
  set of available/installed packages into memory with two reads (if
  the system has enough memory -- pretty common these days), which
  should allow for vast speedups, especially on cold(ish) filesystems

> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?

Here I believe that with the tar format, code could initially be
written so that, instead of accessing a file directly, it opens the
tar-file, locates the member it needs, and retrieves that instead.
This is a bit naive, but probably sort of manageable, and allows having
a switch that specifies which format to write.  It's easy to detect
automatically which form you have.  I.e. nothing has to change for
users unless they actively make a change for it.

Like you, I think the main reason for doing this should be performance,
basically allowing faster operations.  I feel though that we should aim
to use a single solution to maintain the number of "trees" that we
have: metadata, vdb, Packages/binpkgs, for they all seem to exhibit
similar (IO) behaviour when being used.

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level
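
P.S. To make the transition idea above a bit more concrete, here is a
minimal sketch in Python using only the standard-library tarfile
module.  It is an illustration under assumptions, not Portage code:
the function names are hypothetical, and the member naming scheme
"<category>/<pf>/<key>" inside the archive is assumed, nothing has
been decided about it.  It shows format auto-detection (directory vs
tar-file), reading a single member, and loading everything in one
sequential pass.

```python
import io
import os
import tarfile


def write_vdb_tar(vdb_path, entries):
    """Write {cpv: {key: value}} as a single uncompressed tar archive.

    Hypothetical helper; members are named "<cpv>/<key>" (an assumption).
    """
    with tarfile.open(vdb_path, "w") as tar:
        for cpv, variables in entries.items():
            for key, value in variables.items():
                data = value.encode("utf-8")
                info = tarfile.TarInfo(name=f"{cpv}/{key}")
                info.size = len(data)
                tar.addfile(info, io.BytesIO(data))


def read_vdb_entry(vdb_path, cpv, key):
    """Return one VDB variable as text, whichever layout is on disk."""
    member = f"{cpv}/{key}"
    if os.path.isdir(vdb_path):
        # Legacy layout: one small file per variable.
        with open(os.path.join(vdb_path, member), encoding="utf-8") as f:
            return f.read()
    # Tar layout: locate the member and read it without unpacking.
    # "r:*" lets tarfile detect compression transparently.
    with tarfile.open(vdb_path, "r:*") as tar:
        with tar.extractfile(member) as f:
            return f.read().decode("utf-8")


def load_vdb_tar(vdb_path):
    """Read every regular member in one pass: {cpv: {key: value}}."""
    entries = {}
    with tarfile.open(vdb_path, "r:*") as tar:
        for info in tar:
            if not info.isfile():
                continue
            cpv, _, key = info.name.rpartition("/")
            data = tar.extractfile(info).read()
            entries.setdefault(cpv, {})[key] = data.decode("utf-8")
    return entries
```

Callers would only ever use read_vdb_entry(), so the same code path
works before and after a user converts their VDB.  Note that this
naive version reopens and scans the archive on every lookup; a real
implementation would want to keep an index or the loaded dict around.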