Hi,

I've recently been thinking about this too.

On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> I recommend json and think it is the best choice because:

[snip]

> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (so is yaml, and toml will
> be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.

I would like to suggest using "tar".  The reasoning behind this is a bit
convoluted, but I'll try to be as clear and sound as I can:
- "new style" bin-packages use tar too
- a tar-file keeps all individual files as members, so legacy tools can
  still unpack the VDB and look at it that way
- a tar-file allows streaming, i.e. a single sequential file read, for
  efficient retrieval
- a single tar-file for the entire VDB allows updates to be made
  "atomic": one can modify tar archives later on to add new vdb entries
  or perform updates -- as long as the archive is not modified in place
  (e.g. it is rebuilt with memory backing and renamed over the old one),
  this can be done atomically; a sketch of such an update follows after
  this list
- a tar-file could be used for the (rsync) tree metadata (md5-cache) in
  the same way, e.g. reusing the streaming approach, or unpacking for
  legacy tools
- a tar-file could be used for the Packages file instead of a flat file
  with keys; basically just write VDB entries with some additional keys,
  which is very similar in practice
- tar-files are slightly easier to manage from the command line; tools
  to do so have existed for a long time and are already installed (jq
  isn't pulled in by @system these days, I think)
- tar-files can easily (optionally) be compressed while retaining
  streaming abilities (which is very likely to pay off for these
  usages), with a much higher dictionary benefit for a single tar vs.
  many files
- a single tar-file is much more efficient to GPG-sign (which would
  allow securing the VDB somewhat, though I'm not sure how useful that
  is)
- going back to the first point, a vdb entry from a binary package could
  simply be dropped into the vdb tar, and vice versa
- going back to metadata, dep-resolving could simply load all
  available/installed packages on the system into memory with two reads
  (if there is enough memory -- pretty common these days), which should
  allow for vast speedups, especially on cold(ish) filesystems; see the
  first sketch after this list
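
To give an idea of what that last point could look like, here is a
minimal sketch using only Python's standard-library tarfile module; the
path /var/db/pkg.tar and the member layout are hypothetical, not
anything Portage does today:

import tarfile

def load_vdb(path="/var/db/pkg.tar"):
    """Stream every member of the VDB tar into memory in one pass."""
    vdb = {}
    # "r|*" forces sequential streaming (no seeking) and transparently
    # handles gzip/bzip2/xz compression from the standard library.
    with tarfile.open(path, mode="r|*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # names assumed to look like "app-shells/bash-5.1_p16/EAPI"
            cpv, _, key = member.name.rpartition("/")
            vdb.setdefault(cpv, {})[key] = tar.extractfile(member).read()
    return vdb

# e.g. load_vdb()["app-shells/bash-5.1_p16"]["EAPI"] -> b"8\n"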
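
And a rough sketch of the "atomic" update mentioned above: rebuild the
archive with the new entry into a temporary file and rename it over the
old one (names again hypothetical; locking and error handling omitted):

import io
import os
import tarfile

def add_vdb_entry(vdb_path, cpv, files):
    """files maps key name to bytes, e.g. EAPI -> its contents."""
    tmp_path = vdb_path + ".tmp"
    with tarfile.open(vdb_path, mode="r") as old, \
         tarfile.open(tmp_path, mode="w") as new:
        # copy all existing members over unchanged
        for member in old:
            data = old.extractfile(member) if member.isfile() else None
            new.addfile(member, data)
        # append the files of the new entry
        for key, content in files.items():
            info = tarfile.TarInfo(name=f"{cpv}/{key}")
            info.size = len(content)
            new.addfile(info, io.BytesIO(content))
    # readers only ever see the old or the complete new archive
    os.replace(tmp_path, vdb_path)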

> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?

Here I believe that with the tar format, the code could initially be
written such that, instead of accessing a file directly, it opens up the
tar-file, locates the member it needs, and retrieves that instead.  This
is a bit naive, but probably sort of manageable, and allows having a
switch that specifies which format to write.  It's easy to detect
automatically which form you have, so nothing has to change for users
unless they actively opt in.
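
For example, a reading shim could be as simple as the sketch below (the
function name and the tar location are made up for illustration; real
code would of course live behind Portage's existing vardb API):

import os
import tarfile

VDB_TAR = "/var/db/pkg.tar"   # assumed location of a tar-based VDB
VDB_DIR = "/var/db/pkg"       # traditional one-file-per-variable layout

def read_vdb_key(cpv, key):
    """Return e.g. the EAPI contents for a category/package-version."""
    if os.path.exists(VDB_TAR):
        with tarfile.open(VDB_TAR, mode="r") as tar:
            return tar.extractfile(f"{cpv}/{key}").read()
    with open(os.path.join(VDB_DIR, cpv, key), "rb") as f:
        return f.read()

# read_vdb_key("app-shells/bash-5.1_p16", "EAPI") works either way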

Like you, I think the main reason for doing this should be performance,
basically allowing faster operations.

I feel, though, that we should aim for a single solution to maintain the
various "trees" that we have: metadata, vdb, Packages/binpkgs, since
they all seem to exhibit similar (I/O) behaviour when used.

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level
