Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Joshua Kinard
On 4/11/2022 15:20, Sid Spry wrote:
> On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
>>
>> I think json is the best format for storing the data on-disk.  It's intended
>> to be a data serialization format to convert data from a non-specific memory
>> format to a storable on-disk format and back again, so this is a perfect use
>> for it.
> 
> Can we avoid adding another format? I find json very hard to edit by hand;
> it's good at storing lots of data in a quasi-textual format, but it is strict
> enough to be obnoxious to work with.

I sympathize with you on this; json is not a format geared for editing
by hand...  :: looks disapprovingly at net-misc/kea ::

> Can the files not be concatenated? Doing so is similar to the tar suggestion,
> but would keep everything very portage-like. Have the contents assigned to
> variables. I am betting someone tried this at the start but settled on the
> current scheme. Does anyone know why? (This would have to be done in bash
> syntax, I assume.)
> 
> Alternatively, I think the tar suggestion is quite elegant. There are
> streaming decompressors you can use from Python. It adds an extra step to
> modify the data, but that could be handled transparently by a dev mode: in
> dev mode, leave the files after extraction and do not re-extract; in release
> mode, replace the archive with what is on disk.

Out of curiosity, what are you doing that requires manual editing of the VDB
data?  That data isn't, in normal scenarios, supposed to be arbitrarily
edited.  Throwing out a good optimization for what sounds like a niche
corner-case doesn't seem like a great plan.  Given json's malleable format,
and especially Matt's example script, converting the json data to another
format that's more conducive to manual editing in those rare circumstances,
then converting back, is entirely doable.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Sid Spry
On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>> 
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
>> print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K.
>> $ du -sh .
>> 556K.
>> 
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>> 
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>> 
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>> 
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>> 
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>> 
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (yaml and toml need external
>> modules today, though toml support will be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>> 
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>> 
>> Thoughts?
>
> I think json is the best format for storing the data on-disk.  It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand;
it's good at storing lots of data in a quasi-textual format, but it is strict
enough to be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like. Have the contents assigned to
variables. I am betting someone tried this at the start but settled on the
current scheme. Does anyone know why? (This would have to be done in bash
syntax, I assume.)
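
To make that concrete, here is a rough sketch of the kind of concatenated,
bash-style file handling I have in mind -- the file name and helper functions
are made up, purely illustrative:

import shlex

# One concatenated metadata file per package, written as NAME=value
# assignments that are also valid bash.
def write_concatenated(path, variables):
    with open(path, 'w') as f:
        for name, value in variables.items():
            f.write(f'{name}={shlex.quote(value)}\n')

def read_concatenated(path):
    variables = {}
    with open(path) as f:
        # shlex respects the quoting, so each token is one NAME=value pair,
        # even if the value contains spaces or newlines.
        for token in shlex.split(f.read()):
            name, _, value = token.partition('=')
            variables[name] = value
    return variables

# write_concatenated('/tmp/vdb.sh', {'EAPI': '8', 'SLOT': '0'})
# read_concatenated('/tmp/vdb.sh')  # -> {'EAPI': '8', 'SLOT': '0'}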

Alternatively, I think the tar suggestion is quite elegant. There are
streaming decompressors you can use from Python. It adds an extra step to
modify the data, but that could be handled transparently by a dev mode: in
dev mode, leave the files after extraction and do not re-extract; in release
mode, replace the archive with what is on disk.
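
As a rough illustration of the streaming read I mean -- the archive name and
path are hypothetical:

import tarfile

# 'r|*' reads the archive strictly as a stream and auto-detects
# gzip/bzip2/xz compression, so no seeking is required.
def read_vdb_tar(path):
    entry = {}
    with tarfile.open(path, mode='r|*') as tf:
        for member in tf:
            if member.isfile():
                # Each member stands in for one of today's per-variable
                # files (EAPI, SLOT, CONTENTS, ...).
                entry[member.name] = tf.extractfile(member).read().decode()
    return entry

# read_vdb_tar('/var/db/pkg/sys-apps/portage-3.0.30-r1/vdb.tar.xz')['EAPI']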

Sid.



Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Joshua Kinard
On 3/13/2022 21:06, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> 
> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
> print sum }'
> 418517
> $ du -sh --apparent-size .
> 413K.
> $ du -sh .
> 556K.
> 
> During normal operations, portage has to read each of these 35+
> files/package individually.
> 
> I suggest that we change the VDB format to a commonly used format that
> can be quickly read by portage and any other tools. Combining these
> 35+ files into a single file with a commonly used format should:
> 
> - speed up vdb access
> - improve disk usage
> - allow external tools to access VDB data more easily
> 
> I've attached a program that prints the VDB contents of a specified
> package in different formats: json, toml, and yaml (and also Python
> PrettyPrinter, just because). I think it's important to keep the VDB
> format as plain-text for ease of manipulation, so I have not
> considered anything like sqlite.
> 
> I expected to prefer toml, but I actually find it to be rather gross looking.

Agreed, the toml output is rather "cluttered" looking.


> I recommend json and think it is the best choice because:
> 
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (yaml and toml need external
> modules today, though toml support will be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.
> 
> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?
> 
> Thoughts?

I think json is the best format for storing the data on-disk.  It's intended
to be a data serialization format to convert data from a non-specific memory
format to a storable on-disk format and back again, so this is a perfect use
for it.

That said, I actually do like the yaml output as well, but I think the
better use-case for that would be a secondary tool, maybe part of portage's
'q' commands (qpkg, qfile, qlist, etc.), that reads the JSON-formatted VDB
data and exports it as yaml for review.  Something like 'qvdb --yaml
sys-libs/glibc-2.35-r2' to dump the VDB data to stdout (and maybe do other
tasks, but that's a discussion for another thread).
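
A rough sketch of what such a front-end could look like, assuming the
per-package data lived in a single JSON file -- the 'vdb.json' name is made
up, and this is purely illustrative, not the actual tool:

#!/usr/bin/env python
# Hypothetical 'qvdb': read a package's single-file JSON VDB entry and
# dump it as yaml (or pretty-printed json) for human review.
import argparse
import json
import sys
from pathlib import Path

import yaml  # dev-python/pyyaml


def main():
    parser = argparse.ArgumentParser(prog='qvdb')
    parser.add_argument('--yaml', action='store_true')
    parser.add_argument('pkgdir',
                        help='e.g. /var/db/pkg/sys-libs/glibc-2.35-r2')
    opts = parser.parse_args()

    data = json.loads((Path(opts.pkgdir) / 'vdb.json').read_text())
    if opts.yaml:
        yaml.safe_dump(data, sys.stdout, default_flow_style=False)
    else:
        json.dump(data, sys.stdout, indent=2)


if __name__ == '__main__':
    main()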

As far as support for the old format goes, I think one year is too short.
Two years is preferable, though I would not be totally opposed to as long as
three years.  Adoption could probably be helped by turning this vdb.py
script into something more functional that can walk the current VDB, convert
it to the chosen new format, and write the result out to an alternate
location that a user could then transplant into /var/db/pkg after verifying it.

One other thought -- I think there should be a tuning knob in make.conf to
enable or disable compression of the VDB's new format.  The specific
compression format I leave up for debate (I'd say go with zstd, though), but
if I am running on a filesystem that supports native compression (e.g., ZFS),
I'd want to turn VDB compression off and let ZFS handle that at the
filesystem level.  But on another system with, say, XFS, I'd want to turn it
on to get some benefit, especially on older hardware that's going to be more
I/O bound.

E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~345KB:

# ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
# ls -lh --block-size=1 glibc.json
-rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json

# zstd glibc.json
glibc.json   : 21.70%   (   344 KiB =>   74.7 KiB, glibc.json.zst)

# ls -lh --block-size=1 glibc.json.zst
-rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst

(this is on a tmpfs-formatted ramdrive)
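
For illustration, roughly how a reader could cope with either form
transparently -- this assumes the dev-python/zstandard bindings and made-up
vdb.json / vdb.json.zst file names, so treat it as a sketch rather than a
proposal for the actual layout:

import json
import os

import zstandard  # dev-python/zstandard


def store_vdb_json(pkgdir, data, compress=True):
    # 'compress' is where a make.conf knob could hook in.
    payload = json.dumps(data).encode('utf-8')
    if compress:
        with open(os.path.join(pkgdir, 'vdb.json.zst'), 'wb') as f:
            f.write(zstandard.ZstdCompressor().compress(payload))
    else:
        with open(os.path.join(pkgdir, 'vdb.json'), 'wb') as f:
            f.write(payload)


def load_vdb_json(pkgdir):
    # Read whichever form is on disk, compressed or not.
    zst_path = os.path.join(pkgdir, 'vdb.json.zst')
    if os.path.exists(zst_path):
        with open(zst_path, 'rb') as f:
            return json.loads(zstandard.ZstdDecompressor().decompress(f.read()))
    with open(os.path.join(pkgdir, 'vdb.json'), 'rb') as f:
        return json.load(f)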

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-portage-dev] Changing the VDB format

2022-03-14 Thread Florian Schmaus

On 14/03/2022 13.22, Fabian Groffen wrote:
> Hi,
> 
> I've recently been thinking about this too.
> 
> On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>> I recommend json and think it is the best choice because:
> 
> [snip]
> 
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (yaml and toml need external
>> modules today, though toml support will be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
> 
> I would like to suggest using "tar".


Your idea sounds very appealing, and I am by no means an expert on the
tar file format, but
https://www.gnu.org/software/tar/manual/html_node/Standard.html states


"""
…an archive consists of a series of file entries terminated by an 
end-of-archive entry, which consists of two 512 blocks of zero bytes.

"""

and the Wikipedia entry of 'tar' [1] states

"""
Each file object includes any file data, and is preceded by a 512-byte 
header record. The file data is written unaltered except that its length 
is rounded up to a multiple of 512 bytes.

"""

and furthermore

"""
The end of an archive is marked by at least two consecutive zero-filled 
records.

"""

That sounds like a lot of overhead if no compression is involved. I'm not
sure whether this can be considered a knock-out criterion for tar.


- Flow





Re: [gentoo-portage-dev] Changing the VDB format

2022-03-14 Thread James Cloos
> "MT" == Matt Turner  writes:

MT> For example the 'EAPI' file
MT> that contains a single character will consume a 4K block on disk.

the sort of filesystems one expects to be used for /var all store short
files either directly in the inode or in the directory entry¹.

that said, Fabian’s suggestion of tar(1)ing those files sounds like a
winner.

a number of tools and editors allow r/w access to tar(5) files,
including to the individual files therein.

1] or at least unix/linux filesystems all used to

-JimC
-- 
James Cloos  OpenPGP: 0x997A9F17ED7DAEA6



Re: [gentoo-portage-dev] Changing the VDB format

2022-03-14 Thread Fabian Groffen
Hi,

I've recently been thinking about this too.

On 13-03-2022 18:06:21 -0700, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> I recommend json and think it is the best choice because:

[snip]

> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (yaml and toml need external
> modules today, though toml support will be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.

I would like to suggest using "tar".  The reasoning behind this is a bit
convoluted, but I'll try to be as clear and sound as I can:
- "new style" bin-packages use tar too
- a tar-file keeps all the individual files as members, e.g. so legacy
  tools can unpack the archive and look at the VDB that way
- a tar-file allows streaming, so a single file read suffices for
  efficient retrieval
- a single tar-file for the entire VDB allows making it "atomic"; one
  can modify tar archives later on to add new vdb entries or perform
  updates -- again, without in-place modification (e.g. with memory
  backing) this could be done atomically
- a tar-file could be used for the (rsync) tree metadata (md5-cache) in
  the same way, e.g. re-using the streaming approach, or unpacking for
  legacy tools
- a tar-file could be used for the Packages file instead of a flat file
  with keys; basically just write VDB entries with some additional keys,
  very similar in practice
- tar-files are slightly easier to manage from the command line; the
  tools to do so have existed for a long time and are already installed
  (jq isn't pulled in by @system these days, I think)
- tar-files can easily (and optionally) be compressed while retaining
  streaming abilities (for these usages this is very likely to pay off),
  with a much higher dictionary benefit for a single tar vs. many files
- a single tar-file is much more efficient to GPG-sign (which would
  allow some securing of the VDB; not sure if that is useful, though)
- going back to the first point, the vdb entry from a binary package
  could simply be dropped into the vdb tar, and vice versa
- going back to metadata, dep-resolving could simply load the entire set
  of available/installed packages into memory with two reads (if the
  system has enough memory -- pretty common these days), which should
  allow for vast speedups, especially on cold(ish) filesystems.

> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?

Here I believe that, with the tar format, code could initially be written
so that, instead of accessing a file directly, it opens up the tar-file,
locates the member it needs, and retrieves that instead.  This is a bit
naive, but probably manageable, and it allows having a switch that
specifies which format to write.  It's easy to detect automatically which
form you have, so nothing has to change for users unless they actively
opt in.
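
A naive sketch of that transitional lookup -- the per-package archive name
is made up:

import os
import tarfile


def get_vdb_value(pkgdir, name):
    archive = os.path.join(pkgdir, 'vdb.tar')
    if os.path.exists(archive):
        # New form: pull the single member out of the (optionally
        # compressed) archive.
        with tarfile.open(archive, mode='r:*') as tf:
            return tf.extractfile(name).read().decode().rstrip('\n')
    # Old form: one file per variable, read it directly.
    with open(os.path.join(pkgdir, name)) as f:
        return f.read().rstrip('\n')

# get_vdb_value('/var/db/pkg/sys-apps/portage-3.0.30-r1', 'EAPI')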

Like you, I think the main reason for doing this should be performance,
basically allowing faster operations.

I feel, though, that we should aim for a single solution to maintain the
various "trees" that we have: metadata, vdb, Packages/binpkgs, since they
all seem to exhibit similar (I/O) behaviour when being used.

Thanks,
Fabian

-- 
Fabian Groffen
Gentoo on a different level


signature.asc
Description: PGP signature


[gentoo-portage-dev] Changing the VDB format

2022-03-13 Thread Matt Turner
The VDB uses a one-file-per-variable format. This has some
inefficiencies, with many file systems. For example the 'EAPI' file
that contains a single character will consume a 4K block on disk.

$ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
$ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
print sum }'
418517
$ du -sh --apparent-size .
413K.
$ du -sh .
556K.

During normal operations, portage has to read each of these 35+
files/package individually.

I suggest that we change the VDB format to a commonly used format that
can be quickly read by portage and any other tools. Combining these
35+ files into a single file with a commonly used format should:

- speed up vdb access
- improve disk usage
- allow external tools to access VDB data more easily

I've attached a program that prints the VDB contents of a specified
package in different formats: json, toml, and yaml (and also Python
PrettyPrinter, just because). I think it's important to keep the VDB
format as plain-text for ease of manipulation, so I have not
considered anything like sqlite.

I expected to prefer toml, but I actually find it to be rather gross looking.

$ ~/vdb.py --toml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
444663
$ ~/vdb.py --yaml /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
385112
$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | wc -c
273428

toml and yaml are formatted in a human-readable manner, but json is
not. Pipe the json output to app-misc/jq to get a better sense of its
structure:

$ ~/vdb.py --json /var/db/pkg/sys-apps/portage-3.0.30-r1/ | jq
...

Compare with the raw contents of the files:

$ ls -lh --block-size=1 | grep -v
'\(environment.bz2\|repository\|\.ebuild\)' | awk 'BEGIN { sum = 0; }
{ sum += $5; } END { print sum }'
378658

Yes, the json is actually smaller because it does not contain large
amounts of duplicated path strings in CONTENTS (which is 375320 bytes
by itself, or 89% of the total size).

I recommend json and think it is the best choice because:

- json provides the smallest on-disk footprint
- json is part of Python's standard library (yaml and toml need external
modules today, though toml support will be in Python 3.11)
- Every programming language has multiple json parsers
-- lots of effort has been spent making them extremely fast.

I think we would have a significant time period for the transition. I
think I would include support for the new format in Portage, and ship
a tool with portage to switch back and forth between old and new
formats on-disk. Maybe after a year, drop the code from Portage to
support the old format?

Thoughts?
#!/usr/bin/env python

import argparse
import json
import pprint
import sys
import toml
import yaml

from pathlib import Path


def main(argv):
    pp = pprint.PrettyPrinter(indent=2)

    parser = argparse.ArgumentParser()
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('--json', action='store_true')
    group.add_argument('--toml', action='store_true')
    group.add_argument('--yaml', action='store_true')
    group.add_argument('--pprint', action='store_true')
    parser.add_argument('vdbdir', type=str)

    opts = parser.parse_args(argv[1:])

    vdb = Path(opts.vdbdir)
    if not vdb.is_dir():
        print(f'{vdb} is not a directory')
        sys.exit(-1)

    d = {}

    for file in (x for x in vdb.iterdir()):
        if not file.name.isupper():
            # print(f"Ignoring file {file.name}")
            continue

        value = file.read_text().rstrip('\n')

        if file.name == "CONTENTS":
            contents = {}

            for line in value.splitlines(keepends=False):
                (type, *rest) = line.split(sep=' ')
                parts = rest[0].split(sep='/')
                p = contents

                if type == 'dir':
                    assert(len(rest) == 1)

                    for part in parts[1:]:
                        p = p.setdefault(part, {})
                else:
                    for part in parts[1:-1]:
                        p = p.get(part)

                    if type == 'obj':
                        assert(len(rest) == 3)
                        p[parts[-1]] = {'hash': rest[1], 'size': rest[2]}
                    elif type == 'sym':
                        assert(len(rest) == 4)
                        p[parts[-1]] = {'target': rest[2], 'size': rest[3]}

            d[file.name] = contents

        elif file.name in ('DEFINED_PHASES', 'FEATURES', 'HOMEPAGE',
                           'INHERITED', 'IUSE', 'IUSE_EFFECTIVE', 'LICENSE',
                           'KEYWORDS', 'PKGUSE', 'RESTRICT', 'USE'):
            d[file.name] = value.split(' ')
        else:
            d[file.name] = value

    if opts.json:
        json.dump(d, sys.stdout)
    if opts.toml:
        toml.dump(d, sys.stdout)
    if opts.yaml:
        yaml.dump(d, sys.stdout)
    if opts.pprint:
        pp.pprint(d)


if __name__ == '__main__':
    main(sys.argv)