Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Joshua Kinard
On 4/11/2022 15:20, Sid Spry wrote:
> On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
>>
>> I think json is the best format for storing the data on-disk.  It's intended
>> to be a data serialization format to convert data from a non-specific memory
>> format to a storable on-disk format and back again, so this is a perfect use
>> for it.
> 
> Can we avoid adding another format? I find json very hard to edit by hand,
> it's good at storing lots of data in a quasi-textual format, but is strict
> enough to be obnoxious to work with.

I sympathize with you on this; json is not a format geared toward editing
by hand...  :: looks disapprovingly at net-misc/kea ::

> Can the files not be concatenated? Doing so is similar to the tar suggestion,
> but would keep everything very portage-like. Have the contents assigned to
> variables. I am betting someone tried this at the start but settled on the
> current scheme. Does anyone know why? (This would have to be done in bash
> syntax I assume.)
> 
> Alternatively, I think the tar suggestion is quite elegant. There's streaming
> decompressors you can use from python. It adds an extra step to modify but
> that could be handled transparently by a dev mode. In dev mode, leave the
> files after extraction and do not re-extract, for release mode replace the
> archive with what is on disk.

Out of curiosity, what are you doing that requires manual editing of the VDB
data?  That data isn't, in normal scenarios, supposed to be arbitrarily
edited.  Throwing out a good optimization for what sounds like a niche
corner-case doesn't seem like a great plan.  Given json's malleable format,
and especially Matt's example script, converting from json data to another
format that's more conducive to manual editing in rare circumstances, then
converting back, is not impossible.
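
As a rough illustration of that round trip (this is not Matt's vdb.py, just a
minimal sketch), the following assumes a package's data is a flat JSON object
mapping variable names to string values.  The function names explode() and
implode() are made up for the example.

```python
import json
from pathlib import Path


def explode(json_path: Path, edit_dir: Path) -> None:
    """Write each top-level key of the JSON file to its own text file."""
    # Assumption: the JSON document is a flat {name: string} mapping.
    data = json.loads(json_path.read_text(encoding="utf-8"))
    edit_dir.mkdir(parents=True, exist_ok=True)
    for key, value in data.items():
        (edit_dir / key).write_text(value, encoding="utf-8")


def implode(edit_dir: Path, json_path: Path) -> None:
    """Collect the (possibly hand-edited) files back into one JSON file."""
    data = {
        path.name: path.read_text(encoding="utf-8")
        for path in sorted(edit_dir.iterdir())
        if path.is_file()
    }
    json_path.write_text(json.dumps(data, indent=1), encoding="utf-8")
```

A user would run explode(), edit the individual files with any text editor,
then run implode() to regenerate the JSON.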

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Sid Spry
On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>> 
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K    .
>> $ du -sh .
>> 556K    .
>> 
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>> 
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>> 
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>> 
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>> 
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>> 
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (yaml is not and needs a
>> third-party package such as PyYAML; toml will be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>> 
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>> 
>> Thoughts?
>
> I think json is the best format for storing the data on-disk.  It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand, it's
good at storing lots of data in a quasi-textual format, but is strict enough to
be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like. Have the contents assigned to
variables. I am betting someone tried this at the start but settled on the
current scheme. Does anyone know why? (This would have to be done in bash syntax
I assume.)
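
To make the concatenation idea concrete, here is a minimal sketch that writes
every VDB variable as a shell-style NAME='value' assignment into one string
and parses it back. It relies on Python's shlex module for the quoting, and
the function names dump_vars() and load_vars() are hypothetical. It does not
invoke bash, so it only covers plain quoted assignments.

```python
import shlex


def dump_vars(entries: dict[str, str]) -> str:
    """Serialize entries as shell-style NAME=value assignments."""
    # shlex.quote() adds single quotes whenever the value needs them,
    # so embedded newlines, spaces and quotes survive.
    return "".join(
        f"{name}={shlex.quote(value)}\n" for name, value in entries.items()
    )


def load_vars(text: str) -> dict[str, str]:
    """Parse the output of dump_vars() back into a dict.

    Nothing is expanded or executed: text such as $FOO or $(cmd)
    inside a value is kept literally.
    """
    entries = {}
    for token in shlex.split(text):
        name, sep, value = token.partition("=")
        if not sep:
            raise ValueError(f"not an assignment: {token!r}")
        entries[name] = value
    return entries
```

Multi-line values (CONTENTS, for example) end up as a single quoted string
spanning several lines, which stays readable but is easy to break when edited
by hand.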

Alternatively, I think the tar suggestion is quite elegant. There's streaming
decompressors you can use from python. It adds an extra step to modify but
that could be handled transparently by a dev mode. In dev mode, leave the files
after extraction and do not re-extract, for release mode replace the archive
with what is on disk.
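
For reference, reading such an archive in a single forward pass could look
roughly like this with the standard tarfile module. This is only a sketch:
gzip is an arbitrary choice of compression here, and pack() and
unpack_stream() are made-up names.

```python
import io
import tarfile


def pack(entries: dict[str, bytes]) -> bytes:
    """Build a gzip-compressed tar archive with one member per VDB file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, payload in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def unpack_stream(fileobj) -> dict[str, bytes]:
    """Read every regular member in one forward pass, without seeking."""
    entries = {}
    # "r|*" selects tarfile's stream mode with transparent decompression.
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                entries[member.name] = tar.extractfile(member).read()
    return entries
```

In stream mode the members have to be consumed in archive order, which is
fine for loading a whole package entry but not for picking out one variable.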

Sid.



Re: [gentoo-portage-dev] Changing the VDB format

2022-04-11 Thread Joshua Kinard
On 3/13/2022 21:06, Matt Turner wrote:
> The VDB uses a one-file-per-variable format. This has some
> inefficiencies, with many file systems. For example the 'EAPI' file
> that contains a single character will consume a 4K block on disk.
> 
> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END { print sum }'
> 418517
> $ du -sh --apparent-size .
> 413K    .
> $ du -sh .
> 556K    .
> 
> During normal operations, portage has to read each of these 35+
> files/package individually.
> 
> I suggest that we change the VDB format to a commonly used format that
> can be quickly read by portage and any other tools. Combining these
> 35+ files into a single file with a commonly used format should:
> 
> - speed up vdb access
> - improve disk usage
> - allow external tools to access VDB data more easily
> 
> I've attached a program that prints the VDB contents of a specified
> package in different formats: json, toml, and yaml (and also Python
> PrettyPrinter, just because). I think it's important to keep the VDB
> format as plain-text for ease of manipulation, so I have not
> considered anything like sqlite.
> 
> I expected to prefer toml, but I actually find it to be rather gross looking.

Agreed, the toml output is rather "cluttered" looking.


> I recommend json and think it is the best choice because:
> 
> - json provides the smallest on-disk footprint
> - json is part of Python's standard library (yaml is not and needs a
> third-party package such as PyYAML; toml will be in Python 3.11)
> - Every programming language has multiple json parsers
> -- lots of effort has been spent making them extremely fast.
> 
> I think we would have a significant time period for the transition. I
> think I would include support for the new format in Portage, and ship
> a tool with portage to switch back and forth between old and new
> formats on-disk. Maybe after a year, drop the code from Portage to
> support the old format?
> 
> Thoughts?

I think json is the best format for storing the data on-disk.  It's intended
to be a data serialization format to convert data from a non-specific memory
format to a storable on-disk format and back again, so this is a perfect use
for it.

That said, I actually do like the yaml output as well, but I think the
better use-case for that would be in the form of a secondary tool that maybe
could be a part of the portage-utils 'q' commands (qpkg, qfile, qlist, etc) to
read the JSON-formatted VDB data and export it in yaml for review.  Something
like 'qvdb --yaml sys-libs/glibc-2.35-r2' to dump the VDB data to stdout
(and maybe do other tasks, but that's a discussion for another thread).

As far as support for the old format goes, I think one year is too short.
Two years is preferable, though I would not be totally opposed to as long as
three years.  Adoption could probably be helped by turning this vdb.py
script into something more functional that can actually walk the current VDB
and convert it to the new chosen format and write that out to an alternate
location that a user could then transplant into /var/db/pkg after verifying it.
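
A converter along those lines might start out like the sketch below.  It is
not based on the actual vdb.py; it assumes the usual
<root>/<category>/<package>/ layout and a one-JSON-file-per-package target,
and the names read_package() and convert_tree() are made up.  How binary
files should be stored is an open question, so base64 is used here purely as
a placeholder.

```python
import base64
import json
from pathlib import Path


def read_package(pkg_dir: Path) -> dict:
    """Load every regular file in one package directory into a dict."""
    data = {}
    for path in sorted(pkg_dir.iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        try:
            data[path.name] = raw.decode("utf-8")
        except UnicodeDecodeError:
            # Assumption: files that are not valid UTF-8 are kept as
            # base64 so that the result is still valid JSON.
            data[path.name] = {"base64": base64.b64encode(raw).decode("ascii")}
    return data


def convert_tree(vdb_root: Path, out_root: Path) -> int:
    """Write <out_root>/<category>/<package>.json for each package found.

    The source tree is only read, never modified.  Returns the number
    of packages converted.
    """
    count = 0
    for pkg_dir in sorted(vdb_root.glob("*/*")):
        if not pkg_dir.is_dir():
            continue
        target = out_root / pkg_dir.parent.name / (pkg_dir.name + ".json")
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(
            json.dumps(read_package(pkg_dir), sort_keys=True), encoding="utf-8"
        )
        count += 1
    return count
```

Because the output goes to a separate directory, the user can inspect or diff
it before moving anything into /var/db/pkg.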

One other thought -- I think there should be a tuning knob in make.conf to
enable compression of the VDB's new format or not.  The specific compression
format I leave up for debate (I'd say go with zstd, though), but if I am
running on a filesystem that supports native compression (e.g., ZFS), I'd
want to turn VDB compression off and let ZFS handle that at the filesystem
level.  But on another system with say, XFS, I'd want to turn that on to get
some benefits, especially on older hardware that's going to be more I/O bound.
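
To sketch what such a knob could mean on the reading and writing side: the
boolean below stands for a hypothetical make.conf setting (no such option
exists today), and gzip from the standard library stands in for zstd only to
keep the example dependency-free.  save_entry() and load_entry() are made-up
names.

```python
import gzip
import json
from pathlib import Path


def save_entry(data: dict, base: Path, compress: bool) -> Path:
    """Write data to <base>.json, or <base>.json.gz when compress is set."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    if compress:
        path = base.with_name(base.name + ".json.gz")
        path.write_bytes(gzip.compress(payload))
    else:
        path = base.with_name(base.name + ".json")
        path.write_bytes(payload)
    return path


def load_entry(path: Path) -> dict:
    """Read an entry written by save_entry(), choosing the codec by suffix."""
    raw = path.read_bytes()
    if path.suffix == ".gz":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

Selecting the codec from the file suffix means a VDB with mixed compressed and
uncompressed entries stays readable after the setting is flipped.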

E.g., in JSON format, sys-libs/glibc-2.35-r2 clocks in at ~344 KiB:

# ./vdb.py --json /var/db/pkg/sys-libs/glibc-2.35-r2 > glibc.json
# ls -lh --block-size=1 glibc.json
-rw-r--r-- 1 root root 352479 Apr 11 14:53 glibc.json

# zstd glibc.json
glibc.json   : 21.70%   (   344 KiB =>   74.7 KiB, glibc.json.zst)

# ls -lh --block-size=1 glibc.json.zst
-rw-r--r-- 1 root root 76498 Apr 11 14:53 glibc.json.zst

(this is on a tmpfs-formatted ramdrive)

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic