On Mon, Apr 11, 2022, at 3:02 PM, Joshua Kinard wrote:
> On 3/13/2022 21:06, Matt Turner wrote:
>> The VDB uses a one-file-per-variable format. This has some
>> inefficiencies, with many file systems. For example the 'EAPI' file
>> that contains a single character will consume a 4K block on disk.
>>
>> $ cd /var/db/pkg/sys-apps/portage-3.0.30-r1/
>> $ ls -lh --block-size=1 | awk 'BEGIN { sum = 0; } { sum += $5; } END {
>> print sum }'
>> 418517
>> $ du -sh --apparent-size .
>> 413K .
>> $ du -sh .
>> 556K .
>>
>> During normal operations, portage has to read each of these 35+
>> files/package individually.
>>
>> I suggest that we change the VDB format to a commonly used format that
>> can be quickly read by portage and any other tools. Combining these
>> 35+ files into a single file with a commonly used format should:
>>
>> - speed up vdb access
>> - improve disk usage
>> - allow external tools to access VDB data more easily
>>
>> I've attached a program that prints the VDB contents of a specified
>> package in different formats: json, toml, and yaml (and also Python
>> PrettyPrinter, just because). I think it's important to keep the VDB
>> format as plain-text for ease of manipulation, so I have not
>> considered anything like sqlite.
>>
>> I expected to prefer toml, but I actually find it to be rather gross looking.
>
> Agreed, the toml output is rather "cluttered" looking.
>
>
>> I recommend json and think it is the best choice because:
>>
>> - json provides the smallest on-disk footprint
>> - json is part of Python's standard library (so is yaml, and toml will
>> be in Python 3.11)
>> - Every programming language has multiple json parsers
>> -- lots of effort has been spent making them extremely fast.
>>
>> I think we would have a significant time period for the transition. I
>> think I would include support for the new format in Portage, and ship
>> a tool with portage to switch back and forth between old and new
>> formats on-disk. Maybe after a year, drop the code from Portage to
>> support the old format?
>>
>> Thoughts?
>
> I think json is the best format for storing the data on-disk. It's intended
> to be a data serialization format to convert data from a non-specific memory
> format to a storable on-disk format and back again, so this is a perfect use
> for it.

Can we avoid adding another format? I find json very hard to edit by hand.
It's good at storing lots of data in a quasi-textual format, but it is strict
enough to be obnoxious to work with.

Can the files not be concatenated? Doing so is similar to the tar suggestion,
but would keep everything very portage-like: have the contents assigned to
variables. I am betting someone tried this at the start but settled on the
current scheme. Does anyone know why? (This would have to be done in bash
syntax, I assume.)
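
To make the concatenation idea concrete, here is a rough sketch in Python. It
is only an illustration under my own assumptions: the function names
pack_vars and unpack_vars are made up, entries whose file names are not valid
variable names (environment.bz2, the ebuild copy) are simply skipped, and the
remaining files are assumed to be text. Values are quoted with shlex.quote so
the result is valid bash syntax, and it can be read back without running a
shell.

```python
import shlex
from pathlib import Path


def pack_vars(pkg_dir: Path) -> str:
    """Concatenate one-file-per-variable entries into bash-style assignments."""
    lines = []
    for entry in sorted(pkg_dir.iterdir()):
        # Only plain files whose names look like variable names qualify;
        # anything else (e.g. environment.bz2) is skipped in this sketch.
        if not entry.is_file() or not entry.name.isidentifier():
            continue
        value = entry.read_text().rstrip("\n")
        # shlex.quote leaves simple values bare and single-quotes the rest.
        lines.append(f"{entry.name}={shlex.quote(value)}")
    return "\n".join(lines) + "\n"


def unpack_vars(text: str) -> dict[str, str]:
    """Parse the assignments back into a dict without invoking bash."""
    result = {}
    # shlex.split honours shell quoting, so multi-line values stay intact.
    for token in shlex.split(text):
        name, _, value = token.partition("=")
        result[name] = value
    return result
```

Calling pack_vars(Path("/var/db/pkg/sys-apps/portage-3.0.30-r1")) would give
one file that can still be edited by hand or sourced from bash, which is the
property I care about.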

Alternatively, I think the tar suggestion is quite elegant. There are
streaming decompressors you can use from Python. It adds an extra step to
modify the data, but that could be handled transparently by a dev mode: in
dev mode, leave the files in place after extraction and do not re-extract;
in release mode, replace the archive with what is on disk.
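
For the tar variant, a minimal sketch of what reading could look like with
the standard tarfile module in stream mode. The helper names pack_entries and
read_entries are hypothetical, gzip is just a placeholder for whatever
compression would be chosen, and the archive is kept in memory here only to
keep the example self-contained.

```python
import io
import tarfile


def pack_entries(entries: dict[str, bytes]) -> bytes:
    """Build a gzip-compressed tar archive in memory, one member per VDB file."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in entries.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def read_entries(archive: bytes) -> dict[str, bytes]:
    """Read every member in one sequential pass, without extracting to disk."""
    result = {}
    # "r|*" is tarfile's stream mode with transparent compression detection.
    with tarfile.open(fileobj=io.BytesIO(archive), mode="r|*") as tar:
        for member in tar:
            if member.isfile():
                # In stream mode each member must be read before moving on.
                result[member.name] = tar.extractfile(member).read()
    return result
```

A round trip such as read_entries(pack_entries({"EAPI": b"8\n"})) returns the
same mapping, so everything for a package comes back from a single read
instead of 35+ separate ones.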
Sid.