On Sat, Nov 10, 2018 at 8:09 AM Michał Górny <mgo...@gentoo.org> wrote:

> Hi, everyone.
>
> The Gentoo's tbz2/xpak package format is quite old.  We've made a few
> incompatible changes in the past (most notably, allowing non-bzip2
> compression and multi-instance naming) but the core design stayed
> the same.  I think we should consider changing it, for the reasons
> outlined below.
>
> The rough format description can be found in xpak(5).  Basically, it's
> a regular compressed tarball with binary metadata blob appended
> to the end.  As such, it looks like a regular compressed tarball
> to the compression tools (with some ignored junk at the end).
> The metadata is entirely custom format and needs dedicated tools
> to manipulate.
>
>
> The current format has a few advantages whose preserving would probably
> be worthwhile:
>
> + The binary package is a single flat file.
>
> + It is reasonably compatible with regular compressed tarball,
> so the users can unpack it using standard tools (except for metadata).
>
> + The metadata is uncompressed and can be quickly found without touching
> the compressed data.
>
> + The metadata can be updated (e.g. as result of pkgmove) without
> touching the compressed data.
>
>
> However, it has a few disadvantages as well:
>
> - The metadata is entirely custom binary format, requiring dedicated
> tools to read or edit.
>
> - The metadata format is relying on customary behavior of compression
> tools that ignore junk following the compressed data.
>

I agree this is a problem in theory, but I haven't seen it as a problem in
practice. Have you observed any problems around this setup?


>
> - By placing the metadata at the end of file, we make it rather hard to
> read the metadata from remote location (via FTP, HTTP) without fetching
> the whole file.  [NB: it's technically possible but probably not worth
> the effort]


> - By requiring the custom format to be at the end of file, we make it
> impossible to trivially cover it with a OpenPGP signature without
> introducing another custom format.
>

It's trivial to cover with a detached sig, no?


>
> - While the format might allow for some extensibility, it's rather
> evolutionary dead end.
>

I'm not even sure how to quantify this; it just sounds like your subjective
opinion (which is fine, but it's not factual).


>
>
> I think the key points of the new format should be:
>
> 1. It should reuse common file formats as much as possible, with
> inventing as little custom code as possible.
>
> 2. It should allow for easy introspection and editing by users without
> dedicated tools.
>

So I'm less confident in the editing use cases; do users edit their binpkgs
on a regular basis?


>
> 3. The metadata should allow for lookup without fetching the whole
> binary package.
>
> 4. The format should allow for some extensions without having to
> reinvent the wheel every time.
>
> 5. It would be nice to preserve the existing advantages.
>
>
> My proposal
> ===========
>
> Basic format
> ------------
> The base of the format is a regular compressed tarball.  There's no junk
> appended to it but the metadata is stored inside it as
> /var/db/pkg/${PF}.  The contents are as compatible with the actual vdb
> format as possible.
>

Just to clarify, you are suggesting we store the metadata inside the
contents of the binary package itself (i.e. alongside the other files that
get merged to the liveFS)? What about collisions?

E.g. if I install 'machine-images/gentoo-disk-image-1.2.3' on a machine that
already has 'machine-images/gentoo-disk-image-1.2.3' installed, won't it
overwrite files in the VDB at qmerge time?


>
> This has the following advantages:
>
> + Binary package is still stored as a single file.
>
> + It uses a standard compressed .tar format, with minimal customization.
>
> + The user can easily inspect and modify the packages with standard
> tools (tar and the compressor).
>
> + If we can maintain reasonable level of vdb compatibility, the user can
> even emergency-install a package without causing too much hassle (as it
> will be recorded in vdb); ideally Portage would detect this vdb entry
> and support fixing the install afterwards.
>

I'm not certain this is really desired.


>
>
> Optimizing for easy recognition
> -------------------------------
> In order to make it possible for magic-based tools such as file(1) to
> easily distinguish Gentoo binary packages from regular tarballs, we
> could (ab)use the volume label field, e.g. use:
>
>   $ tar -V 'gpkg: app-foo/bar-1' -c ...
>
> This will add a volume label as the first file entry inside the tarball,
> which does not affect extracting but can be trivially matched via magic
> rules.
>
> Note: this is meant to be used as a method for fast binary package
> recognition; I don't think we should reject (hand-modified) binary
> packages that lack this label.
>
>
> Optimizing for metadata reading/manipulation performance
> --------------------------------------------------------
> The main problem with using a single tarball for both metadata and data
> is that normally you'd have to decompress everything to reliably unpack
> metadata, and recompress everything to update it.  This problem can be
> addressed by a few optimization tricks.
>

These performance goals seem somewhat ill-defined.

1) Where are users reporting slowness in binpkg operations?
2) What is the cause of the slowness?

I could easily see a potential user with many large binpkgs for whom the
current implementation causes issues, because they have to decompress and
seek a bunch to read the metadata out of their 1.2GB binpkg. But I'm pretty
sure this isn't most users.
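
For reference, a minimal sketch of what locating the trailing metadata
involves, assuming the layout as I understand it from xpak(5):
[tarball]XPAKPACK + index_len + data_len + index + data + XPAKSTOP +
xpak_len + STOP, with all integers 4 bytes big-endian. The function name
read_xpak is made up for illustration and this is not Portage's own code.

```python
import io
import struct


def read_xpak(fileobj):
    """Return {name: bytes} from the xpak segment at the end of a tbz2 file."""
    # The last 16 bytes are assumed to be "XPAKSTOP" + xpak_len + "STOP".
    fileobj.seek(-16, io.SEEK_END)
    trailer = fileobj.read(16)
    if trailer[:8] != b"XPAKSTOP" or trailer[12:] != b"STOP":
        raise ValueError("no xpak trailer found")
    (xpak_len,) = struct.unpack(">I", trailer[8:12])

    # xpak_len spans "XPAKPACK" through "XPAKSTOP"; the 4-byte length and
    # "STOP" follow it, hence the extra 8 bytes when seeking back.
    fileobj.seek(-(xpak_len + 8), io.SEEK_END)
    magic, index_len, data_len = struct.unpack(">8sII", fileobj.read(16))
    if magic != b"XPAKPACK":
        raise ValueError("bad xpak header")
    index = fileobj.read(index_len)
    data = fileobj.read(data_len)

    # Each index entry: name_len, name, data_offset, data_len.
    entries = {}
    pos = 0
    while pos < index_len:
        (name_len,) = struct.unpack_from(">I", index, pos)
        pos += 4
        name = index[pos:pos + name_len].decode("utf-8")
        pos += name_len
        offset, length = struct.unpack_from(">II", index, pos)
        pos += 8
        entries[name] = data[offset:offset + length]
    return entries
```

Under these assumptions the compressed tarball is never read; only the
trailer and the xpak segment are touched, which needs a seekable file.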


>
> Firstly, all metadata files are packed to the archive before data files.
>  With a slightly customized unpacker, we can stop decompressing as soon
> as we're past metadata and avoid decompressing the whole archive.  This
> will also make it possible to read metadata from remote files without
> fetching far past the compressed metadata block.
>

So this seems to basically go against your goals of simple common tooling?
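
To make the "slightly customized unpacker" concrete, here is a minimal
sketch of an early-stopping reader built only on Python's standard tarfile
module in streaming mode. It assumes the metadata members sit under
var/db/pkg/ at the front of the archive, as proposed; read_metadata and
METADATA_PREFIX are hypothetical names, not part of any existing tool.

```python
import io
import tarfile

METADATA_PREFIX = "var/db/pkg/"  # assumed location, taken from the proposal


def read_metadata(fileobj, prefix=METADATA_PREFIX):
    """Return {member name: bytes} for the leading metadata members only."""
    metadata = {}
    # "r|*" is a forward-only stream with transparent decompression, so
    # nothing beyond the point where we stop is decompressed.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        for member in archive:
            name = member.name
            if name.startswith("./"):
                name = name[2:]
            if not name.startswith(prefix):
                if metadata:
                    break  # past the metadata block: stop reading
                continue  # e.g. a volume label or parent directories
            if member.isfile():
                metadata[name] = archive.extractfile(member).read()
    return metadata
```

Since it never seeks, the same function should work on any file-like
object with a read() method, for example an HTTP response body.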


>
> Secondly, if we're up for some more tricks, we could technically split
> the tarball into metadata and data blocks compressed separately.  This
> will need a bit of archiver customization but it will make it possible
> to decompress the metadata part without even touching compressed data,
> and to replace it without recompressing data.
>
> What's important is that both tricks proposed maintain backwards
> compatibility with regular compressed tarballs.  That is, the user will
> still be able to extract it with regular archiving tools.


So my recollection is that Debian uses common-format ar files for the main
deb. Inside, they have two compressed tarballs, one for metadata and one
for data.

This format seems to jibe with many of your requirements:

 - 'ar' can retrieve individual files from the archive.
 - The deb file itself is not compressed, but the tarballs inside *are*
compressed.
 - The metadata and data are compressed separately.
 - Anyone can edit this with normal tooling (ar, tar)

In short: why should we invent a new format?
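
As an illustration of how little code the container needs, here is a
minimal sketch that pulls one member out of a common-format ar archive
without reading the others. It assumes the usual layout (8-byte global
magic, 60-byte member headers, data padded to an even length); the
function name read_ar_member is made up for this example.

```python
import io

AR_MAGIC = b"!<arch>\n"
HEADER_SIZE = 60  # name(16) mtime(12) uid(6) gid(6) mode(8) size(10) magic(2)


def read_ar_member(fileobj, wanted):
    """Return the bytes of member `wanted`, seeking past all other members."""
    if fileobj.read(len(AR_MAGIC)) != AR_MAGIC:
        raise ValueError("not an ar archive")
    while True:
        header = fileobj.read(HEADER_SIZE)
        if not header:
            raise KeyError(wanted)
        if len(header) != HEADER_SIZE or header[58:60] != b"`\n":
            raise ValueError("corrupt ar member header")
        name = header[:16].decode("ascii").rstrip(" ")
        # GNU ar terminates short names with "/"; plain names are accepted too.
        if len(name) > 1 and name.endswith("/"):
            name = name[:-1]
        size = int(header[48:58].decode("ascii"))
        if name == wanted:
            return fileobj.read(size)
        # Skip the data plus the pad byte that follows odd-sized members.
        fileobj.seek(size + size % 2, io.SEEK_CUR)
```

The returned bytes of the metadata tarball can then be handed to any
regular tar implementation, and the data tarball is never touched.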


>
>
> Adding OpenPGP signatures
> -------------------------
> This is the main XXX here.
>
> Technically, the most obvious solution is to cover the entire tarball
> with OpenPGP signature.  However, this has the disadvantage that
> the verification requires fetching the whole file.
>
> I will look into possibility of having partial signatures.
>
>
> --
> Best regards,
> Michał Górny
>
