> Sent: Friday, November 16, 2018 at 10:07 PM
> From: "Jonathan Dieter" <jdie...@gmail.com>
> To: "Development discussions related to Fedora" 
> <devel@lists.fedoraproject.org>
> Subject: Proposal: Faster composes by eliminating deltarpms and using 
> zchunked rpms instead
>
> For reference, this is in reply to Paul's email about lifecycle
> objectives, specifically focusing on problem statement #1[1].
> 
> <tl;dr>
> Have rpm use zchunk as its compression format, removing the need for
> deltarpms, and thus reducing compose time.  This will require changes
> to both the rpm format and new features in the zchunk format.
> </tl;dr>
> 
> *deltarpm background*
> As part of the compose process, deltarpms are generated between each
> new rpm and both the GA version of the rpm and the previous version. 
> This process is very CPU and memory intensive, especially for large
> rpms.
> 
> This also means that deltarpms are only useful for an end user if they
> are either updating from GA or have been diligent about keeping their
> system up-to-date.  If a user is updating a package from N-2 to N,
> there will be no deltarpm and the full rpm will be downloaded.
> 
> *zchunk background*
> As some are aware, I've been working on zchunk[2], a compression format
> that's designed for highly efficient deltas, and using it minimize
> metadata downloads[3].
> 
> The core idea behind zchunk is that a file is split into independently
> compressed chunks and the checksum of each compressed chunk is stored
> in the zchunk header.  When downloading a new version of the file, you
> download the zchunk header first, check which chunks you already have,
> and then download the rest.
> 
> *Proposal*
> My proposal would be to make zchunk the rpm compression format for
> Fedora.  This would involve a few additions to the zchunk format[4]
> (something the format has been designed to accommodate), and would
> require some changes to the rpm file format.
> 
> *Benefit*
> The benefit of zchunked rpms is that, when downloading an updated rpm,
> you would only need to download the chunks that have changed from
> what's on your system.
> 
> The uncompressed local chunks would be combined with the downloaded
> compressed chunks to create a local rpm that will pass signature
> verification without needing to recompress the uncompressed local
> chunks, making this computationally much faster than rebuilding a
> deltarpm, a win for users.
> 
> The savings wouldn't be as good as what deltarpm can achieve, but
> deltarpms would be redundant and could be removed, completely
> eliminating a large step from the compose process.
> 
> *Drawbacks*
>    1. Downloading a new release of a zchunked rpm would be larger than
>       downloading the equivalent deltarpm.  This is offset by the fact
>       that the client is able to work out which chunks it needs no matter
>       what the original rpm is, rather than needing a specific original
>       rpm as deltarpm does.
>    2. The rebuilt rpm may not be byte-for-byte identical to the original,
>       but will be able to be validated without decompression, as explained
>       in the next section
> 
> *Changes*
> The zchunk format would need to be extended to allow for a zchunked rpm
> to contain both the uncompressed chunks that were already on the local
> system and the newly downloaded compressed chunks while still passing
> signature verification.  This would also require moving signature
> verification to zchunk.
>  
> The rpm file format has to be changed because the zchunk header needs
> to be at the beginning of the file in order for the zchunk library
> figure out which chunks it needs to download.  My suggestions for
> changes to the rpm file format are as follows:
> 
>    1. Signing should be moved to the zchunk format as described at the
>       beginning of this section
>    2. The rpm header should be stored in one stream inside the zchunk
>       file.  This allows it to be easily extracted separately from the
>       data
>    3. The rpm cpio should be stored in a second stream inside the zchunk
>       file.
>    4. At minimum, an optional zchunk element should be set to identify
>       zchunk rpms as rpms rather than regular zchunk files.  If desired,
>       optional elements could also be set containing %{name}, %[version},
>       %{release}, %{arch} and %{epoch}.  This would allow this information
>       to be read easily without needing to extract the rpm header stream.
> 
> *Final notes*
> I realize this is a massive proposal, zchunk is still very young, and
> we're still working on getting the dnf zchunk pull requests reviewed. 
> I do think it's feasible and provides an opportunity to eliminate a
> pain point from our compose process while still reducing the download
> size for our users.
> 
> [1]: 
> https://fedoraproject.org/wiki/Objectives/Lifecycle/Problem_statements#Challenge_.231:_Faster.2C_more_scalable_composes
> [2]: https://github.com/zchunk/zchunk
> [3]: https://fedoraproject.org/wiki/Changes/Zchunk_Metadata
> [4]: https://github.com/zchunk/zchunk/blob/master/zchunk_format.txt
> _______________________________________________
> devel mailing list -- devel@lists.fedoraproject.org
> To unsubscribe send an email to devel-le...@lists.fedoraproject.org
> Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
> List Archives: 
> https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org
> 

Based on work I've done in this area, I'm somewhat sceptical that this
will work out well.  I spent a decent amount of time comparing various
approaches to data saving for both compressed data and regular binaries.
This was done a few months ago, so a few of the algos may have changed a
little since then, but I doubt the changes will be huge.

The various things I tested were, in no particular order:
 * xdelta3 (and a few similar VCDIFF based algos)
 * zstd 
 * zchunk
 * bsdiff
 * courgette
 * zucchini
And a few others which aren't interesting enough to mention here.

xdelta3 is old now, but was the standard binary delta compression tool
for a long time.  Of the better performing algos it has reasonably good 
generation time, and works reasonably well on compressed data (still a 
good way behind the top of the pile), but falls far behind on binaries. 

zstd was generally disappointing for both compressed and binary data.  It's
really a compression format, not a delta generator, and while it performed 
very well compared to traditional compression formats, it was unable to
compete with other tools here.

zchunk had basically the same problems as zstd,  except that the chunking
overhead made files even larger, and small symbol changes could mean that
a very large number of chunks needed to be transmitted.

bsdiff is a larve improvement over xdelta3 in terms of final size for
binary data.  It is considerably more expensive in terms of memory and cpu
to compute, but is fast to apply[0]. It works much less well on compressed
data, and really needs to work on a binary to be at its best.

courgette is excellent at reducing binary size, and still very good at
compressed data.  By data size alone it's the best of all of these, but is
somewhat complex and expensive, particularly in terms of memory.  An
over view of how it works is [1]

zucchini is an experimental delta generation tool for chromium.  I can't
see much that's been published externally about it, but I found it to 
compress almost as well as courgette but with greatly reduced memory use.
Code is very much a moving target, but is located at [2].




Overall in my tests zstd/schunk gave massively worse compression compared
to modern delta algorithms (courgette, zucchini), and the total time to
download, apply, install was always slowest for zstd based approaches.

So, then, the only time this would work is if the changed-chunk detection
feature of zchunk actually works effectively for RPMs.  Unfortunately,
when binaries change (and when RPMs as a whole change) it's not unusual
for this to manifest as adding/removing significant chunks of data from
near the beginning of the file, which means that the chunk matching fails
and you end up with a huge and slow download.

If you care only about making the 50th %ile case better, then zchunk is
probably at least not any worse than what we have now.  However this
will, I believe, come at the cost of making the 95th %ile case pretty
unpleasant for end users.

I think it'd be far better to explore Howard's proposal, for per-file
delta generation (as opposed to per-RPM), and use modern delta
generation (probably courgette) to compute the delta.



0: http://www.daemonology.net/bsdiff/
1: 
http://dev.chromium.org/developers/design-documents/software-updates-courgette
2: https://github.com/chromium/chromium/tree/master/components/zucchini
_______________________________________________
devel mailing list -- devel@lists.fedoraproject.org
To unsubscribe send an email to devel-le...@lists.fedoraproject.org
Fedora Code of Conduct: https://getfedora.org/code-of-conduct.html
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: 
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.org

Reply via email to